Supplementary Information for Initial Sequencing and Analysis of the Human Genome
International Human Genome Sequencing Consortium.
Methods and additional notes
Section: Generating the draft genome sequence (p. 864)
Subsection: Clone selection (p. 865)
Page 866 col. 2, para. 3: “Fingerprint data were reviewed … bias against rearranged clones).”
Seed clones were picked from the growing contigs as follows: We began by
identifying fingerprint clone contigs that had been localized to targeted locations and
that did not contain any clones that had previously been selected for sequencing.
Contigs were localized using mapping data from a variety of sources that could be
attached to the fingerprinted clones, including STS/hybridization data from McPherson and colleagues (ref. 86), FISH data from several sources (refs 92, 95, 103; C. McPherson et al., ref. 103), STS/PCR mapping data from several sources, electronic PCR data
(http://www.ncbi.nlm.nih.gov/STS/) matching the BAC end sequences with mapped STSs
and others. Beginning with the largest available clone in a valid contig (clones >250 kb were excluded to avoid artifacts), the FPC program (ref. 451) evaluated the fingerprints
of all of the clones in the contig to determine the largest clone for which all (but 2) of the
individual bands in the restriction fragment pattern were shared
(confirmed; having a band of equivalent size ±3%) with bands in the patterns of
flanking clones (again, ignoring flanking clones >250 kb). (Since the
restriction enzyme used to produce the clone inserts is different than the enzyme
used to produce the fingerprints, two bands may arise from the insert-vector junction,
which are not found in the genome or in flanking clones.) Selected clones were then
checked for excessive overlap with previously selected or sequenced clones and
with each other. The allowable overlap at this stage was varied to suit the demands
of the project.
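
The band-confirmation test for seed clones can be sketched as follows. This is a minimal illustration and not the FPC implementation: the function names are hypothetical, the ±3% tolerance is assumed to be relative to the size of the band being tested, and the bands of all flanking clones are pooled.

```python
def band_confirmed(band, neighbour_bands, tolerance=0.03):
    """True if some neighbouring band has an equivalent size (within +/- tolerance)."""
    return any(abs(band - other) <= tolerance * band for other in neighbour_bands)


def unconfirmed_bands(clone_bands, flanking_clones, tolerance=0.03):
    """Bands of the clone that no flanking clone confirms."""
    # Assumption: bands from all flanking clones are pooled before comparison.
    pooled = [b for bands in flanking_clones for b in bands]
    return [b for b in clone_bands if not band_confirmed(b, pooled, tolerance)]


def is_seed_candidate(clone_bands, flanking_clones, allowed_unconfirmed=2):
    """All but two bands must be confirmed (two may come from insert-vector junctions)."""
    return len(unconfirmed_bands(clone_bands, flanking_clones)) <= allowed_unconfirmed
```

For walking clones, the same check would be called with allowed_unconfirmed=3 and with the flanking set restricted to clones within four positions on either side.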
Clones (walking clones) extending from seed or other selected clones were selected
as follows: In the early phases of the effort, clones were not necessarily correctly
ordered within a fingerprint clone contig and indeed not all of the available clones
had necessarily been incorporated into the contig. Starting with a previously
selected (seed) clone, the FPC program compared the restriction fragment pattern of
that clone with the patterns of all of the clones in the fingerprint database that
overlapped with the seed clone. It then iteratively analyzed the clones identified in
the first round of analysis to identify the additional clones that overlapped with those.
In this way, a set of overlapping clones was identified and the clones in the set were
ordered based on their overlap statistics. After ordering, all of the valid clones were
identified (valid clones were defined as those with all but three of their bands
confirmed by clones within 4 clones on either side). Any clone that also had outside
evidence of overlap, e.g. through BAC end sequence matches or shared
STS/hybridization data was selected for further evaluation. In cases with more than
one clone with such outside evidence, the clone with the lowest overlap statistic (i.e.,
the one that was least redundant) was selected (in the case of ties, the largest clone
was favored). Where there was no outside evidence, a clone was picked based on
evaluation of the overlaps. The candidate clone was the first one that was found to
have the minimal overlap with the seed clone (initially <20% overlap, rising to 30% in
later phases of the mapping effort; the percentage overlap was estimated by dividing
the sum of the sizes of the common bands by the size of the smaller of the two
clones). To be picked, the clone also had to be bridged to the seed clone by a third, intermediate clone that confidently (<1e-4) overlapped both the seed clone and the
candidate clone. The candidate clone was then further evaluated for fingerprint
overlap with previously selected or sequenced clones.
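
The percentage-overlap estimate described above (sum of the sizes of the common bands divided by the size of the smaller clone) can be sketched as below. The function names are hypothetical, clone size is assumed to be the sum of the clone's band sizes, and bands are paired greedily within a ±3% tolerance; none of these details are taken from the FPC program itself.

```python
def shared_bands(bands_a, bands_b, tolerance=0.03):
    """Greedily pair each band of A with one unused band of B of equivalent size."""
    remaining = sorted(bands_b)
    common = []
    for band in sorted(bands_a):
        for i, other in enumerate(remaining):
            if abs(band - other) <= tolerance * band:
                common.append(band)
                del remaining[i]  # each band of B may be used only once
                break
    return common


def percent_overlap(bands_a, bands_b, tolerance=0.03):
    """Sum of common band sizes as a percentage of the smaller clone's size."""
    common = shared_bands(bands_a, bands_b, tolerance)
    smaller = min(sum(bands_a), sum(bands_b))  # assumption: size = sum of bands
    return 100.0 * sum(common) / smaller


def is_minimal_overlap(overlap_percent, max_overlap=20.0):
    """Initial cutoff of <20%; the text notes it rose to 30% in later phases."""
    return overlap_percent < max_overlap


def is_bridged(p_seed_to_middle, p_middle_to_candidate, cutoff=1e-4):
    """The intermediate clone must confidently overlap both seed and candidate."""
    return p_seed_to_middle < cutoff and p_middle_to_candidate < cutoff
```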
Once clones were ordered within fingerprint clone contigs, a similar algorithm that
exploited the known clone order was used to pick the walking clones. This algorithm
was also adapted to pick a spanning/walking clone for complex contigs with 2 or
more clones in the sequencing pipeline, using the fingerprint map as a guide.
Subsection: Sequencing (p. 867)
Page 868, left-hand column, line 20: “By examining … 500 bp.”
The sizes of the gaps between adjacent initial sequence contigs in draft clones were
measured using alignments of the initial sequence contigs from individual draft
clones to contigs of size ≥ 40 kb from overlapping clones, usually finished clones.
10,999 gaps were examined. 1,726 gaps larger than 6,000 bp were discarded as
probable artefacts due to misassemblies or incorrect alignments. The mean size of
the gaps between the initial sequence contigs in draft clones was 554 bases. When
the cutoff for discarding gaps was lowered to 3000 bp or raised to 12,000 bp, the
mean gap size decreased to about 400 bp (estimated from 9,801 gaps) and
increased to about 800 bp (estimated from 11,972 gaps) accordingly, indicating that
there is still considerable uncertainty in the mean value. The 554 bp estimate for the
mean gap size was used, along with the number of initial sequence contigs (Table 7)
and the total number of bases in the initial sequence contigs (data not shown) to
estimate the percentage of the draft clones that were covered by the initial sequence
contigs. It was thus determined that, on average, about 96% of the draft clones was
covered; assuming a mean gap size between 400 and 800 bp, the range in coverage
is about 94-97%.
This comment also pertains to page 874, left-hand column, line 57: “Assuming that the
sequence gaps … gaps within the draft sequenced clones”
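
The coverage estimate can be illustrated with a short sketch. The exact calculation is not spelled out here, so the formula below (contig bases divided by contig bases plus estimated gap bases) is an assumption, the function name is hypothetical, and the example inputs are illustrative values rather than the figures from Table 7.

```python
def draft_coverage(total_contig_bases, n_gaps, mean_gap_bp):
    """Fraction of a draft clone covered by its initial sequence contigs.

    Assumes the clone length is the contig bases plus n_gaps gaps of the
    given mean size; gaps at the clone ends are ignored.
    """
    gap_bases = n_gaps * mean_gap_bp
    return total_contig_bases / (total_contig_bases + gap_bases)


# Illustrative inputs only: one clone with 150 kb in contigs and 11 gaps.
for mean_gap in (400, 554, 800):
    print(mean_gap, round(100 * draft_coverage(150_000, 11, mean_gap), 1))
```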
Subsection: Assembly of the draft genome (p. 868)
Page 868, right-hand column, l. 47, "To eliminate such problems, sequenced clones were
associated with the fingerprint clone contigs in the physical map…"
An FPC match statistic better than 1e-7 for the sequenced clone against the fpc
fingerprint database was considered significant, based on empirical evidence. This
match level was the weakest value used for placement when there was other
confirmatory evidence to support the placement. In the absence of additional supportive data, a match score of better than 1e-9 was required for placement. In
general, only the best match was used. Other confirmatory evidence included BAC
end matches; the BAC end sequences were obtained from NCBI (dbGSS;
http://www.ncbi.nlm.nih.gov/dbGSS/index.html). To eliminate repetitive sequences, only BAC end sequences with 15 or fewer
matches to the genomic sequence were used.
Additional information used to place clones included BAC paired-end sequence
matches, shared STS matches, and "believed" sequence overlap relationships
determined by investigators at the NCBI and at UC-Santa Cruz. In instances in which
the data led to conflicting placements, the data were weighted based on estimates of
reliability. In some cases, if there was conflicting placement data or only weak data
for placement and, according to GigAssembler, the sequenced clone failed to
overlap any clones in the assembly at their original placement positions, a placement
was attempted at secondary sites suggested by the placement data.
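
The two match thresholds and the BAC end filter amount to a simple decision rule, sketched below. The function names are hypothetical, "better than" is read as "numerically smaller than", and the weighting of conflicting evidence is not modelled.

```python
def fingerprint_match_supports_placement(match_stat, has_confirmatory_evidence):
    """Apply 1e-7 when other evidence supports the placement, 1e-9 otherwise."""
    threshold = 1e-7 if has_confirmatory_evidence else 1e-9
    return match_stat < threshold


def usable_bac_end(n_genome_matches, max_matches=15):
    """BAC ends with more than 15 genomic matches are treated as repetitive."""
    return n_genome_matches <= max_matches
```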
Page 869, left-hand column, line 48 “Of these 942 contigs with sequenced clones… “
In general, merges between fingerprint clone contigs were based primarily on
evaluation of the fingerprint data. Information about the STS map location of the
fingerprint contigs was used to prevent spurious merges, to break spurious contigs
and to suggest possible merges that had not been previously recognized. In addition,
62 contigs were merged on the basis of sequence overlap information, supported by
STS map positions.
Subsection: Quality assessment (p. 871)
Sub-subsection: Alignment of the fingerprint clone contigs (p. 873)
Page 873, right-hand column, line 28: “The positions of most of the STSs… about 1.7%
differed from one or more of them."
We localized the STS markers from seven different physical maps (the Genethon (ref. 101)
and Marshfield (http://research.marshfieldclinic.org/genetics/) genetic maps, GeneMap99 (ref. 100), the G3 and Stanford TNG radiation hybrid maps (http://www-shgc.stanford.edu/Mapping/Marker/STSindex.html), and the Whitehead YAC and radiation hybrid map (ref. 29)) on the draft genome sequence using e-PCR, allowing one mismatch
per primer and the default distance constraints between primers (50 bp deviation
from expected size of product). Only those markers that were uniquely placed on the
draft sequence were considered. There were 62,239 such markers. Of these, 1,095,
or 1.7%, were mapped by e-PCR to a chromosome of the draft sequence that was
different from the chromosome indicated by the information from a genetic or
radiation hybrid map.
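
The matching criteria (one mismatch per primer, product size within 50 bp of the expected size, unique placement) can be illustrated with a toy search. This is a naive sketch and not the e-PCR program: the function names are hypothetical, only one strand orientation is searched, and the scan is quadratic.

```python
def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def primer_hits(seq, primer, max_mismatch=1):
    """Start positions where the primer matches with at most max_mismatch mismatches."""
    k = len(primer)
    return [i for i in range(len(seq) - k + 1)
            if sum(a != b for a, b in zip(seq[i:i + k], primer)) <= max_mismatch]


def sts_hits(seq, fwd, rev, expected_size, max_mismatch=1, max_dev=50):
    """(start, product_size) pairs satisfying the mismatch and size constraints."""
    rev_site = revcomp(rev)  # the reverse primer anneals to the opposite strand
    found = []
    for i in primer_hits(seq, fwd, max_mismatch):
        for j in primer_hits(seq, rev_site, max_mismatch):
            size = j + len(rev_site) - i
            if size > 0 and abs(size - expected_size) <= max_dev:
                found.append((i, size))
    return found


def uniquely_placed(seq, fwd, rev, expected_size):
    """Only markers with exactly one hit would be considered."""
    return len(sts_hits(seq, fwd, rev, expected_size)) == 1
```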
Subsection: representation of random raw sequences (p. 874)
Page 875, left-hand column, line 9: “We compared the raw sequences … using the BLAST
We processed whole genome shotgun reads from four independently constructed
libraries as follows. All reads with fewer than 300 bases of PHRED quality 20 or
greater were removed. The remaining reads were then trimmed for vector and for
quality, looking at the 5’ end for the first window with at least 15 contiguous non-
vector bases of >PHRED20 and at the 3’ end, starting from the left cutoff, for 12
contiguous non-vector bases with … Reads that
had >95% of their trimmed bases with PHRED>20 and a length of >250 bases were
kept. The reads after trimming were composed of 40% GC base pairs. Reads were
masked for repeats using the RepeatMasker program (A.F.A. Smit & P. Green,
http://repeatmasker.genome.washington.edu/cgi-bin/RM2_req.pl) and for low entropy data using the
nseg option of BLAST (W. Gish, unpublished; http://blast.wustl.edu). Reads were
retained and used only if there were at least 100 consecutive bases of PHRED
quality 20 or greater and 100 consecutive unmasked bases.
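
The first and last of these read filters can be sketched as below. The function names are hypothetical, per-base qualities are assumed to be available as a list of PHRED scores with a parallel list of masking flags, and the two 100-base runs in the final filter are assumed to be assessed independently of each other. Vector and quality trimming are not modelled.

```python
def longest_run(flags):
    """Length of the longest run of consecutive true values."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


def passes_initial_filter(quals, min_bases=300, min_quality=20):
    """Reads with fewer than 300 bases of PHRED quality 20 or greater are removed."""
    return sum(q >= min_quality for q in quals) >= min_bases


def passes_final_filter(quals, masked, min_run=100, min_quality=20):
    """Require 100 consecutive high-quality bases and 100 consecutive unmasked bases."""
    high_quality_run = longest_run(q >= min_quality for q in quals)
    unmasked_run = longest_run(not m for m in masked)
    return high_quality_run >= min_run and unmasked_run >= min_run
```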
Based on a test data set of random reads from finished projects, the following
BLAST parameters were found to match 100% of the reads without false matches:
-filter seg S=170 S2=150 W=13 gapW=4 gapS2=150 M=5 N=-11 Q=11 R=11. The
set of masked trimmed reads was compared to the 7 October 2000 freeze of the
HTGS data set, to all of Genbank and to the TSC SNP database using BLASTN
2.0MP (W. Gish, unpublished; http://blast.wustl.edu). The highest scoring match was
aligned against the read using CROSSMATCH, demanding alignment of the full
trimmed read at ≥97% identity for genomic sequence and with appropriate
topological constraints for the SNP reads. Typically 1-2% of the matches were
eliminated by this step.
Page 875, left-hand column, line 30: “We found that 88% of the bases of these cDNAs
could be aligned ...”
We aligned the RefSeq cDNA sequences to the draft genome using the psLayout program (ref. 104) and gathered statistics on the percentage of cDNA bases that aligned at
various percent identity thresholds.
The distal 200 bases of each cDNA were not included in the computation of the
percentage of aligning bases because alignments in these regions are less reliable.
If any cDNA aligned in more than one way, each cDNA base involved in any
alignment was counted only once. At a threshold of 98% identity for the alignments,
we found that 87.9% of the cDNA bases aligned somewhere in the draft genome.
When the threshold was increased to 99% identity, the percentage of aligning bases
fell to 85.83%, and when the threshold was decreased to 97% identity, it rose to
88.5%. Further decreases in the threshold all the way down to 90% identity only
increased the percentage of aligning bases one more percentage point, so the value
of approximately 88% aligning bases, achieved by requiring 98% identity, represents
a knee in the curve.
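
The counting rule for aligned cDNA bases can be sketched as follows. The function name and the alignment tuple layout are hypothetical, and "distal 200 bases" is assumed here to mean the last 200 bases of the cDNA; this is an illustration of the counting rule, not of psLayout.

```python
def aligned_fraction(cdna_length, alignments, min_identity=98.0, trim=200):
    """Fraction of scored cDNA bases covered by at least one qualifying alignment.

    alignments: (start, end, percent_identity) in 0-based, half-open cDNA
    coordinates. Each base is counted once, however many alignments cover it.
    """
    scored_end = cdna_length - trim  # assumption: the excluded bases are the last 200
    if scored_end <= 0:
        return 0.0
    covered = set()
    for start, end, identity in alignments:
        if identity >= min_identity:
            covered.update(range(max(start, 0), min(end, scored_end)))
    return len(covered) / scored_end


example = [(0, 600, 99.0), (500, 900, 98.5), (900, 1200, 95.0)]
for threshold in (99.0, 98.0, 95.0):
    print(threshold, aligned_fraction(1200, example, threshold))
```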
Section: Broad genomic landscape (p. 875)
page 876, right-hand column, line 9: “In addition, the human cytogenetic map ...”
The locations of the cytogenetically mapped clones on the draft genome sequence
can be viewed at http://genome.ucsc.edu/goldenPath/mapPlots. Further information about the
individual clones can be obtained at http://www.ncbi.nlm.nih.gov/genome/cyto/ and
http://www.ncbi.nlm.nih.gov/genome/guide. Here, as well as on the browser at
http://genome.ucsc.edu and http://www.ensembl.org/ , they can be viewed in the context of other
Subsection: Long-range variation in GC content (p. 876)
Page 877, left-hand column, line 30 “About three-quarters of the genome-wide variance… consistent with a homogeneous distribution”
All 3,312 windows of length 300 kb that had at least eight gap-free 20 kb
subwindows and did not contain more than 50% simple repeats were extracted from
the draft genome sequence. The average sample variance of the GC content of the
subwindows of a window was 7.3%. The sample variance of all subwindows
genome-wide (N = 36,562) was 27.4%. Hence, the variance of GC content within
the 20 kb subwindows of a 300 kb window accounts for approximately one quarter of
the overall variance of the GC content among all 20 kb subwindows in this sample.
The average sample standard deviation of the GC content of the subwindows of a
window was 2.4%.
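
The comparison of within-window and genome-wide variance can be sketched with the standard library. The function name is hypothetical, and the input is assumed to be a list of windows, each a list of subwindow GC percentages.

```python
from statistics import mean, variance


def within_window_share(windows):
    """Average within-window sample variance, overall sample variance, and their ratio."""
    average_within = mean(variance(w) for w in windows)
    overall = variance([gc for w in windows for gc in w])
    return average_within, overall, average_within / overall


# With the values reported above, the ratio is 7.3 / 27.4, or roughly one quarter.
print(round(7.3 / 27.4, 3))
```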
Page 877, left-hand column, line 34: “In fact, the hypothesis … draft genome sequence.”
For each of the 3,312 windows of length 300 kb, we tested the hypothesis that its 20
kb subwindows were sampled from a homogeneous GC distribution. The distribution
was defined to have mean m equal to the GC-content in the combined subwindows
of the 300 kb window, and the bases were taken as independent. Under this
distribution, the GC-content of a 20 kb subwindow would have mean m and variance s² = m(100-m)/20000. For m = 41%, the typical value, this gives s² = 0.121%, which
is about 0.017 times the average sample variance of 7.3%. For each window, the variance s² and the sample variance ŝ² were determined, along with the value χ² = (n-1) ŝ²/s², where n is the number of subwindows of the window. Under the hypothesis of homogeneity, the statistic χ² should have an approximately chi-square
distribution with n-1 degrees of freedom. However, for every one of the 3,312 windows, χ² > 31.5, which rejects the hypothesis of homogeneity with p-value >>
Another way to test the hypothesis of homogeneity is to look in each 300 kb window for one 20 kb subwindow whose GC content differs significantly from the mean m for that window. In these tests, all 300 kb windows with less than 50% simple repeats and less than 25% gaps were tested (N = 10,596). Under the assumptions above, if X is the GC content of a subwindow, then D = (X-m)/sqrt[m(100-m)/20000] should have an approximately normal distribution. However, in all but four windows there is a subwindow with |D| > 3.0, i.e. the GC content of the subwindow is more than 3.0 standard deviations from the mean of the window. The p-value for such a deviation is 0.0026. Considering that there are 15 possible subwindows, this gives an overall p-value of 0.039, i.e. the hypothesis of homogeneity is rejected with confidence greater than 0.96.
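
Both homogeneity tests can be sketched in a few lines. The function names are hypothetical, the window mean m is approximated by the mean of equally sized subwindows, and the overall p-value is assumed to be a Bonferroni-style product of the single-test p-value and the number of subwindows.

```python
import math


def binomial_variance(m, subwindow_bp=20000):
    """Variance (in percent squared) of GC content under independent bases."""
    return m * (100.0 - m) / subwindow_bp


def chi_square_statistic(gc_values):
    """(n-1) * sample variance / expected variance for one window."""
    n = len(gc_values)
    m = sum(gc_values) / n
    sample_variance = sum((x - m) ** 2 for x in gc_values) / (n - 1)
    return (n - 1) * sample_variance / binomial_variance(m)


def max_abs_deviation(gc_values):
    """Largest |D| = |X - m| / sqrt(m(100-m)/20000) among the subwindows."""
    m = sum(gc_values) / len(gc_values)
    sd = math.sqrt(binomial_variance(m))
    return max(abs(x - m) / sd for x in gc_values)


def two_sided_normal_p(d):
    """P(|Z| > d) for a standard normal variable Z."""
    return math.erfc(d / math.sqrt(2.0))


def overall_p(single_p, n_subwindows=15):
    """Bonferroni-style bound over the subwindows of one window."""
    return min(1.0, single_p * n_subwindows)
```

For m = 41%, binomial_variance gives about 0.121, matching the value above. The standard-library normal tail gives a two-sided p-value of about 0.0027 for |D| > 3.0, close to the 0.0026 quoted, and overall_p(0.0026) reproduces 0.039.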
The above analysis was repeated using 5 kb subwindows of 300 kb windows, and the hypothesis of homogeneity was rejected for all windows with confidence greater than 0.96, and with greater confidence for those windows tested with the chi-square test. Similar results were also obtained for 5 kb subwindows of 100 kb windows: all but thirteen windows were rejected with confidence greater than approximately 0.95, and all but three were rejected from those examined with the chi-square test. Since any region of 200 kb must contain one of the regions of 100 kb we tested for homogeneity, this indicates that there are few if any regions of 200 kb in the genome with homogeneous GC content.
Page 877, right-hand column, line 25: “Estimated band locations …”
Bands were assigned by a dynamic programming algorithm that attempted to maximize the number of cytogenetically mapped clones that lie within the range of possible sub-bands predicted from FISH, with special emphasis on high-resolution FISH-mapped clones provided by investigators at the National Cancer Institute (ref. 103). The band positions were optimized subject to the constraint that the bands must appear in the known order along the draft genome sequence. Slight penalties for band size deviation from the standard fractional sizes were also imposed, so that in the absence of any FISH-mapped clones at all in a particular region, and given that there are no constraints from surrounding regions, the program would produce sub-bands corresponding to the standard fractional band lengths.
Section: Repeat content of the human genome (p. 879)
Subsection: Distribution of GC content (p. 884)
Concerning the subdivision of the draft genome sequence into 50 kb pieces of similar GC level: the same results will be obtained however the sequence is subdivided, as long as the fragments are around 50 kb long. Specifically, for the analyses shown in Figures 22 to 26, the draft genome sequence was subdivided into fragments of 40-60 kb (averaging 50 kb) overlapping by 1 kb. These fragments were created on the fly by the RepeatMasker program, and for each a repeat analysis was done. The repeat information files were grouped by the GC level of the fragment, and processed according to need.
For the analyses shown in Figures 23 and 25, the number of repeat copies was
compared. The number of individual insertions per megabase of DNA of a particular
GC level was extracted from the RepeatMasker output (RepeatMasker provides
information on which fragments originated from the same inserted transposable
element). The Y axis is the ratio of the frequency of Alu (fig 23) or LINE1 (fig 25) over
the average frequency of these elements in the genome.
Subsection: Segmental Duplications (p. 889)
Our assessment of low copy repeats (genomic duplications) within the draft genome
sequence involved a global analysis of all non-overlapping sequence. The analysis
used a combination of DNA sequence analysis software and a suite of Perl scripts
developed for paralogy detection (J. A. Bailey and E. E. Eichler, in preparation).
The basic methodology included: repeatmasking (RepeatMasker v.4/20) of all
reference sequences for common repeats, the removal and splicing of such repeat
segments, global BLAST analysis of the segments for the identification of non-
overlapping high-scoring segments, using relaxed affine gapping parameters which
allowed large gaps up to 1 kb to be traversed (parameters: -G 180 -E 1 -q -80 -r 30
-z 3000000000 -Y 3000000000 -e 1e-10 -F F), the reintroduction of common
repeat elements into each pairwise alignment followed by optimal global alignment
of the segments using the program ALIGN (E.W. Myers and W. Miller, CABIOS
(1989) 4:11-17). To detect internal duplications within each query segment, a
modified version of BLASTZ (W. Miller, unpublished) was used with similar relaxed
gap parameters (B=2 M=30 I=-80 V=-80 O=180 E=1 W=14 Y=1400). Alignment
statistics were generated (program:ALIGN_SCORER), and alignments that equaled or
exceeded the threshold of 1000 bases aligned with over 90% similarity (i.e. gaps
excluded) were analyzed. Generation of global alignments also acted as a
safeguard against false positives from BLAST analysis. In cases of extremely large
gaps (>1 kb), alignments were fractured. Such cases were detected and merged for
gaps up to 20 kb.
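
The alignment threshold and the merging of fractured alignments can be sketched as below. The function names and the tuple layout are hypothetical, and the merge assumes colinear segments in the same orientation; the scripts referred to above may differ in detail.

```python
def passes_duplication_threshold(aligned_bases, matching_bases,
                                 min_bases=1000, min_similarity=90.0):
    """At least 1000 aligned bases with over 90% similarity, gaps excluded."""
    similarity = 100.0 * matching_bases / aligned_bases
    return aligned_bases >= min_bases and similarity > min_similarity


def merge_fractured(segments, max_gap=20000):
    """Merge consecutive (q_start, q_end, s_start, s_end) segments.

    Two segments are joined when the gap between them is at most max_gap
    in both the query and the subject coordinates.
    """
    merged = []
    for seg in sorted(segments):
        if merged:
            q0, q1, s0, s1 = merged[-1]
            if 0 <= seg[0] - q1 <= max_gap and 0 <= seg[2] - s1 <= max_gap:
                merged[-1] = (q0, seg[1], s0, seg[3])
                continue
        merged.append(tuple(seg))
    return merged
```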
Subsection: Pericentromeres and telomeres (p. 890)
Chromosome 22 (May 2000, Sanger Centre) and Chromosome 21 (Sept., NCBI)
were analyzed for large duplications as described. For interchromosomal
duplications, each chromosome was analyzed versus the NT accession contigs (NCBI)
and versus all remaining HTGS accessions (draft and finished).
A final global alignment threshold (>90%; ≥1000 bases) was used.
Due to unassembled allelic overlaps, sequences containing highly similar alignments
(>99.5% NT; >99.0% HTGS) were excluded as probable allelic overlaps. The
duplicated sequences for chromosome 21 and chromosome 22 were graphically
viewed using the program PARASIGHT (J. A. Bailey and E.E. Eichler, in preparation).
Subsection: Genome-wide analysis of segmental duplications. (p. 891)
Finished sequence included all assembled sequence from NCBI within the NT
dataset (version of 5 September 2000). A global alignment threshold (>90%; ≥1000
bases) was used for comparisons between finished sequence. Further selection limited alignments for analyses to those less than 99.5% identity, as those greater than that were likely to represent unassembled allelic overlaps.
The 15 July 2000 version of the draft genome sequence was used as the basis for the duplication analysis of the entire human draft. A final global alignment threshold (>90%, ≥1000 bases and <98%) defined the limits of detection for duplicated sequence. Sequence alignments (>98%) appear to represent mainly missed allelic overlaps, many of which were subsequently merged in later releases of the assembly (e.g. 7 October 2000). Final validation of duplicated segments >98% within the
working draft will require finished sequence data and/or experimental validation.
Section: Gene content of the human genome (p. 892)
Subsection: Noncoding RNAs (p. 892)
To identify transfer RNA genes, we used tRNAscan-SE version 1.21 [T.M. Lowe, S.R. Eddy. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955-964 (1997)] to analyze the 7 October
2000 version of the draft genome sequence. tRNAscan-SE predicted 504 tRNA genes and 144 tRNA-derived pseudogenes. Three of the predicted genes had a non-canonical anticodon loop length, preventing tRNAscan-SE from unambiguously identifying the anticodon; although there are many possible explanations for them, for our current purposes we classified these as probable pseudogenes. After manual examination of the tRNAs with unlikely anticodons, four more of the predicted genes were also classified as probable pseudogenes: a putative UAA suppressor, a putative UAG suppressor, and two putative UGA-reading selenocysteine tRNAs. The remaining gene predictions were not examined manually. We know that a small number of the 497 "true" tRNA genes are likely to be pseudogenes or parts of tRNA-derived repetitive sequence elements because tRNAscan-SE's ability to separate pseudogenes from true genes is not perfect. Because tRNAscan-SE models tRNA consensus secondary structure, it is not a reliable detector of divergent tRNA pseudogenes. To more accurately estimate the number of tRNA-derived pseudogenes, all 648 sequences detected by tRNAscan-SE were used as WU-BLASTN queries (see below), and another 173 significantly related sequences were detected, bringing the estimated pseudogene count to 324.
To identify all ncRNA homologues other than tRNA genes, we performed sequence similarity searches using WashU BLASTN 2.0MP (W. Gish, unpublished; http://blast.wustl.edu) on the 7 October 2000 genome assembly, with parameters "-kap wordmask=seg B=50000 W=8" and the default DNA scoring matrix. True genes were operationally defined as BLAST hits with ≥95% identity over ≥95% of the length of the query. Related sequences (e.g. pseudogenes) were operationally defined as all other BLAST hits with P-values <= 0.001.
To reconcile our tRNA gene count of 497 with the larger number of 1310 generally found in textbook references, we reexamined the primary data in a classic paper by Hatlen and Attardi (ref. 252). The
textbook estimate of 1310 human tRNA genes was based on their observation that
purified and labelled human 4S RNA (i.e. the tRNA population) hybridizes to HeLa genomic DNA and saturates at a fraction of about 1.1x10^-5 of the genome. The molecular weight of the human genome was thought at that time to be 3.1x10^12
(about 4.7 billion bases). Recalculation using the current estimated genome size of
3.2 billion bases [T.R. Tiersch, R.W. Chandler, S.S. Wachtel, S. Elias. Reference
standards for flow cytometry and application in comparative studies of nuclear DNA
content. Cytometry 10, 706-710 (1989); this paper] gives an estimate of 890 tRNA-complementary loci instead of 1310. Hatlen and Attardi also noted, but at the time
could not explain, a puzzling length heterogeneity in their hybridized genomic loci.
We believe that they were observing the tRNA pseudogene population, many of
which are truncated copies of tRNA genes; therefore we believe their hybridization-
based estimate of ~890 loci included tRNA pseudogenes (of which we count 324 in
the genome) in addition to the true tRNA genes (of which we count 497 in the genome).
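
The counts quoted above can be checked with a few lines of arithmetic. The variable names are ours, and the rescaling assumes that the hybridization-based estimate is proportional to the assumed genome size.

```python
# Counts reported by tRNAscan-SE and the reclassifications described above.
predicted_genes = 504
predicted_pseudogenes = 144
noncanonical_loop = 3        # reclassified as probable pseudogenes
manually_reclassified = 4    # two suppressors and two selenocysteine tRNAs
blast_only_relatives = 173   # found only by the WU-BLASTN searches

true_genes = predicted_genes - noncanonical_loop - manually_reclassified
pseudogenes = (predicted_pseudogenes + noncanonical_loop
               + manually_reclassified + blast_only_relatives)

# Rescale the textbook estimate from a 4.7 Gb genome to a 3.2 Gb genome.
textbook_estimate = 1310
rescaled_estimate = textbook_estimate * 3.2e9 / 4.7e9

print(true_genes, pseudogenes, round(rescaled_estimate))
```

The rescaled value of about 890 loci is somewhat larger than the 821 tRNA-related loci (497 genes plus 324 pseudogenes) counted in the draft sequence.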
Subsection: Protein-coding genes (p. 896)
Sub-subsection: Exploring properties of known genes (p. 896)
Known genes were aligned with Spidey (S. Wheelan et al., manuscript in preparation)
and Acembly (D. Thierry-Mieg and J. Thierry-Mieg, unpublished; http://www.acedb.org/ ),
which in both cases align the cDNA to the genome while allowing for introns. The
results from the two programs were in broad agreement. 5,364 RefSeq entries
(from the 1 September 2000 release) were used as a source of the cDNAs. The
alignments of the cDNAs to the genome could be classified by the proportion of the
cDNA that aligned to the genome and by the percentage of identical nucleotides
between the cDNA and the genomic sequence. In most cases, there was an
unambiguous location for a cDNA. However, some proportion at each level of
coverage had more than one site with high identity matches; in these cases, one of
the locations was arbitrarily chosen.
Sub-subsection: Towards a complete index of human genes (p. 898)
Creating an initial gene index (p. 899)
Ensembl: Ensembl aims to predict coding sequences of true genes with high
confidence, by only predicting coding sequence regions which have confirming
evidence across their entire length. The sources of confirmation are cDNA, EST and
protein-based similarity. The Genscan computer program was run across the
individual fragments of the genome and the resulting peptides were used to search
vertebrate mRNA sources (extracted from the EMBL databank;
http://www.ebi.ac.uk/index.html), EST (vertebrate dbEST; ftp://ncbi.nlm.nih.gov/genbank ) and a
non-redundant protein database (SWIR; http://www.ebi.ac.uk/swissprot/ ). Protein hits of
greater than 200 bits similarity were then further processed by using the GeneWise
program with the similar protein against the assembled draft genome sequence (the
17 July 2000 version). A final gene-building method was then used to merge all the
resulting information, being Genscan predictions with confirming similarity at a
number of exons and the GeneWise gene predictions. The method only accepted a
join between two exons if consistent similarity evidence was found on each exon, with the following thresholds: (a) all GeneWise predictions were accepted, although redundant GeneWise predictions were discarded; and (b) exons predicted by Genscan required a single protein or cDNA similarity of at least 100 bits, or at least two EST hits of at least 100 bits. This final process allows for alternative splicing, although modeling alternative splicing has not been optimised. Ensembl produced 35,500 gene predictions with 44,860 transcripts.
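
The evidence rule for Genscan-predicted exons can be written as a small predicate. This is a sketch of the stated thresholds only, not Ensembl code; the function name and the representation of hits as lists of bit scores are hypothetical.

```python
def genscan_exon_supported(protein_or_cdna_bits, est_bits, min_bits=100.0):
    """One protein or cDNA hit of at least 100 bits, or two such EST hits."""
    has_strong_hit = any(bits >= min_bits for bits in protein_or_cdna_bits)
    has_two_ests = sum(bits >= min_bits for bits in est_bits) >= 2
    return has_strong_hit or has_two_ests
```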
Merge procedure to produce a final protein set: To generate a single protein set for further analysis we merged the known protein sequences from RefSeq (version of 29 Sept 2000), SWISSPROT (Release 39.6 of 30th Aug 2000), TREMBL (TrEMBL Release 14.17 of 1 Oct 2000) and TREMBL_NEW (1 Oct 2000) with the gene predictions. The later protein analysis required a non-redundant protein set where genes were represented as a single protein sequence; in the case of alternative splicing, a single, representative protein sequence was required. We are aware of the obvious limitations of this representation of the human proteome, but accommodating alternative splicing in the downstream analysis was very complex.
The genome prediction data set was prepared as follows: the Ensembl and Genie predictions were merged by examining overlap of coding exons in genomic coordinates. Two gene predictions were merged if a single coding exon on the same strand overlapped. From this set of merged predictions, we used only the Ensembl+Genie and the Ensembl-only predictions. In cases where there was more than one prediction, or for Ensembl genes, more than one transcript, we chose the longest protein sequence from each merged unit to represent the gene. The protein level merge then occurred by comparing the union of all the data sources in an all-vs-all FASTA comparison using default parameters. Two protein sequences were merged if the match covered at least 95% of the shorter sequence, and identity was ≥ 95%, which takes into account both nearly identical protein sequences and also nearly identical fragments.
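
The two merge criteria can be sketched as follows. The function names and tuple layouts are hypothetical, exon coordinates are assumed to be inclusive, and the match length and identity are assumed to come from the FASTA comparison.

```python
def exons_overlap(exon_a, exon_b):
    """Exons are (strand, start, end); overlap requires the same strand."""
    strand_a, start_a, end_a = exon_a
    strand_b, start_b, end_b = exon_b
    return strand_a == strand_b and start_a <= end_b and start_b <= end_a


def predictions_overlap(exons_a, exons_b):
    """Two gene predictions are merged if any pair of coding exons overlaps."""
    return any(exons_overlap(a, b) for a in exons_a for b in exons_b)


def proteins_merge(length_a, length_b, match_length, percent_identity):
    """Match covers at least 95% of the shorter protein at >= 95% identity."""
    shorter = min(length_a, length_b)
    return match_length >= 0.95 * shorter and percent_identity >= 95.0
```

Basing the coverage test on the shorter sequence is what lets a nearly identical fragment merge with its full-length counterpart.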
Special attention was needed to prevent overrepresentation of alternative splice forms. Firstly, we expanded the Swissprot and Trembl databases to represent known splice variants in the protein merge, but only took a single protein (the canonical database sequence) for the final protein set. An additional cull for alternative splice forms which remained as separate proteins was produced by taking the corresponding DNA sequences of the known proteins (RefSeq, SWISSPROT, TREMBL and TREMBL_NEW) and matching back to the genome using the SSAHA program without requiring a valid gene structure alignment. If the DNA derived from two protein sequences matched at over 28 base pairs at the same location, the longer protein sequence was used. Finally, clear bacterial contamination (proteins which had an almost identical match to a bacterial protein) was removed.
Quality Control on the protein set: We took 31 genes which we could confirm as being unavailable at the time of the gene builds (22 from RefSeq, 9 from the Sanger Centre gene identification program on chromosome X). 3 of the 31 sequences could