
From Bob Waterston, David Haussler (sections 3, 4)

By Adam Andrews,2014-05-07 10:48

    Supplementary Information for Initial Sequencing and Analysis of the

    Human Genome.

    International Human Genome Sequencing Consortium.

Methods and additional notes

    Section: Generating the draft genome sequence (p. 864)

     Subsection: Clone selection (p. 865)

    Page 866 col. 2, para. 3: “Fingerprint data were reviewed … bias against rearranged clones).”

    Seed clones were picked from the growing contigs as follows: We began by

    identifying fingerprint clone contigs that had been localized to targeted locations and

    that did not contain any clones that had previously been selected for sequencing.

    Contigs were localized using mapping data from a variety of sources that could be

    attached to the fingerprinted clones, including STS/hybridization data from McPherson and colleagues (ref. 86), FISH data from several sources (C. McPherson et al.; refs 92, 95, 103), STS/PCR mapping data from several sources, electronic PCR data

    (http://www.ncbi.nlm.nih.gov/STS/) matching the BAC end sequences with mapped STSs

    and others. Beginning with the largest available clone in a valid contig (clones >250 kb were excluded to avoid artifacts), the FPC program (ref. 451) evaluated the fingerprints

    of all of the clones in the contig to determine the largest clone for which all (but 2) of the

    individual bands in the restriction fragment pattern were shared with (confirmed by; that is,

    having a band of equivalent size, within ±3%) bands in the patterns of

    flanking clones (again, ignoring flanking clones >250 kb). (Since the

    restriction enzyme used to produce the clone inserts is different than the enzyme

    used to produce the fingerprints, two bands may arise from the insert-vector junction,

    which are not found in the genome or in flanking clones.) Selected clones were then

    checked for excessive overlap with previously selected or sequenced clones and

    with each other. The allowable overlap at this stage was varied to suit the demands

    of the project.

    Clones (walking clones) extending from seed or other selected clones were selected

    as follows: In the early phases of the effort, clones were not necessarily correctly

    ordered within a fingerprint clone contig and indeed not all of the available clones

    had necessarily been incorporated into the contig. Starting with a previously

    selected (seed) clone, the FPC program compared the restriction fragment pattern of

    that clone with the patterns of all of the clones in the fingerprint database that

    overlapped with the seed clone. It then iteratively analyzed the clones identified in

    the first round of analysis to identify the additional clones that overlapped with those.

    In this way, a set of overlapping clones was identified and the clones in the set were

    ordered based on their overlap statistics. After ordering, all of the valid clones were

    identified (valid clones were defined as those with all but three of their bands

    confirmed by clones within 4 clones on either side). Any clone that also had outside

    evidence of overlap, e.g. through BAC end sequence matches or shared

    STS/hybridization data was selected for further evaluation. In cases with more than

    one clone with such outside evidence, the clone with the lowest overlap statistic (i.e.,

    the one that was least redundant) was selected (in the case of ties, the largest clone

    was favored). Where there was no outside evidence, a clone was picked based on

    evaluation of the overlaps. The candidate clone was the first one that was found to

    have the minimal overlap with the seed clone (initially <20% overlap, rising to 30% in

    later phases of the mapping effort; the percentage overlap was estimated by dividing

    the sum of the sizes of the common bands by the size of the smaller of the two

    clones). To be picked, the clone also had to be bridged to the seed clone by a third, intermediate clone that confidently (<1e-4) overlapped both the seed clone and the

    candidate clone. The candidate clone was then further evaluated for fingerprint

    overlap with previously selected or sequenced clones.
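
    The overlap estimate described above (sum of the sizes of the common bands divided by the size of the smaller clone) can be sketched as follows. This is only an illustration and not the FPC implementation: the greedy band matching, the function names, the example band sizes and the use of a 3% size tolerance for calling two bands equivalent are all assumptions made for the example.

```python
def common_band_size(bands_a, bands_b, tol=0.03):
    """Sum the sizes of bands in A that have a partner of equivalent size in B.

    Each band in B is used at most once. The greedy matching and the 3%
    tolerance are assumptions of this sketch.
    """
    remaining = sorted(bands_b)
    total = 0
    for size in sorted(bands_a):
        for i, other in enumerate(remaining):
            if abs(size - other) <= tol * size:
                total += size
                del remaining[i]
                break
    return total


def percent_overlap(bands_a, bands_b, tol=0.03):
    """Shared band size divided by the size of the smaller clone, in percent."""
    smaller, larger = sorted((bands_a, bands_b), key=sum)
    return 100.0 * common_band_size(smaller, larger, tol) / sum(smaller)


# Hypothetical restriction fragment sizes in bp.
seed = [1000, 2000, 3000, 4000, 5000]
candidate = [2010, 7000, 8000, 9000]   # shares one ~2 kb band with the seed
overlap = percent_overlap(seed, candidate)
print(f"{overlap:.1f}% overlap")
```

    With these made-up fingerprints the estimated overlap falls below the initial 20% ceiling, so the candidate would pass this particular check.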

    Once clones were ordered within fingerprint clone contigs, a similar algorithm that

    exploited the known clone order was used to pick the walking clones. This algorithm

    was also adapted to pick a spanning/walking clone for complex contigs with 2 or

    more clones in the sequencing pipeline, using the fingerprint map as a guide.

     Subsection: Sequencing (p. 867)

Page 868, left-hand column, line 20: “By examining … 500 bp.”

    The sizes of the gaps between adjacent initial sequence contigs in draft clones were

    measured using alignments of the initial sequence contigs from individual draft

    clones to contigs of size ≥ 40 kb from overlapping clones, usually finished clones.

    10,999 gaps were examined. 1,726 gaps larger than 6,000 bp were discarded as

    probable artefacts due to misassemblies or incorrect alignments. The mean size of

    the gaps between the initial sequence contigs in draft clones was 554 bases. When

    the cutoff for discarding gaps was lowered to 3000 bp or raised to 12,000 bp, the

    mean gap size decreased to about 400 bp (estimated from 9,801 gaps) and

    increased to about 800 bp (estimated from 11,972 gaps) accordingly, indicating that

    there is still considerable uncertainty in the mean value. The 554 bp estimate for the

    mean gap size was used, along with the number of initial sequence contigs (Table 7)

    and the total number of bases in the initial sequence contigs (data not shown) to

    estimate the percentage of the draft clones that were covered by the initial sequence

    contigs. It was thus determined that, on average, about 96% of the draft clones was

    covered; assuming a mean gap size between 400 and 800 bp, the range in coverage

    is about 94-97%.
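
    The coverage calculation can be illustrated with a short sketch. The actual contig counts and base totals are not given here (data not shown), so the clone below is hypothetical, and the formula (covered bases divided by covered bases plus estimated gap bases) is an assumption about how such an estimate could be made.

```python
def covered_fraction(contig_bases, n_contigs, n_clones, mean_gap):
    """Estimate the fraction of draft clone sequence covered by contigs.

    Assumes every clone with k initial sequence contigs has k - 1 gaps,
    each of the given mean size.
    """
    n_gaps = n_contigs - n_clones
    gap_bases = n_gaps * mean_gap
    return contig_bases / (contig_bases + gap_bases)


# Hypothetical draft clone: 160,000 sequenced bases in 12 contigs.
for mean_gap in (400, 554, 800):
    frac = covered_fraction(160_000, n_contigs=12, n_clones=1, mean_gap=mean_gap)
    print(f"mean gap {mean_gap} bp: {100 * frac:.1f}% covered")
```

    For this made-up clone the three mean gap sizes give coverage values in the same general range as the figures quoted above.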

This comment also pertains to page 874, left-hand column, line 57: “Assuming that the

    sequence gaps … gaps within the draft sequenced clones”

     Subsection: Assembly of the draft genome (p. 868)

Page 868, right-hand column, l. 47, "To eliminate such problems, sequenced clones were

    associated with the fingerprint clone contigs in the physical map…"

    An FPC match statistic better than 1e-7 for the sequenced clone against the FPC

    fingerprint database was considered significant, based on empirical evidence. This

    match level was the weakest value used for placement when there was other

    confirmatory evidence to support the placement. In the absence of additional supportive data, a match score of better than 1e-9 was required for placement. In

    general, only the best match was used. Other confirmatory evidence included BAC

    end matches; the BAC end sequences were obtained from NCBI (dbGSS;

    http://www.ncbi.nlm.nih.gov/dbGSS/index.html). Only BAC end sequences with 15 or fewer

    matches to the genomic sequence were used to eliminate repetitive sequences.

    Additional information used to place clones included BAC paired-end sequence

    matches, shared STS matches, and "believed" sequence overlap relationships

    determined by investigators at the NCBI and at UC-Santa Cruz. In instances in which

    the data led to conflicting placements, the data were weighted based on estimates of

    reliability. In some cases, if there was conflicting placement data or only weak data

    for placement and, according to GigAssembler, the sequenced clone failed to

    overlap any clones in the assembly at their original placement positions, a placement

    was attempted at secondary sites suggested by the placement data.
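
    The two match-statistic thresholds used for placement (1e-7 with confirmatory evidence, 1e-9 without) amount to a simple decision rule, sketched below. The function name and the boolean evidence flag are hypothetical, and the weighting of conflicting evidence described above is not modelled.

```python
def passes_fpc_threshold(match_stat, has_confirming_evidence):
    """Return True if an FPC match statistic is good enough for placement.

    A smaller statistic is a better match. The cutoff is relaxed to 1e-7
    when other evidence (e.g. BAC end or STS matches) supports the placement.
    """
    threshold = 1e-7 if has_confirming_evidence else 1e-9
    return match_stat < threshold


print(passes_fpc_threshold(5e-8, has_confirming_evidence=True))
print(passes_fpc_threshold(5e-8, has_confirming_evidence=False))
```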

Page 869, left-hand column, line 48 “Of these 942 contigs with sequenced clones… “

    In general, merges between fingerprint clone contigs were based primarily on

    evaluation of the fingerprint data. Information about the STS map location of the

    fingerprint contigs was used to prevent spurious merges, to break spurious contigs

    and to suggest possible merges that had not been previously recognized. In addition,

    62 contigs were merged on the basis of sequence overlap information, supported by

    STS map positions.

     Subsection: Quality assessment (p. 871)

     Sub-subsection: Alignment of the fingerprint clone contigs (p. 873)

Page 873, right-hand column, line 28: “The positions of most of the STSs… about 1.7%

    differed from one or more of them."

     We localized the STS markers from seven different physical maps (the Genethon (ref. 101)

    and Marshfield (http://research.marshfieldclinic.org/genetics/ ) genetic maps, the GeneMap99 (ref. 100), the G3 and Stanford TNG radiation hybrid maps (http://www-shgc.stanford.edu/Mapping/Marker/STSindex.html), and the Whitehead YAC and radiation hybrid map (ref. 29)) on the draft genome sequence using e-PCR, allowing one mismatch

    per primer and the default distance constraints between primers (50 bp deviation

    from expected size of product). Only those markers that were uniquely placed on the

    draft sequence were considered. There were 62,239 such markers. Of these, 1,095,

    or 1.7%, were mapped by ePCR to a chromosome of the draft sequence that was

    different from the chromosome indicated by the information from a genetic or

    radiation hybrid map.

     Subsection: representation of random raw sequences (p. 874)

    Page 875, left-hand column, line 9: “We compared the raw sequences … using the BLAST

    computer program.”

We processed whole genome shotgun reads from four independently constructed

    libraries as follows. All reads with fewer than 300 bases of PHRED quality 20 or

    greater were removed. The remaining reads were then trimmed for vector and for

    quality, looking at the 5’ end for the first window with at least 15 continuous non-

    vector bases of >PHRED20 and at the 3’ end, starting from the left cutoff, for 12

    contiguous non-vector bases with

    had >95% of their trimmed bases with PHRED>20 and a length of >250 bases were

    kept. The reads after trimming were composed of 40% GC base pairs. Reads were

    masked for repeats using the RepeatMasker program (A.F.A. Smit & P. Green,

    http://repeatmasker.genome.washington.edu/cgi-bin/RM2_req.pl) and for low entropy data using the

    nseg option of BLAST (W. Gish, unpublished; http://blast.wustl.edu ). Reads were

    retained and used only if there were at least 100 consecutive bases of PHRED

    quality 20 or greater and 100 consecutive unmasked bases.
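
    The final retention rule can be sketched as a pair of run-length checks. This is an illustration only: the per-base list representation and function names are hypothetical, and the sketch assumes the high-quality run and the unmasked run are evaluated independently, which the text does not state explicitly.

```python
def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for ok in flags:
        current = current + 1 if ok else 0
        best = max(best, current)
    return best


def keep_read(quals, masked, min_run=100, min_q=20):
    """Apply the retention rule to one trimmed read.

    quals  -- per-base PHRED quality scores
    masked -- per-base booleans, True where repeat or low-entropy masking applied
    """
    good_quality = longest_run(q >= min_q for q in quals)
    unmasked = longest_run(not m for m in masked)
    return good_quality >= min_run and unmasked >= min_run


# Hypothetical read: 120 good bases then 50 poor ones; the last 70 bases are masked.
quals = [30] * 120 + [10] * 50
masked = [False] * 100 + [True] * 70
print(keep_read(quals, masked))
```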

Based on a test data set of random reads from finished projects, the following

    BLAST parameters were found to match 100% of the reads without false matches:

    -filter seg S=170 S2=150 W=13 gapW=4 gapS2=150 M=5 N=-11 Q=11 R=11. The

    set of masked trimmed reads was compared to the 7 October 2000 freeze of the

    HTGS data set, to all of Genbank and to the TSC SNP database using BLASTN

    2.0MP (W. Gish, unpublished; http://blast.wustl.edu). The highest scoring match was

    aligned against the read using CROSSMATCH, demanding alignment of the full

    trimmed read at ≥97% identity for genomic sequence and with appropriate

    topological constraints for the SNP reads. Typically 1-2% of the matches were

    eliminated by this step.

    Page 875, left-hand column, line 30: “We found that 88% of the bases of these cDNAs

    could be aligned ...”

We aligned the RefSeq cDNA sequences to the draft genome using the psLayout program (ref. 104) and gathered statistics on the percentage of cDNA bases that aligned at

    various percent identity thresholds.

The distal 200 bases of each cDNA were not included in the computation of the

    percentage of aligning bases because alignments in these regions are less reliable.

    If any cDNA aligned in more than one way, each cDNA base involved in any

    alignment was counted only once. At a threshold of 98% identity for the alignments,

    we found that 87.9% of the cDNA bases aligned somewhere in the draft genome.

    When the threshold was increased to 99% identity, the percentage of aligning bases

    fell to 85.83%, and when the threshold was decreased to 97% identity, it rose to

    88.5%. Further decreases in the threshold all the way down to 90% identity only

    increased the percentage of aligning bases one more percentage point, so the value

    of approximately 88% aligning bases, achieved by requiring 98% identity, represents

    a knee in the curve.
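
    Counting each cDNA base only once, however many alignments include it, is a union-of-intervals computation. The sketch below illustrates it under stated assumptions: alignments are half-open (start, end) intervals in cDNA coordinates, and "distal" is taken to mean the last 200 bases at the 3' end. Neither detail is specified in the text, and this is not the code used for the analysis.

```python
def aligned_percent(cdna_length, alignments, trim=200):
    """Percentage of cDNA bases covered by at least one alignment.

    alignments -- list of half-open (start, end) intervals in cDNA coordinates
    trim       -- number of bases at the 3' end excluded from the computation
    """
    limit = cdna_length - trim
    if limit <= 0:
        return 0.0
    covered = 0
    last_end = 0
    for start, end in sorted(alignments):
        start = max(start, last_end)   # skip bases already counted
        end = min(end, limit)          # ignore the excluded distal region
        if end > start:
            covered += end - start
            last_end = end
    return 100.0 * covered / limit


# Hypothetical 1,200-base cDNA with two overlapping alignments and one distal one.
print(aligned_percent(1200, [(0, 500), (400, 800), (900, 1200)]))
```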

Section: Broad genomic landscape (p. 875)

page 876, right-hand column, line 9: “In addition, the human cytogenetic map ...”

    The locations of the cytogenetically mapped clones on the draft genome sequence

    can be viewed at http://genome.ucsc.edu/goldenPath/mapPlots . Further information about the

    individual clones can be obtained at http://www.ncbi.nlm.nih.gov/genome/cyto/ and

    http://www.ncbi.nlm.nih.gov/genome/guide. Here, as well as on the browser at

    http://genome.ucsc.edu and http://www.ensembl.org/ , they can be viewed in the context of other

    genome annotation.

     Subsection: Long-range variation in GC content (p. 876)

    Page 877, left-hand column, line 30 “About three-quarters of the genome-wide variance… consistent with a homogeneous distribution”

    All 3,312 windows of length 300 kb that had at least eight gap-free 20 kb

    subwindows and did not contain more than 50% simple repeats were extracted from

    the draft genome sequence. The average sample variance of the GC content of the

    subwindows of a window was 7.3%. The sample variance of all subwindows

    genome-wide (N = 36,562) was 27.4%. Hence, the variance of GC content within

    the 20 kb subwindows of a 300 kb window accounts for approximately one quarter of

    the overall variance of the GC content among all 20 kb subwindows in this sample.

    The average sample standard deviation of the GC content of the subwindows of a

    window was 2.4%.

Page 877, left-hand column, line 34: “In fact, the hypothesis … draft genome sequence.”

    For each of the 3,312 windows of length 300 kb, we tested the hypothesis that its 20

    kb subwindows were sampled from a homogeneous GC distribution. The distribution

    was defined to have mean m equal to the GC-content in the combined subwindows

    of the 300 kb window, and the bases were taken as independent. Under this

    distribution, the GC-content of a 20 kb subwindow would have mean m and variance s² = m(100-m)/20000. For m = 41%, the typical value, this gives s² = 0.121%, which

    is about 0.017 times the average sample variance of 7.3%. For each window, the variance s² and the sample variance ŝ² were determined, along with the value χ² = (n-1) ŝ²/s², where n is the number of subwindows of the window. Under the hypothesis of homogeneity, the statistic χ² should have an approximately chi-square

    distribution with n-1 degrees of freedom. However, for every one of the 3,312 windows, χ² > 31.5, which rejects the hypothesis of homogeneity with p-value >>

    0.995.
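
    The test statistic for one window can be sketched as follows: the expected variance under independent bases is m(100-m)/20000, and the statistic is (n-1) times the sample variance divided by that expected variance. The GC values below are hypothetical, and the sketch assumes equal-sized, gap-free subwindows so that the mean of the subwindow values equals the GC content of the combined window.

```python
from statistics import variance


def expected_variance(m, subwindow_bp=20_000):
    """Variance of GC percentage in a subwindow if bases were independent."""
    return m * (100 - m) / subwindow_bp


def homogeneity_statistic(gc_percent, subwindow_bp=20_000):
    """(n - 1) * sample variance / expected variance for one window."""
    n = len(gc_percent)
    m = sum(gc_percent) / n
    return (n - 1) * variance(gc_percent) / expected_variance(m, subwindow_bp)


# Hypothetical GC percentages of the 15 subwindows of one 300 kb window.
gc = [38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 41.0,
      40.0, 39.0, 42.0, 43.0, 41.0, 40.0, 42.0]
stat = homogeneity_statistic(gc)
print(f"statistic = {stat:.1f}; exceeds 31.5: {stat > 31.5}")
```

    Even the modest spread in this made-up window (a sample variance of about 2.9) gives a statistic far above the 31.5 cutoff, because the variance expected for independent bases is so small.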

    Another way to test the hypothesis of homogeneity is to look in each 300 kb window for one 20 kb subwindow whose GC content differs significantly from the mean m for that window. In these tests, all 300 kb windows with less than 50% simple repeats and less than 25% gaps were tested (N = 10,596). Under the assumptions above, if X is the GC content of a subwindow, then D = (X-m)/sqrt[m(100-m)/20000] should have an approximately standard normal distribution. However, in all but four windows there is a subwindow with |D| > 3.0, i.e. the GC content of the subwindow is more than 3.0 standard deviations from the mean of the window. The p-value for such a deviation is 0.0026. Considering that there are 15 possible subwindows, this gives an overall p-value of 0.039, i.e. the hypothesis of homogeneity is rejected with a p-value greater than 0.96.
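
    The single-subwindow test can be sketched in the same way. The code computes the largest |D| in a window, the two-sided normal tail probability for a deviation of 3.0, and a bound for 15 subwindows obtained by multiplying by the number of tests (a Bonferroni-style correction, which is an assumption about how the overall value was derived). The GC values are hypothetical.

```python
import math


def max_deviation(gc_percent, m, subwindow_bp=20_000):
    """Largest |D| = |X - m| / sqrt(m(100 - m) / subwindow_bp) in a window."""
    sd = math.sqrt(m * (100 - m) / subwindow_bp)
    return max(abs(x - m) / sd for x in gc_percent)


def two_sided_p(d):
    """Probability that a standard normal variable exceeds |d| in absolute value."""
    return math.erfc(abs(d) / math.sqrt(2))


def overall_p(p_single, n_tests):
    """Upper bound on the chance that any of n_tests exceeds the cutoff."""
    return min(1.0, p_single * n_tests)


# Hypothetical window: 14 subwindows at the mean and one 1.5 points above it.
gc = [41.0] * 14 + [42.5]
print(f"max |D| = {max_deviation(gc, m=41.0):.2f}")
p3 = two_sided_p(3.0)
print(f"p(|D| > 3) = {p3:.4f}; bound for 15 subwindows = {overall_p(p3, 15):.3f}")
```

    The exact normal tail for 3.0 standard deviations is close to 0.0027, so this bound comes out near 0.040, marginally above the 0.039 obtained from the rounded value of 0.0026; the conclusion is unchanged.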

    The above analysis was repeated using 5 kb subwindows of 300 kb windows, and the hypothesis of homogeneity was rejected for all windows with p-value greater than 0.96, and with greater confidence for those windows tested with the chi-square test. Similar results were also obtained for 5 kb subwindows of 100 kb windows: all but thirteen windows were rejected with p-value greater than approximately 0.95, and all but three were rejected from those examined with the chi-square test. Since any region of 200 kb must contain one of the regions of 100 kb we tested for homogeneity, this indicates that there are few if any regions of 200 kb in the genome with homogeneous GC content.

    Page 877, right-hand column, line 25: “Estimated band locations …”

    Bands were assigned by a dynamic programming algorithm that attempted to maximize the number of cytogenetically mapped clones that lie within the range of possible sub-bands predicted from FISH, with special emphasis on high-resolution FISH-mapped clones provided by investigators at the National Cancer Institute (ref. 103). The band positions were optimized subject to the constraint that the bands must appear in the known order along the draft genome sequence. Slight penalties for band size deviation from the standard fractional sizes were also imposed, so that in the absence of any FISH-mapped clones at all in a particular region, and given that there are no constraints from surrounding regions, the program would produce sub-bands corresponding to the standard fractional band lengths.

    Section: Repeat content of the human genome (p. 879)

     Subsection: Distribution of GC content (p. 884)

    Concerning the subdivision of the draft genome sequence into 50 kb pieces of similar GC level: the same results will be obtained however the sequence is subdivided, as long as the fragments are around 50 kb long. Specifically, however, for the analyses shown in Figures 22 to 26, the draft genome sequence was subdivided into fragments of 40-60 kb (averaging 50 kb) overlapping by 1 kb. These fragments were created on the fly by the RepeatMasker program, and for each a repeat analysis was done. The repeat information files were grouped by the GC level of the fragment, and processed according to need.

For the analyses shown in Figures 23 and 25, the number of repeat copies was

    compared. The number of individual insertions per megabase of DNA of a particular

    GC level was extracted from the RepeatMasker output (RepeatMasker provides

    information on which fragments originated from the same inserted transposable

    element). The Y axis is the ratio of the frequency of Alu (fig 23) or LINE1 (fig 25) over

    the average frequency of these elements in the genome.

    Subsection: Segmental Duplications (p. 889)

Our assessment of low copy repeats (genomic duplications) within the draft genome

    sequence involved a global analysis of all non-overlapping sequence. The analysis

    used a combination of DNA sequence analysis software and a suite of Perl scripts

    developed for paralogy detection (J. A. Bailey and E. E. Eichler, in preparation).

    The basic methodology included: repeatmasking (RepeatMasker v.4/20) of all

    reference sequences for common repeats, the removal and splicing of such repeat

    segments, global BLAST analysis of the segments for the identification of non-

    overlapping high-scoring segments, using relaxed affine gapping parameters which

    allowed large gaps up to 1 kb to be traversed (parameters: -G 180 E 1 q 80 r 30

    -z 3000000000 Y 3000000000 e 1e-10 F F), the reintroduction of common

    repeat elements into each pairwise alignment followed by optimal global alignment

    of the segments using the program ALIGN ( E.W. Myers and W. Miller, CABIOS

    (1989) 4:11-17). To detect internal duplications within each query segment, a

    modified version of BLASTZ (W. Miller, unpublished) was used with similar relaxed

    gap parameters (B=2 M=30 I=-80 V=-80 O=180 E=1 W=14 Y=1400). Alignment

    statistics were generated (program:ALIGN_SCORER), and alignments that equaled or

    exceeded the threshold of 1000 bases aligned with over 90% similarity (i.e. gaps

    excluded) were analyzed. Generation of global alignments also acted as a

    safeguard against false positives from BLAST analysis. In cases of extremely large

    gaps (>1 kb), alignments were fractured. Such cases were detected and merged for

    gaps up to 20 kb.

    Subsection: Pericentromeres and telomeres (p. 890)

Chromosome 22 (May 2000, Sanger Centre) and Chromosome 21 (Sept., NCBI)

    were analyzed for large duplications as described. For interchromosomal

    duplications, the chromosome was analyzed versus the NT accession contigs (NCBI)

    and versus all remaining HTGS accessions (draft and finished).

    A final global alignment threshold, >90%; >=1000 bases, was used.

    Due to unassembled allelic overlaps, sequences containing highly similar alignments

    (>99.5% NT; >99.0% HTGS) were excluded as probable allelic overlaps. The

    duplicated sequences for chromosome 21 and chromosome 22 were graphically

    viewed using the program PARASIGHT (J. A. Bailey and E.E. Eichler, in preparation).

    Subsection: Genome-wide analysis of segmental duplications. (p. 891)

Finished sequence included all assembled sequence from NCBI within the NT

    dataset (version of 5 September 2000). A global alignment threshold (>90%; ≥1000

    bases) was used for comparisons between finished sequence. Further selection limited alignments for analyses to those less than 99.5% identity, as those greater than that were likely to represent unassembled allelic overlaps.

    The 15 July 2000 version of the draft genome sequence was used as the basis for the duplication analysis of the entire human draft. A final global alignment threshold (>90%, ≥1000 bases and <98%) defined the limits of detection for duplicated sequence. Sequence alignments (>98%) appear to represent mainly missed allelic overlaps, many of which were subsequently merged in later releases of the assembly (e.g. 7 October 2000). Final validation of duplicated segments >98% within the

    working draft will require finished sequence data and/or experimental validation.

    Section: Gene content of the human genome (p. 892)

     Subsection: Noncoding RNAs (p. 892)

    To identify transfer RNA genes, we used tRNAscan-SE version 1.21 [T.M. Lowe, S.R. Eddy. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955-964 (1997)] to analyze the 7 October 2000 version of the draft genome sequence. tRNAscan-SE predicted 504 tRNA genes and 144 tRNA-derived pseudogenes. Three of the predicted genes had a non-canonical anticodon loop length, preventing tRNAscan-SE from unambiguously identifying the anticodon; although there are many possible explanations for them, for our current purposes we classified these as probable pseudogenes. After manual examination of the tRNAs with unlikely anticodons, four more of the predicted genes were also classified as probable pseudogenes: a putative UAA suppressor, a putative UAG suppressor, and two putative UGA-reading selenocysteine tRNAs. The remaining gene predictions were not examined manually. We know that a small number of the 497 "true" tRNA genes are likely to be pseudogenes or parts of tRNA-derived repetitive sequence elements because tRNAscan-SE's ability to separate pseudogenes from true genes is not perfect. Because tRNAscan-SE models tRNA consensus secondary structure, it is not a reliable detector of divergent tRNA pseudogenes. To more accurately estimate the number of tRNA-derived pseudogenes, all 648 sequences detected by tRNAscan-SE were used as WU-BLASTN queries (see below), and another 173 significantly related sequences were detected, bringing the estimated pseudogene count to 324.

    To identify all ncRNA homologues other than tRNA genes, we performed sequence similarity searches using WashU BLASTN 2.0MP (W. Gish, unpublished; http://blast.wustl.edu ) on the 7 October 2000 genome assembly, with parameters "-kap wordmask=seg B=50000 W=8" and the default DNA scoring matrix. True genes were operationally defined as BLAST hits with ≥95% identity over ≥95% of the length of the query. Related sequences (e.g. pseudogenes) were operationally defined as all other BLAST hits with P-values <= 0.001. To reconcile our tRNA gene count of 497 with the larger number of 1310 generally found in textbook references, we reexamined the primary data in a classic paper by Hatlen and Attardi (ref. 252). The

    textbook estimate of 1310 human tRNA genes was based on their observation that

purified and labelled human 4S RNA (i.e. the tRNA population) hybridizes to HeLa genomic DNA and saturates at a fraction of about 1.1x10^-5 of the genome. The molecular weight of the human genome was thought at that time to be 3.1x10^12

    (about 4.7 billion bases). Recalculation using the current estimated genome size of

    3.2 billion bases [T.R. Tiersch, R.W. Chandler, S.S. Wachtel, S. Elias. Reference

    standards for flow cytometry and application in comparative studies of nuclear DNA

    content. Cytometry 10, 706-710 (1989); this paper] gives an estimate of 890 tRNA-complementary loci instead of 1310. Hatlen and Attardi also noted, but at the time

    could not explain, a puzzling length heterogeneity in their hybridized genomic loci.

    We believe that they were observing the tRNA pseudogene population, many of

    which are truncated copies of tRNA genes; therefore we believe their hybridization-

    based estimate of ~890 loci included tRNA pseudogenes (of which we count 324 in

    the genome) in addition to the true tRNA genes (of which we count 497 in the

    genome).
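
    The arithmetic behind these counts can be checked with a few lines. The rescaling assumes that, for a fixed saturation fraction, the estimated number of tRNA-complementary loci is proportional to the assumed genome size; the variable names are illustrative only.

```python
# Rescale the textbook estimate from the old to the current genome size.
old_estimate = 1310      # loci, based on a genome of about 4.7 billion bases
old_genome_size = 4.7e9
new_genome_size = 3.2e9
rescaled = old_estimate * new_genome_size / old_genome_size

# Counts from the tRNAscan-SE and WU-BLASTN analyses described above.
true_genes = 504 - 3 - 4          # predicted genes minus reclassified ones
pseudogenes = 144 + 3 + 4 + 173   # predicted, reclassified and BLAST-detected

print(f"rescaled hybridization estimate: about {rescaled:.0f} loci")
print(f"true genes: {true_genes}, pseudogenes: {pseudogenes}, "
      f"total: {true_genes + pseudogenes}")
```

    The rescaled figure is about 892, which rounds to the ~890 loci quoted above, and the sequence-based total of 821 genes plus pseudogenes is of the same order.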

     Subsection: Protein-coding genes (p. 896)

     Sub-subsection: Exploring properties of known genes (p. 896)

Known genes were aligned with Spidey (S. Wheelan et al., manuscript in preparation)

    and Acembly (D. Thierry-Mieg and J. Thierry-Mieg, unpublished; http://www.acedb.org/ ),

    which in both cases align the cDNA to the genome while allowing for introns. The

    results from the two programs were in broad agreement. 5,364 RefSeq entries

    (from the 1 September 2000 release) were used as a source of the cDNAs. The

    alignments of the cDNAs to the genome could be classified by the proportion of the

    cDNA that aligned to the genome and by the percentage of identical nucleotides

    between the cDNA and the genomic sequence. In most cases, there was an

    unambiguous location for a cDNA. However, some proportion at each level of

    coverage had more than one site with high identity matches; in these cases, one of

    the locations was arbitrarily chosen.

     Sub-subsection: Towards a complete index of human genes (p. 898)

     Creating an initial gene index (p. 899)

Ensembl: Ensembl aims to predict coding sequences of true genes with high

    confidence, by only predicting coding sequence regions which have confirming

    evidence across their entire length. The sources of confirmation are cDNA, EST and

    protein-based similarity. The Genscan computer program was run across the

    individual fragments of the genome and the resulting peptides were used to search

    vertebrate mRNA sources (extracted from the EMBL databank;

    http://www.ebi.ac.uk/index.html), EST (vertebrate dbEST; ftp://ncbi.nlm.nih.gov/genbank ) and a

    non-redundant protein database (SWIR; http://www.ebi.ac.uk/swissprot/ ). Protein hits of

    greater than 200 bits similarity were then further processed by using the GeneWise

    program with the similar protein against the assembled draft genome sequence (the

    17 July 2000 version). A final gene-building method was then used to merge all the

    resulting information, being Genscan predictions with confirming similarity at a

    number of exons and the GeneWise gene predictions. The method only accepted a

    join between two exons if consistent similarity evidence was found on each exon with the following thresholds: (a) all GeneWise predictions were accepted, although redundant GeneWise predictions were discarded; and (b) for exons predicted by Genscan, a single protein or cDNA similarity of at least 100 bits or higher, or at least two EST hits of 100 bits or higher. This final process allows for alternative splicing, although modeling alternative splicing has not been optimised. Ensembl produced 35,500 gene predictions with 44,860 transcripts.

    Merge procedure to produce a final protein set: To generate a single protein set for further analysis we merged the known protein sequences from RefSeq (version of 29 Sept 2000), SWISSPROT (Release 39.6 of 30th Aug 2000), TREMBL (TrEMBL Release 14.17 of 1 Oct 2000) and TREMBL_NEW (1 Oct 2000) with the gene predictions. The later protein analysis required a non-redundant protein set where genes were represented as a single protein sequence; in the case of alternative splicing, a single, representative protein sequence was required. We are aware of the obvious limitations of this representation of the human proteome, but accommodating alternative splicing in the downstream analysis was very complex.

    The genome prediction data set was prepared as follows: the Ensembl and Genie predictions were merged by examining overlap of coding exons in genomic coordinates. Two gene predictions were merged if a single coding exon on the same strand overlapped. From this set of merged predictions, we used only the Ensembl+Genie and the Ensembl-only predictions. In cases where there was more than one prediction, or for Ensembl genes, more than one transcript, we chose the longest protein sequence from each merged unit to represent the gene. The protein level merge then occurred by comparing the union of all the data sources in an all-vs-all FASTA comparison using default parameters. Two protein sequences were merged if the match covered at least 95% of the shorter sequence, and identity was ≥ 95%, which takes into account both nearly identical protein sequences and also nearly identical fragments.
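
    The protein-level merge criterion can be expressed as a small predicate. This is a sketch only: the function and argument names are hypothetical, the identity cutoff is read as "at least 95%", and how match length and identity would be extracted from FASTA output is not shown.

```python
def should_merge(len_a, len_b, match_length, identity_percent,
                 min_cover=0.95, min_identity=95.0):
    """Return True if two proteins count as the same entry.

    The match must cover at least 95% of the shorter sequence and have
    at least 95% identity, so near-identical fragments are merged too.
    """
    shorter = min(len_a, len_b)
    return match_length >= min_cover * shorter and identity_percent >= min_identity


# Hypothetical 300-residue fragment matching a 1,000-residue protein.
print(should_merge(300, 1000, match_length=290, identity_percent=96.0))
print(should_merge(300, 1000, match_length=200, identity_percent=99.0))
```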

    Special attention was needed to prevent overrepresentation of alternative splice forms. Firstly we expanded the Swissprot and Trembl databases to represent known splice variants in the protein merge, but only took a single protein (the canonical database sequence) for the final protein set. An additional cull for alternative splice forms which remained as separate proteins was produced by taking the corresponding DNA sequences of the known proteins (RefSeq, SWISSPROT, TREMBL and TREMBL_NEW) and matching back to the genome using the SSAHA program without requiring a valid gene structure alignment. If the DNA derived from two protein sequences matched at over 28 base pairs at the same location, the longer protein sequence was used. Finally, clear bacterial contamination (proteins which had an almost identical match to a bacterial protein) was removed.

    Quality Control on the protein set: We took 31 genes which we could confirm as being unavailable at the time of the gene builds (22 from RefSeq, 9 from the Sanger Centre gene identification program on chromosome X). 3 of the 31 sequences could
