A novel algorithm for computational identification of contaminated EST libraries.
ABSTRACT: A key goal of the Human Genome Project was to understand the complete set of human proteins, the proteome. Since the genome sequence by itself is not sufficient for predicting new genes and alternative splicing events that lead to new proteins, expressed sequence tags (ESTs) are used as the primary tool for these purposes. The high prevalence of artifacts in dbEST, however, often leads to invalid predictions. Here we describe a novel method for recognizing genomic DNA contamination and other artifacts that cannot be identified using current EST cleaning techniques. Our method uses the alignment of the entire set of ESTs to the human genome to identify highly contaminated EST libraries. We discovered 53 highly contaminated libraries and a subset of 24 766 ESTs from these libraries that probably represent contamination with genomic DNA, pre-mRNA, and ESTs that span non-canonical introns. Although this is only a small fraction of the entire EST dataset, each contaminating sequence could create a spurious transcript prediction. Indeed, in the clustering and assembly tool that we used, these sequences would have caused incorrect inference of 9575 new splice variants and 6370 new genes. Conclusions based on EST analysis, including prediction of alternative splicing, should be re-evaluated in light of these results. Our method, along with the identified set of contaminated sequences, will be essential for applications that depend on large EST datasets.
Project description:Alternate polyadenylation affects a large fraction of higher eucaryote mRNAs, producing mature transcripts with 3' ends of variable length. This variation is poorly represented in the current transcript catalogs derived from whole genome sequences, mostly because such posttranscriptional events are not detectable directly at the DNA level. Alternate polyadenylation of an mRNA is better understood by comparison to EST databases. Comparing ESTs to mRNAs, however, is a difficult task subjected to the pitfalls of internal priming, presence of intron sequences, repeated elements, chimerical ESTs or matches with EST from paralogous genes. We present here a computer program that addresses these problems and displays ESTs matches to a query mRNA sequence to predict alternate polyadenylation and to suggest library-specific forms. The output highlights effective polyadenylation signals, possible sources of artifacts such as A-rich stretches in the mRNA sequences, and allows for a direct visualization of EST libraries using color codes. Statistical biases in the distribution of alternative mRNA forms among EST libraries were systematically sought. About 1450 human and 200 mouse mRNAs displayed such biases, suggesting in each case a tissue- or disease-specific regulation of polyadenylation.
Project description:The zebrafish is a powerful system for understanding the vertebrate genome, allowing the combination of genetic, molecular, and embryological analysis. Expressed sequence tags (ESTs) provide a rapid means of identifying an organism's genes for further analysis, but any EST project is limited by the availability of suitable libraries. Such cDNA libraries must be of high quality and provide a high rate of gene discovery. However, commonly used normalization and subtraction procedures tend to select for shorter, truncated, and internally primed inserts, seriously affecting library quality. An alternative procedure is to use oligonucleotide fingerprinting (OFP) to precluster clones before EST sequencing, thereby reducing the re-sequencing of common transcripts. Here, we describe the use of OFP to normalize and subtract 75,000 clones from two cDNA libraries, to a minimal set of 25,102 clones. We generated 25,788 ESTs (11,380 3' and 14,408 5') from over 16,000 of these clones. Clustering of 10,654 high-quality 3' ESTs from this set identified 7232 clusters (likely genes), corresponding to a 68% gene diversity rate, comparable to what has been reported for the best normalized human cDNA libraries, and indicating that the complete set of 25,102 clones contains as many as 17,000 genes. Yet, the library quality remains high. The complete set of 25,102 clones is available for researchers as glycerol stocks, filters sets, and as individual EST clones. These resources have been used for radiation hybrid, genetic, and physical mapping of the zebrafish genome, as well as positional cloning and candidate gene identification, molecular marker, and microarray development.
Project description:To build a foundation for the complete genome analysis of Bombyx mori, we have constructed an EST database. Because gene expression patterns deeply depend on tissues as well as developmental stages, we analyzed many cDNA libraries prepared from various tissues and different developmental stages to cover the entire set of Bombyx genes. So far, the Bombyx EST database contains 35,000 ESTs from 36 cDNA libraries, which are grouped into approximately 11,000 nonredundant ESTs with the average length of 1.25 kb. The comparison with FlyBase suggests that the present EST database, SilkBase, covers >55% of all genes of Bombyx. The fraction of library-specific ESTs in each cDNA library indicates that we have not yet reached saturation, showing the validity of our strategy for constructing an EST database to cover all genes. To tackle the coming saturation problem, we have checked two methods, subtraction and normalization, to increase coverage and decrease the number of housekeeping genes, resulting in a 5-11% increase of library-specific ESTs. The identification of a number of genes and comprehensive cloning of gene families have already emerged from the SilkBase search. Direct links of SilkBase with FlyBase and WormBase provide ready identification of candidate Lepidoptera-specific genes.
Project description:BACKGROUND: EST libraries are used in various biological studies, from microarray experiments to proteomic and genetic screens. These libraries usually contain many uncharacterized ESTs that are typically ignored since they cannot be mapped to known genes. Consequently, new discoveries are possibly overlooked. RESULTS: We describe a system (EST2Prot) that uses multiple elements to map EST sequences to their corresponding protein products. EST2Prot uses UniGene clusters, substring analysis, information about protein coding regions in existing DNA sequences and protein database searches to detect protein products related to a query EST sequence. Gene Ontology terms, Swiss-Prot keywords, and protein similarity data are used to map the ESTs to functional descriptors. CONCLUSION: EST2Prot extends and significantly enriches the popular UniGene mapping by utilizing multiple relations between known biological entities. It produces a mapping between ESTs and proteins in real-time through a simple web-interface. The system is part of the Biozon database and is accessible at http://biozon.org/tools/est/.
Project description:BACKGROUND: Alternative splicing contributes significantly to the complexity of the human transcriptome and proteome. Computational prediction of alternative splice isoforms are usually based on EST sequences that also allow to approximate the expression pattern of the related transcripts. However, the limited number of tissues represented in the EST data as well as the different cDNA construction protocols may influence the predictive capacity of ESTs to unravel tissue-specifically expressed transcripts. METHODS: We predict tissue and tumor specific splice isoforms based on the genomic mapping (SpliceNest) of the EST consensus sequences and library annotation provided in the GeneNest database. We further ascertain the potentially rare tissue specific transcripts as the ones represented only by ESTs derived from normalized libraries. A subset of the predicted tissue and tumor specific isoforms are then validated via RT-PCR experiments over a spectrum of 40 tissue types. RESULTS: Our strategy revealed 427 genes with at least one tissue specific transcript as well as 1120 genes showing tumor specific isoforms. While our experimental evaluation of computationally predicted tissue-specific isoforms revealed a high success rate in confirming the expression of these isoforms in the respective tissue, the strategy frequently failed to detect the expected restricted expression pattern. The analysis of putative lowly expressed transcripts using normalized cDNA libraries suggests that our ability to detect tissue-specific isoforms strongly depends on the expression level of the respective transcript as well as on the sensitivity of the experimental methods. Especially splice isoforms predicted to be disease-specific tend to represent transcripts that are expressed in a set of healthy tissues rather than novel isoforms. CONCLUSIONS: We propose to combine the computational prediction of alternative splice isoforms with experimental validation for efficient delineation of an accurate set of tissue-specific transcripts.
Project description:The crop expressed sequence tag database, CR-EST (http://pgrc.ipk-gatersleben.de/cr-est/), is a publicly available online resource providing access to sequence, classification, clustering and annotation data of crop EST projects. CR-EST currently holds more than 200,000 sequences derived from 41 cDNA libraries of four species: barley, wheat, pea and potato. The barley section comprises approximately one-third of all publicly available ESTs. CR-EST deploys an automatic EST preparation pipeline that includes the identification of chimeric clones in order to transparently display the data quality. Sequences are clustered in species-specific projects to currently generate a non-redundant set of approximately 22,600 consensus sequences and approximately 17,200 singletons, which form the basis of the provided set of unigenes. A web application allows the user to compute BLAST alignments of query sequences against the CR-EST database, query data from Gene Ontology and metabolic pathway annotations and query sequence similarities from stored BLAST results. CR-EST also features interactive JAVA-based tools, allowing the visualization of open reading frames and the explorative analysis of Gene Ontology mappings applied to ESTs.
Project description:BACKGROUND: Ustilago maydis is the basidiomycete fungus responsible for common smut of corn and is a model organism for the study of fungal phytopathogenesis. To aid in the annotation of the genome sequence of this organism, several expressed sequence tag (EST) libraries were generated from a variety of U. maydis cell types. In addition to utility in the context of gene identification and structure annotation, the ESTs were analyzed to identify differentially abundant transcripts and to detect evidence of alternative splicing and anti-sense transcription. RESULTS: Four cDNA libraries were constructed using RNA isolated from U. maydis diploid teliospores (U. maydis strains 518 x 521) and haploid cells of strain 521 grown under nutrient rich, carbon starved, and nitrogen starved conditions. Using the genome sequence as a scaffold, the 15,901 ESTs were assembled into 6,101 contiguous expressed sequences (contigs); among these, 5,482 corresponded to predicted genes in the MUMDB (MIPS Ustilago maydis database), while 619 aligned to regions of the genome not yet designated as genes in MUMDB. A comparison of EST abundance identified numerous genes that may be regulated in a cell type or starvation-specific manner. The transcriptional response to nitrogen starvation was assessed using RT-qPCR. The results of this suggest that there may be cross-talk between the nitrogen and carbon signalling pathways in U. maydis. Bioinformatic analysis identified numerous examples of alternative splicing and anti-sense transcription. While intron retention was the predominant form of alternative splicing in U. maydis, other varieties were also evident (e.g. exon skipping). Selected instances of both alternative splicing and anti-sense transcription were independently confirmed using RT-PCR. CONCLUSION: Through this work: 1) substantial sequence information has been provided for U. maydis genome annotation; 2) new genes were identified through the discovery of 619 contigs that had previously escaped annotation; 3) evidence is provided that suggests the regulation of nitrogen metabolism in U. maydis differs from that of other model fungi, and 4) Alternative splicing and anti-sense transcription were identified in U. maydis and, amid similar observations in other basidiomycetes, this suggests these phenomena may be widespread in this group of fungi. These advances emphasize the importance of EST analysis in genome annotation.
Project description:BACKGROUND: Within the framework of a genomics project on livestock species (AGENAE), we initiated a high-throughput DNA sequencing program of Expressed Sequence Tags (ESTs) in rainbow trout, Oncorhynchus mykiss. RESULTS: We constructed three cDNA libraries including one highly complex pooled-tissue library. These libraries were normalized and subtracted to reduce clone redundancy. ESTs sequences were produced, and 96,472 ESTs corresponding to high quality sequence reads were released on the international database, currently representing 42.5% of the overall sequence knowledge in this species. All these EST sequences and other publicly available ESTs in rainbow trout have been included on a publicly available Website (SIGENAE) and have been clustered into a total of 52,930 clusters of putative transcripts groups, including 24,616 singletons. 57.1% of these 52,930 clusters are represented by at least one Agenae EST and 14,343 clusters (27.1%) are only composed by Agenae ESTs. Sequence analysis also reveals that normalization and especially subtraction were effective in decreasing redundancy, and that the pooled-tissue library was representative of the initial tissue complexity. CONCLUSION: Due to present work on the construction of rainbow trout normalized cDNA libraries and their extensive sequencing, along with other large scale sequencing programs, rainbow trout is now one of the major fish models in term of EST sequences available in a public database, just after Zebrafish, Danio rerio. This information is now used for the selection of a non redundant set of clones for producing DNA micro-arrays in order to examine global gene expression.
Project description:BACKGROUND: EST sequencing is one of the most efficient means for gene discovery and molecular marker development, and can be additionally utilized in both comparative genome analysis and evaluation of gene duplications. While much progress has been made in catfish genomics, large-scale EST resources have been lacking. The objectives of this project were to construct primary cDNA libraries, to conduct initial EST sequencing to generate catfish EST resources, and to obtain baseline information about highly expressed genes in various catfish organs to provide a guide for the production of normalized and subtracted cDNA libraries for large-scale transcriptome analysis in catfish. RESULTS: A total of 17 cDNA libraries were constructed including 12 from channel catfish (Ictalurus punctatus) and 5 from blue catfish (I. furcatus). A total of 31,215 ESTs, with average length of 778 bp, were generated including 20,451 from the channel catfish and 10,764 from blue catfish. Cluster analysis indicated that 73% of channel catfish and 67% of blue catfish ESTs were unique within the project. Over 53% and 50% of the channel catfish and blue catfish ESTs, respectively, had significant similarities to known genes. All ESTs have been deposited in GenBank. Evaluation of the catfish EST resources demonstrated their potential for molecular marker development, comparative genome analysis, and evaluation of ancient and recent gene duplications. Subtraction of abundantly expressed genes in a variety of catfish tissues, identified here, will allow the production of low-redundancy libraries for in-depth sequencing. CONCLUSION: The sequencing of 31,215 ESTs from channel catfish and blue catfish has significantly increased the EST resources in catfish. The EST resources should provide the potential for microarray development, polymorphic marker identification, mapping, and comparative genome analysis.
Project description:BACKGROUND: Anthracnose (Colletotrichum gloeosporioides) is a major limiting factor in the production of yam (Dioscorea spp.) worldwide. Availability of high quality sequence information is necessary for designing molecular markers associated with resistance. However, very limited sequence information pertaining to yam is available at public genome databases. Therefore, this collaborative project was developed for genetic improvement and germplasm characterization of yams using molecular markers. The current investigation is focused on studying gene expression, by large scale generation of ESTs, from one susceptible (TDa 95-0310) and two resistant yam genotypes (TDa 87-01091, TDa 95-0328) challenged with the fungus. Total RNA was isolated from young leaves of resistant and susceptible genotypes and cDNA libraries were sequenced using Roche 454 technology. RESULTS: A total of 44,757 EST sequences were generated from the cDNA libraries of the resistant and susceptible genotypes. Greater than 56% of ESTs were annotated using MapMan Mercator tool and Blast2GO search tools. Gene annotations were used to characterize the transcriptome in yam and also perform a differential gene expression analysis between the resistant and susceptible EST datasets. Mining for SSRs in the ESTs revealed 1702 unique sequences containing SSRs and 1705 SSR markers were designed using those sequences. CONCLUSION: We have developed a comprehensive annotated transcriptome data set in yam to enrich the EST information in public databases. cDNA libraries were constructed from anthracnose fungus challenged leaf tissues for transcriptome characterization, and differential gene expression analysis. Thus, it helped in identifying unique transcripts in each library for disease resistance. These EST resources provide the basis for future microarray development, marker validation, genetic linkage mapping and QTL analysis in Dioscorea species.