Annotated expressed sequence tags (ESTs) from pre-smolt Atlantic salmon (Salmo salar) in a searchable data resource.
ABSTRACT: BACKGROUND: To identify as many different transcripts/genes in the Atlantic salmon genome as possible, it is crucial to acquire good cDNA libraries from different tissues and developmental stages, obtain their sequences (ESTs or full-length sequences), and attempt to predict their function. Such libraries allow identification of a large number of different transcripts and can provide valuable information on genes expressed in a particular tissue at a specific developmental stage. This data is important in constructing a microarray chip, identifying SNPs in coding regions, and for future identification of genes in the whole genome sequence. An important factor that determines the usefulness of generated data for biologists is efficient data access. Public searchable databases play a crucial role in providing such service. DESCRIPTION: Twenty-three Atlantic salmon cDNA libraries were constructed from 15 tissues, yielding nearly 155,000 clones. From these libraries 58,109 ESTs were generated, of which 57,212 were used for contig assembly. Following deletion of mitochondrial sequences 55,118 EST sequences were submitted to GenBank. In all, 20,019 unique sequences, consisting of 6,424 contigs and 13,595 singlets, were generated. The Norwegian Salmon Genome Project Database has been constructed and annotation performed by the annotation transfer approach. Annotation was successful for 50.3% (10,075) of the sequences and 6,113 sequences (30.5%) were annotated with Gene Ontology terms for molecular function, biological process and cellular component. CONCLUSION: We describe the construction of cDNA libraries from juvenile/pre-smolt Atlantic salmon (Salmo salar), EST sequencing, clustering, and annotation by assigning putative function to the transcripts. These sequences represent 97% of all sequences submitted to GenBank from the pre-smoltification stage. The data has been grouped into datasets according to its source and type of annotation.
Various data query options are offered including searches on function assignments and Gene Ontology terms. Data delivery options include summaries for the datasets and their annotations, detailed self-explanatory annotations, and access to the original BLAST results and Gene Ontology annotation trees. Potential presence of a relatively high number of immune-related genes in the dataset was shown by annotation searches.
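The counts quoted in the abstract above can be cross-checked with a few lines of arithmetic. The following is only an illustrative sketch: the variable names and the `percent` helper are introduced here for the check and are not part of the database or its tooling.

```python
# Figures quoted in the abstract above.
contigs = 6_424
singlets = 13_595
unique_sequences = 20_019
annotated = 10_075
go_annotated = 6_113


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Contigs plus singlets should account for every unique sequence.
assert contigs + singlets == unique_sequences

annotated_pct = percent(annotated, unique_sequences)
go_pct = percent(go_annotated, unique_sequences)
print(annotated_pct, go_pct)
```

Both percentages are computed against the 20,019 unique sequences, which is the denominator the abstract appears to use.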
Project description:We report 80,388 ESTs from 23 Atlantic salmon (Salmo salar) cDNA libraries (61,819 ESTs), 6 rainbow trout (Oncorhynchus mykiss) cDNA libraries (14,544 ESTs), 2 chinook salmon (Oncorhynchus tshawytscha) cDNA libraries (1317 ESTs), 2 sockeye salmon (Oncorhynchus nerka) cDNA libraries (1243 ESTs), and 2 lake whitefish (Coregonus clupeaformis) cDNA libraries (1465 ESTs). The majority of these are 3' sequences, allowing discrimination between paralogs arising from a recent genome duplication in the salmonid lineage. Sequence assembly reveals 28,710 different S. salar, 8981 O. mykiss, 1085 O. tshawytscha, 520 O. nerka, and 1176 C. clupeaformis putative transcripts. We annotate the submitted portion of our EST database by molecular function. Higher- and lower-molecular-weight fractions of libraries are shown to contain distinct gene sets, and higher rates of gene discovery are associated with higher-molecular-weight libraries. Pyloric caecum library group annotations indicate this organ may function in redox control and as a barrier against systemic uptake of xenobiotics. A microarray is described, containing 7356 salmonid elements representing 3557 different cDNAs. Analyses of cross-species hybridizations to this cDNA microarray indicate that this resource may be used for studies involving all salmonids.
Project description:BACKGROUND: Sequencing of the Atlantic salmon genome is now being planned by an international research consortium. Full-length sequenced inserts from cDNAs (FLIcs) are an important tool for correct annotation and clustering of the genomic sequence in any species. The large amount of highly similar duplicate sequences caused by the relatively recent genome duplication in the salmonid ancestor represents a particular challenge for the genome project. FLIcs will therefore be an extremely useful resource for the Atlantic salmon sequencing project. In addition to helping distinguish between duplicate genome regions and determine correct gene structures, FLIcs are an important resource for functional genomic studies and for investigation of regulatory elements controlling gene expression. In contrast to the large number of ESTs available, including the ESTs from 23 developmental and tissue specific cDNA libraries contributed by the Salmon Genome Project (SGP), the number of sequences for which the full length of the cDNA insert has been determined has been small. RESULTS: High quality full-length insert sequences from 560 pre-smolt white muscle tissue specific cDNAs were generated, accession numbers [GenBank: BT043497 - BT044056]. Five hundred and ten (91%) of the transcripts were annotated using Gene Ontology (GO) terms and 440 of the FLIcs are likely to contain a complete coding sequence (cCDS). The sequence information was used to identify putative paralogs, characterize salmon Kozak motifs, polyadenylation signal variation and to identify motifs likely to be involved in the regulation of particular genes. Finally, conserved 7-mers in the 3'UTRs were identified, of which some were identical to miRNA target sequences. CONCLUSION: This paper describes the first Atlantic salmon FLIcs from a tissue and developmental stage specific cDNA library. We have demonstrated that many FLIcs contained a complete coding sequence (cCDS).
This suggests that the remaining cDNA libraries generated by SGP represent a valuable cCDS FLIc source. The conservation of 7-mers in 3'UTRs indicates that these motifs are functionally important. Identity between some of these 7-mers and miRNA target sequences suggests that they are miRNA targets in Salmo salar transcripts as well.
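The idea of a 7-mer shared across 3'UTRs can be illustrated with a minimal sketch. The abstract does not say how conservation was assessed, so the criterion used here (a 7-mer present in at least `min_sequences` different sequences) is an assumption, as are the function name `kmers_shared_across` and the toy sequences.

```python
from collections import Counter


def kmers_shared_across(utrs, k=7, min_sequences=2):
    """Map each k-mer to the number of sequences it occurs in, keeping
    only k-mers found in at least `min_sequences` different sequences."""
    presence = Counter()
    for seq in utrs:
        seq = seq.upper()
        # Use a set so that each k-mer counts once per sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        presence.update(seen)
    return {kmer: n for kmer, n in presence.items() if n >= min_sequences}


# Hypothetical toy 3'UTR fragments, not taken from the FLIc dataset.
toy_utrs = [
    "AAGCACTTTAGGC",
    "CCGCACTTTATTT",
    "GGGTTTAAACCCG",
]
shared = kmers_shared_across(toy_utrs)
print(shared)
```

Counting presence per sequence rather than total occurrences keeps a single repeat-rich UTR from dominating the result. A real analysis would also need a background model to judge whether a shared 7-mer is more frequent than expected by chance.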
Project description:BACKGROUND: Salmonids are of interest because of their relatively recent genome duplication, and their extensive use in wild fisheries and aquaculture. A comprehensive gene list and a comparison of genes in some of the different species provide valuable genomic information for one of the most widely studied groups of fish. RESULTS: 298,304 expressed sequence tags (ESTs) from Atlantic salmon (69% of the total), 11,664 chinook, 10,813 sockeye, 10,051 brook trout, 10,975 grayling, 8,630 lake whitefish, and 3,624 northern pike ESTs were obtained in this study and have been deposited into the public databases. Contigs were built and putative full-length Atlantic salmon clones have been identified. A database containing ESTs, assemblies, consensus sequences, open reading frames, gene predictions and putative annotation is available. The overall similarity between Atlantic salmon ESTs and those of rainbow trout, chinook, sockeye, brook trout, grayling, lake whitefish, northern pike and rainbow smelt is 93.4, 94.2, 94.6, 94.4, 92.5, 91.7, 89.6, and 86.2% respectively. An analysis of 78 transcript sets shows Salmo as a sister group to Oncorhynchus and Salvelinus within Salmoninae, and Thymallinae as a sister group to Salmoninae and Coregoninae within Salmonidae. Extensive gene duplication is consistent with a genome duplication in the common ancestor of salmonids. Using all of the available EST data, a new expanded salmonid cDNA microarray of 32,000 features was created. Cross-species hybridizations to this cDNA microarray indicate that this resource will be useful for studies of all 68 salmonid species. CONCLUSION: An extensive collection and analysis of salmonid RNA putative transcripts indicate that Pacific salmon, Atlantic salmon and charr are 94-96% similar while the more distant whitefish, grayling, pike and smelt are 93, 92, 89 and 86% similar to salmon.
The salmonid transcriptome reveals a complex history of gene duplication that is consistent with an ancestral salmonid genome duplication hypothesis. Genome resources, including a new 32 K microarray, provide valuable new tools to study salmonids.
Project description:BACKGROUND: The Atlantic salmon is an important aquaculture species and a very interesting species biologically, since it spawns in fresh water and develops through several stages before becoming a smolt, the stage at which it migrates to the sea to feed. The dramatic change of habitat requires physiological, morphological and behavioural changes to prepare the salmon for its new environment. These changes are called the parr-smolt transformation or smoltification, and pre-adapt the salmon for survival and growth in the marine environment. The development of hypo-osmotic regulatory ability plays an important part in facilitating the transition from rivers to the sea. The physiological mechanisms behind the developmental changes are largely unknown. An understanding of the transformation process will be vital to the future of the aquaculture industry. Knowledge of which genes are expressed prior to the smoltification process is an important basis for further studies. RESULTS: In all, 2974 unique sequences, consisting of 779 contigs and 2195 singlets, were generated for Atlantic salmon from two cDNA libraries constructed from the gills and the intestine, accession numbers [GenBank: CK877169-CK879929, CK884015-CK886537 and CN181112-CN181464]. Nearly 50% of the sequences were assigned putative functions because they showed similarity to known genes, mostly from other species, in one or more of the databases used. The Swiss-Prot database returned significant hits for 1005 sequences. These could be assigned predicted gene products, and 967 were annotated using Gene Ontology (GO) terms for molecular function, biological process and/or cellular component, employing an annotation transfer procedure. CONCLUSION: This paper describes the construction of two cDNA libraries from pre-smolt Atlantic salmon (Salmo salar) and the subsequent EST sequencing, clustering and assigning of putative function to 1005 genes expressed in the gills and/or intestine.
Project description:The crop expressed sequence tag database, CR-EST (http://pgrc.ipk-gatersleben.de/cr-est/), is a publicly available online resource providing access to sequence, classification, clustering and annotation data of crop EST projects. CR-EST currently holds more than 200,000 sequences derived from 41 cDNA libraries of four species: barley, wheat, pea and potato. The barley section comprises approximately one-third of all publicly available ESTs. CR-EST deploys an automatic EST preparation pipeline that includes the identification of chimeric clones in order to transparently display the data quality. Sequences are clustered in species-specific projects to currently generate a non-redundant set of approximately 22,600 consensus sequences and approximately 17,200 singletons, which form the basis of the provided set of unigenes. A web application allows the user to compute BLAST alignments of query sequences against the CR-EST database, query data from Gene Ontology and metabolic pathway annotations and query sequence similarities from stored BLAST results. CR-EST also features interactive JAVA-based tools, allowing the visualization of open reading frames and the explorative analysis of Gene Ontology mappings applied to ESTs.
Project description:The entomopathogenic nematode Heterorhabditis bacteriophora and its symbiotic bacterium, Photorhabdus luminescens, are important biological control agents of insect pests. This nematode-bacterium-insect association represents an emerging tripartite model for research on mutualistic and parasitic symbioses. Elucidation of mechanisms underlying these biological processes may serve as a foundation for improving the biological control potential of the nematode-bacterium complex. This large-scale expressed sequence tag (EST) analysis effort enables gene discovery and development of microsatellite markers. These ESTs will also aid in the annotation of the upcoming complete genome sequence of H. bacteriophora. A total of 31,485 high quality ESTs were generated from cDNA libraries of the adult H. bacteriophora TTO1 strain. Cluster analysis revealed the presence of 3,051 contigs and 7,835 singletons, representing 10,886 distinct EST sequences. About 72% of the distinct EST sequences had significant matches (E value < 1e-5) to proteins in GenBank's non-redundant (nr) and Wormpep190 databases. We have identified 12 ESTs corresponding to 8 genes potentially involved in RNA interference, 22 ESTs corresponding to 14 genes potentially involved in dauer-related processes, and 51 ESTs corresponding to 27 genes potentially involved in defense and stress responses. Comparison to ESTs and proteins of free-living nematodes led to the identification of 554 parasitic nematode-specific ESTs in H. bacteriophora, among which are those encoding F-box-like/WD-repeat protein theromacin, Bax inhibitor-1-like protein, and PAZ domain containing protein. Gene Ontology terms were assigned to 6,685 of the 10,886 ESTs. A total of 168 microsatellite loci were identified with primers designable for 141 loci. A total of 10,886 distinct EST sequences were identified from adult H. bacteriophora cDNA libraries.
BLAST searches revealed ESTs potentially involved in parasitism, RNA interference, defense responses, stress responses, and dauer-related processes. The putative microsatellite markers identified in H. bacteriophora ESTs will enable genetic mapping and population genetic studies. These genomic resources provide the material base necessary for genome annotation, microarray development, and in-depth gene functional analysis.
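Microsatellite (simple sequence repeat) loci of the kind mentioned above are perfect tandem repeats of a short unit, so a regular expression is enough for a first-pass scan. The sketch below is not the method used in the study. The thresholds (unit length 2 to 6 bp, at least 5 repeats), the exclusion of single-base runs, and the function name `find_microsatellites` are all assumptions chosen for illustration.

```python
import re


def find_microsatellites(seq, min_unit=2, max_unit=6, min_repeats=5):
    """Return (start, unit, repeat_count) for perfect tandem repeats.

    The lazy quantifier makes the regex try the shortest repeat unit first,
    so (CA)8 is reported with unit "CA" rather than "CACA".
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        # Skip single-base runs such as poly(A) tails, which would
        # otherwise match with a unit like "AA".
        if len(set(unit)) == 1:
            continue
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits


# Hypothetical toy sequence: a (CA)8 repeat flanked by unrelated bases.
toy_est = "TTG" + "CA" * 8 + "GGT"
print(find_microsatellites(toy_est))
```

Start positions are 0-based. Primer design around each hit, which the abstract reports for 141 of the 168 loci, would be a separate step.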
Project description:BACKGROUND: The Atlantic cod is an ecologically and economically important North Atlantic fish species and also an emerging aquaculture species. To study gene expression in Atlantic cod embryonic stem (ES) cells, our goal was to generate and analyze expressed sequence tags (ESTs) from an ES cell cDNA library of mRNA consisting of approximately 3,900 ESTs. RESULTS: We sequenced 3,935 EST clones using a directional cDNA library made from pooled ES cells harvested at the blastula stage. Quality filtering of these ESTs allowed identification of 2,719 high-quality sequences with an average length of 442 bp containing 368 contigs and 1,276 singletons (1,644 unique sequences). BLASTX searches produced 889 significant (E-value < 10^-3) hits, of which 698 (42.5%) were annotated with Gene Ontology terms (E-value < 10^-6). The number of unknown unique sequences was 946 (57.5%). All the high-quality EST sequences have been deposited in GenBank (GenBank: 2,719 sequences in UniGene library dbEST id: 22,021). Gene discovery and annotations are presented and discussed. CONCLUSION: This set of ESTs represents one of the first attempts to describe mRNA in ES cells from a marine cold-water fish species, and provides a basis for gene expression studies of Atlantic cod ES cells.
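The two significance cut-offs in this abstract (E-value below 10^-3 for a hit, below 10^-6 for Gene Ontology annotation) amount to filtering BLAST results by E-value. The sketch below assumes 12-column tab-separated BLAST output with the query ID in column 1 and the E-value in column 11. The abstract does not state the output format used in the study, and the function name `significant_queries` and the example records are hypothetical.

```python
import io


def significant_queries(blast_tab, max_evalue=1e-3):
    """Return the set of query IDs with at least one hit below max_evalue.

    Assumes 12-column tabular BLAST output in which column 1 is the
    query ID and column 11 is the E-value.
    """
    hits = set()
    for line in blast_tab:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if float(fields[10]) < max_evalue:
            hits.add(fields[0])
    return hits


# Hypothetical records for illustration only.
example = io.StringIO(
    "est1\tP1\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-20\t150\n"
    "est2\tP2\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t30\n"
    "est3\tP3\t60.0\t90\t36\t0\t1\t90\t1\t90\t2e-05\t60\n"
)
at_1e3 = significant_queries(example, 1e-3)
example.seek(0)
at_1e6 = significant_queries(example, 1e-6)
print(sorted(at_1e3), sorted(at_1e6))
```

With these toy records, `est3` passes the looser threshold but not the stricter one, which mirrors how fewer sequences reach the annotation step than produce a significant hit.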
Project description:BACKGROUND: Poor reproductive maturation of the black tiger shrimp (Penaeus monodon) in captivity is one of the serious threats to sustainability of the shrimp farming industry. Understanding molecular mechanisms governing reproductive maturation processes requires the fundamental knowledge of integrated expression profiles in gonads of this economically important species. In P. monodon, a non-model species for which the genome sequence is not available, expressed sequence tag (EST) and cDNA microarray analyses can help reveal important transcripts relevant to reproduction and facilitate functional characterization of transcripts with important roles in male reproductive development and maturation. RESULTS: In this study, a conventional testis EST library was exploited to reveal novel transcripts. A total of 4,803 ESTs were unidirectionally sequenced and analyzed in silico using a customizable data analysis package, ESTplus. After sequence assembly, 2,702 unique sequences comprised of 424 contigs and 2,278 singletons were identified; of these, 1,133 sequences are homologous to genes with known functions. The sequences were further characterized according to gene ontology categories (41% biological process, 24% molecular function, 35% cellular component). Through comparison with EST libraries of other tissues of P. monodon, 1,579 transcripts found only in the testis cDNA library were identified. A total of 621 ESTs have not been identified in penaeid shrimp. Furthermore, cDNA microarray analysis revealed several ESTs homologous to testis-relevant genes were more preferentially expressed in testis than in ovary. Representatives of these transcripts, homologs of saposin (PmSap) and Dmc1 (PmDmc1), were further characterized by RACE-PCR. The more abundant expression levels in testis than ovary of PmSap and PmDmc1 were verified by quantitative real-time PCR in juveniles and wild broodstock of P. monodon. 
CONCLUSIONS: Without a genome sequence, a combination of EST analysis and high-throughput cDNA microarray technology can be a useful integrated tool as an initial step towards the identification of transcripts with important biological functions. Identification and expression analysis of saposin and Dmc1 homologs demonstrate the power of these methods for characterizing functionally important genes in P. monodon.
Project description:BACKGROUND: Single-pass, partial sequencing of complementary DNA (cDNA) libraries generates thousands of chromatograms that are processed into high quality expressed sequence tags (ESTs), and then assembled into contigs representative of putative genes. Usually, to be of value, ESTs and contigs must be associated with meaningful annotations, and made available to end-users. RESULTS: A web application, Expressed Sequence Tag Information Management and Annotation (ESTIMA), has been created to meet the EST annotation and data management requirements of multiple high-throughput EST sequencing projects. It is anchored on individual ESTs and organized around different properties of ESTs including chromatograms, base-calling quality scores, structure of assembled transcripts, and multiple sources of comparison to infer functional annotation, Gene Ontology associations, and cDNA library information. ESTIMA consists of a relational database schema and a set of interactive query interfaces. These are integrated with a suite of web-based tools that allow a user to query and retrieve information. Further, query results are interconnected among the various EST properties. ESTIMA has several unique features. Users may run their own EST processing pipeline, search against preferred reference genomes, and use any clustering and assembly algorithm. The ESTIMA database schema is very flexible and accepts output from any EST processing and assembly pipeline. ESTIMA has been used for the management of EST projects of many species, including honeybee (Apis mellifera), cattle (Bos taurus), songbird (Taeniopygia guttata), corn rootworm (Diabrotica virgifera), catfish (Ictalurus punctatus, Ictalurus furcatus), and apple (Malus x domestica). The entire resource may be downloaded and used as is, or readily adapted to fit the unique needs of other cDNA sequencing projects.
CONCLUSIONS: The scripts used to create the ESTIMA interface are freely available to academic users in an archived format from http://titan.biotec.uiuc.edu/ESTIMA/. The entity-relationship (E-R) diagrams and the programs used to generate the Oracle database tables are also available. We have also provided detailed installation instructions and a tutorial at the same website. Presently the chromatograms, EST databases and their annotations have been made available for cattle and honeybee brain EST projects. Non-academic users need to contact the W.M. Keck Center for Functional and Comparative Genomics, University of Illinois at Urbana-Champaign, Urbana, IL, for licensing information.
Project description:To contribute to our understanding of the genome complexity of sugarcane, we undertook a large-scale expressed sequence tag (EST) program. More than 260,000 cDNA clones were partially sequenced from 26 standard cDNA libraries generated from different sugarcane tissues. After the processing of the sequences, 237,954 high-quality ESTs were identified. These ESTs were assembled into 43,141 putative transcripts. Of the assembled sequences, 35.6% presented no matches with existing sequences in public databases. A global analysis of the whole SUCEST data set indicated that 14,409 assembled sequences (33% of the total) contained at least one cDNA clone with a full-length insert. Annotation of the 43,141 assembled sequences associated almost 50% of the putative identified sugarcane genes with protein metabolism, cellular communication/signal transduction, bioenergetics, and stress responses. Inspection of the translated assembled sequences for conserved protein domains revealed 40,821 amino acid sequences with 1415 Pfam domains. Reassembling the consensus sequences of the 43,141 transcripts revealed a 22% redundancy in the first assembly. This indicated that possibly 33,620 unique genes had been identified and that >90% of the sugarcane expressed genes were tagged.
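The redundancy figure in the sugarcane abstract follows from the two transcript counts it reports. The short check below is a sketch only: the variable names are introduced for illustration, and it assumes redundancy is defined as the fraction of first-pass assembled sequences that collapse on reassembly, which the abstract does not state explicitly.

```python
# Figures quoted in the abstract above.
assembled = 43_141        # putative transcripts from the first assembly
unique_genes = 33_620     # estimate after reassembling the consensus sequences
full_length = 14_409      # assembled sequences with a full-length clone

# Assumed definition: share of first-pass sequences removed on reassembly.
redundancy_pct = round(100 * (assembled - unique_genes) / assembled, 1)
full_length_pct = round(100 * full_length / assembled, 1)
print(redundancy_pct, full_length_pct)
```

Under that assumption the two results round to the 22% and 33% quoted in the text.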