ESAP plus: a web-based server for EST-SSR marker development.
ABSTRACT: Simple sequence repeats (SSRs) have become widely used as molecular markers in plant genetic studies due to their abundance, high allelic variation at each locus and simplicity to analyze using conventional PCR amplification. To study plants with unknown genome sequence, SSR markers from Expressed Sequence Tags (ESTs), which can be obtained from the plant mRNA (converted to cDNA), must be utilized. With the advent of high-throughput sequencing technology, huge EST sequence data have been generated and are now accessible from many public databases. However, SSR marker identification from a large in-house or public EST collection requires a computational pipeline that makes use of several standard bioinformatic tools to design high quality EST-SSR primers. Some of these computational tools are not users friendly and must be tightly integrated with reference genomic databases.A web-based bioinformatic pipeline, called EST Analysis Pipeline Plus (ESAP Plus), was constructed for assisting researchers to develop SSR markers from a large EST collection. ESAP Plus incorporates several bioinformatic scripts and some useful standard software tools necessary for the four main procedures of EST-SSR marker development, namely 1) pre-processing, 2) clustering and assembly, 3) SSR mining and 4) SSR primer design. The proposed pipeline also provides two alternative steps for reducing EST redundancy and identifying SSR loci. Using public sugarcane ESTs, ESAP Plus automatically executed the aforementioned computational pipeline via a simple web user interface, which was implemented using standard PHP, HTML, CSS and Java scripts. With ESAP Plus, users can upload raw EST data and choose various filtering options and parameters to analyze each of the four main procedures through this web interface. All input EST data and their predicted SSR results will be stored in the ESAP Plus MySQL database. Users will be notified via e-mail when the automatic process is completed and they can download all the results through the web interface.ESAP Plus is a comprehensive and convenient web-based bioinformatic tool for SSR marker development. ESAP Plus offers all necessary EST-SSR development processes with various adjustable options that users can easily use to identify SSR markers from a large EST collection. With familiar web interface, users can upload the raw EST using the data submission page and visualize/download the corresponding EST-SSR information from within ESAP Plus. ESAP Plus can handle considerably large EST datasets. This EST-SSR discovery tool can be accessed directly from: http://gbp.kku.ac.th/esap_plus/ .
Project description:BACKGROUND: Single-pass, partial sequencing of complementary DNA (cDNA) libraries generates thousands of chromatograms that are processed into high quality expressed sequence tags (ESTs), and then assembled into contigs representative of putative genes. Usually, to be of value, ESTs and contigs must be associated with meaningful annotations, and made available to end-users. RESULTS: A web application, Expressed Sequence Tag Information Management and Annotation (ESTIMA), has been created to meet the EST annotation and data management requirements of multiple high-throughput EST sequencing projects. It is anchored on individual ESTs and organized around different properties of ESTs including chromatograms, base-calling quality scores, structure of assembled transcripts, and multiple sources of comparison to infer functional annotation, Gene Ontology associations, and cDNA library information. ESTIMA consists of a relational database schema and a set of interactive query interfaces. These are integrated with a suite of web-based tools that allow a user to query and retrieve information. Further, query results are interconnected among the various EST properties. ESTIMA has several unique features. Users may run their own EST processing pipeline, search against preferred reference genomes, and use any clustering and assembly algorithm. The ESTIMA database schema is very flexible and accepts output from any EST processing and assembly pipeline. ESTIMA has been used for the management of EST projects of many species, including honeybee (Apis mellifera), cattle (Bos taurus), songbird (Taeniopygia guttata), corn rootworm (Diabrotica vergifera), catfish (Ictalurus punctatus, Ictalurus furcatus), and apple (Malus x domestica). The entire resource may be downloaded and used as is, or readily adapted to fit the unique needs of other cDNA sequencing projects. CONCLUSIONS: The scripts used to create the ESTIMA interface are freely available to academic users in an archived format from http://titan.biotec.uiuc.edu/ESTIMA/. The entity-relationship (E-R) diagrams and the programs used to generate the Oracle database tables are also available. We have also provided detailed installation instructions and a tutorial at the same website. Presently the chromatograms, EST databases and their annotations have been made available for cattle and honeybee brain EST projects. Non-academic users need to contact the W.M. Keck Center for Functional and Comparative Genomics, University of Illinois at Urbana-Champaign, Urbana, IL, for licensing information.
Project description:BACKGROUND: Microsatellites or simple sequence repeats (SSRs) in expressed sequence tags (ESTs) are useful resources for genome analysis because of their abundance, functionality and polymorphism. The advent of commercial second generation sequencing machines has lead to new strategies for developing EST-SSR markers, necessitating the development of bioinformatic framework that can keep pace with the increasing quality and quantity of sequence data produced. We describe an open scheme for analyzing ESTs and developing EST-SSR markers from reads collected by Sanger sequencing and pyrosequencing of sugi (Cryptomeria japonica). RESULTS: We collected 141,097 sequence reads by Sanger sequencing and 1,333,444 by pyrosequencing. After trimming contaminant and low quality sequences, 118,319 Sanger and 1,201,150 pyrosequencing reads were passed to the MIRA assembler, generating 81,284 contigs that were analysed for SSRs. 4,059 SSRs were found in 3,694 (4.54%) contigs, giving an SSR frequency lower than that in seven other plant species with gene indices (5.4-21.9%). The average GC content of the SSR-containing contigs was 41.55%, compared to 40.23% for all contigs. Tri-SSRs were the most common SSRs; the most common motif was AT, which was found in 655 (46.3%) di-SSRs, followed by the AAG motif, found in 342 (25.9%) tri-SSRs. Most (72.8%) tri-SSRs were in coding regions, but 55.6% of the di-SSRs were in non-coding regions; the AT motif was most abundant in 3' untranslated regions. Gene ontology (GO) annotations showed that six GO terms were significantly overrepresented within SSR-containing contigs. Forty-four EST-SSR markers were developed from 192 primer pairs using two pipelines: read2Marker and the newly-developed CMiB, which combines several open tools. Markers resulting from both pipelines showed no differences in PCR success rate and polymorphisms, but PCR success and polymorphism were significantly affected by the expected PCR product size and number of SSR repeats, respectively. EST-SSR markers exhibited less polymorphism than genomic SSRs. CONCLUSIONS: We have created a new open pipeline for developing EST-SSR markers and applied it in a comprehensive analysis of EST-SSRs and EST-SSR markers in C. japonica. The results will be useful in genomic analyses of conifers and other non-model species.
Project description:Grape expressed sequence tags (ESTs) are a new resource for developing simple sequence repeat (SSR) functional markers for genotyping and genetic mapping. An integrated pipeline including several computational tools for SSR identification and functional annotation was developed to identify 6,447 EST-SSR sequences from a total collection of 215,609 grape ESTs retrieved from NCBI. The 6,447 EST-SSRs were further reduced to 1,701 non-redundant sequences via clustering analysis, and 1,037 of them were successfully designed with primer pairs flanking the SSR motifs. From them, 150 pairs of primers were randomly selected for PCR amplification, polymorphism and heterozygosity analysis in V. vinifera cvs. Riesling and Cabernet Sauvignon, and V. rotundifolia (muscadine grape) cvs. Summit and Noble, and 145 pairs of these primers yielded PCR products. Pairwise comparisons of loci between the parents Riesling and Cabernet Sauvignon showed that 72 were homozygous in both cultivars, while 70 loci were heterozygous in at least one cultivar of the two. Muscadine parents Noble and Summit had 90 homozygous SSR loci in both parents and contained 50 heterozygous loci in at least one of the two. These EST-SSR functional markers are a useful addition for grape genotyping and genome mapping. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1007/s11032-010-9477-2) contains supplementary material, which is available to authorized users.
Project description:We present a Galaxy based web server for processing and visualizing deeply sequenced data. The web server's core functionality consists of a suite of newly developed tools, called deepTools, that enable users with little bioinformatic background to explore the results of their sequencing experiments in a standardized setting. Users can upload pre-processed files with continuous data in standard formats and generate heatmaps and summary plots in a straight-forward, yet highly customizable manner. In addition, we offer several tools for the analysis of files containing aligned reads and enable efficient and reproducible generation of normalized coverage files. As a modular and open-source platform, deepTools can easily be expanded and customized to future demands and developments. The deepTools webserver is freely available at http://deeptools.ie-freiburg.mpg.de and is accompanied by extensive documentation and tutorials aimed at conveying the principles of deep-sequencing data analysis. The web server can be used without registration. deepTools can be installed locally either stand-alone or as part of Galaxy.
Project description:BACKGROUND: The ESTuber database (http://www.itb.cnr.it/estuber) includes 3,271 Tuber borchii expressed sequence tags (EST). The dataset consists of 2,389 sequences from an in-house prepared cDNA library from truffle vegetative hyphae, and 882 sequences downloaded from GenBank and representing four libraries from white truffle mycelia and ascocarps at different developmental stages. An automated pipeline was prepared to process EST sequences using public software integrated by in-house developed Perl scripts. Data were collected in a MySQL database, which can be queried via a php-based web interface. RESULTS: Sequences included in the ESTuber db were clustered and annotated against three databases: the GenBank nr database, the UniProtKB database and a third in-house prepared database of fungi genomic sequences. An algorithm was implemented to infer statistical classification among Gene Ontology categories from the ontology occurrences deduced from the annotation procedure against the UniProtKB database. Ontologies were also deduced from the annotation of more than 130,000 EST sequences from five filamentous fungi, for intra-species comparison purposes. Further analyses were performed on the ESTuber db dataset, including tandem repeats search and comparison of the putative protein dataset inferred from the EST sequences to the PROSITE database for protein patterns identification. All the analyses were performed both on the complete sequence dataset and on the contig consensus sequences generated by the EST assembly procedure. CONCLUSION: The resulting web site is a resource of data and links related to truffle expressed genes. The Sequence Report and Contig Report pages are the web interface core structures which, together with the Text search utility and the Blast utility, allow easy access to the data stored in the database.
Project description:Deciphering relevant biological insights from epigenomic data can be a challenging task. One commonly used approach is to perform enrichment analysis. However, finding, downloading and using the publicly available functional annotations require time, programming skills and IT infrastructure. Here we describe the online tool EpiAnnotator for performing enrichment analyses on epigenomic data in a fast and user-friendly way.EpiAnnotator is an R Package accompanied by a web interface. It contains regularly updated annotations from 4 public databases: Blueprint, RoadMap, GENCODE and the UCSC Genome Browser. Annotations are hosted locally or in a server environment and automatically updated by scripts of our own design. Thousands of tracks are available, reflecting data on a variety of tissues, cell types and cell lines from the human and mouse genomes. Users need to upload sets of selected and background regions. Results are displayed in customizable and easily interpretable figures.The R package and Shiny app are open source and available under the GPL v3 license. EpiAnnotator's web interface is accessible at http://firstname.lastname@example.org.
Project description:BACKGROUND: The ESTree db http://www.itb.cnr.it/estree/ represents a collection of Prunus persica expressed sequenced tags (ESTs) and is intended as a resource for peach functional genomics. A total of 6,155 successful EST sequences were obtained from four in-house prepared cDNA libraries from Prunus persica mesocarps at different developmental stages. Another 12,475 peach EST sequences were downloaded from public databases and added to the ESTree db. An automated pipeline was prepared to process EST sequences using public software integrated by in-house developed Perl scripts and data were collected in a MySQL database. A php-based web interface was developed to query the database. RESULTS: The ESTree db version as of April 2005 encompasses 18,630 sequences representing eight libraries. Contig assembly was performed with CAP3. Putative single nucleotide polymorphism (SNP) detection was performed with the AutoSNP program and a search engine was implemented to retrieve results. All the sequences and all the contig consensus sequences were annotated both with blastx against the GenBank nr db and with GOblet against the viridiplantae section of the Gene Ontology db. Links to NiceZyme (Expasy) and to the KEGG metabolic pathways were provided. A local BLAST utility is available. A text search utility allows querying and browsing the database. Statistics were provided on Gene Ontology occurrences to assign sequences to Gene Ontology categories. CONCLUSION: The resulting database is a comprehensive resource of data and links related to peach EST sequences. The Sequence Report and Contig Report pages work as the web interface core structures, giving quick access to data related to each sequence/contig.
Project description:BACKGROUND:Genome-wide association studies (GWAS) are an important tool for the mapping of complex traits and diseases. Visual inspection of genomic annotations may be used to generate insights into the biological mechanisms underlying GWAS-identified loci. RESULTS:We developed LocusTrack, a web-based application that annotates and creates plots of regional GWAS results and incorporates user-specified tracks that display annotations such as linkage disequilibrium (LD), phylogenetic conservation, chromatin state, and other genomic and regulatory elements. Currently, LocusTrack can integrate annotation tracks from the UCSC genome-browser as well as from any tracks provided by the user. CONCLUSION:LocusTrack is an easy-to-use application and can be accessed at the following URL: http://gump.qimr.edu.au/general/gabrieC/LocusTrack/. Users can upload and manage GWAS results and select from and/or provide annotation tracks using simple and intuitive menus. LocusTrack scripts and associated data can be downloaded from the website and run locally.
Project description:Single nucleotide polymorphisms (SNPs) represent the most abundant type of genetic variation that can be used as molecular markers. The SNPs that are hidden in sequence databases can be unlocked using bioinformatic tools. For efficient application of these SNPs, the sequence set should be error-free as much as possible, targeting single loci and suitable for the SNP scoring platform of choice. We have developed a pipeline to effectively mine SNPs from public EST databases with or without quality information using QualitySNP software, select reliable SNP and prepare the loci for analysis on the Illumina GoldenGate genotyping platform. The applicability of the pipeline was demonstrated using publicly available potato EST data, genotyping individuals from two diploid mapping populations and subsequently mapping the SNP markers (putative genes) in both populations. Over 7000 reliable SNPs were identified that met the criteria for genotyping on the GoldenGate platform. Of the 384 SNPs on the SNP array approximately 12% dropped out. For the two potato mapping populations 165 and 185 SNPs segregating SNP loci could be mapped on the respective genetic maps, illustrating the effectiveness of our pipeline for SNP selection and validation. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1007/s11032-009-9377-5) contains supplementary material, which is available to authorized users.
Project description:BACKGROUND: Simple Sequence Repeat (SSR) or microsatellite markers are valuable for genetic research. Experimental methods to develop SSR markers are laborious, time consuming and expensive. In silico approaches have become a practicable and relatively inexpensive alternative during the last decade, although testing putative SSR markers still is time consuming and expensive. In many species only a relatively small percentage of SSR markers turn out to be polymorphic. This is particularly true for markers derived from expressed sequence tags (ESTs). In EST databases a large redundancy of sequences is present, which may contain information on length-polymorphisms in the SSR they contain, and whether they have been derived from heterozygotes or from different genotypes. Up to now, although a number of programs have been developed to identify SSRs in EST sequences, no software can detect putatively polymorphic SSRs. RESULTS: We have developed PolySSR, a new pipeline to identify polymorphic SSRs rather than just SSRs. Sequence information is obtained from public EST databases derived from heterozygous individuals and/or at least two different genotypes. The pipeline includes PCR-primer design for the putatively polymorphic SSR markers, taking into account Single Nucleotide Polymorphisms (SNPs) in the flanking regions, thereby improving the success rate of the potential markers. A large number of polymorphic SSRs were identified using publicly available EST sequences of potato, tomato, rice, Arabidopsis, Brassica and chicken.The SSRs obtained were divided into long and short based on the number of times the motif was repeated. Surprisingly, the frequency of polymorphic SSRs was much higher in the short SSRs. CONCLUSION: PolySSR is a very effective tool to identify polymorphic SSRs. Using PolySSR, several hundred putative markers were developed and stored in a searchable database. Validation experiments showed that almost all markers that were indicated as putatively polymorphic by polySSR were indeed polymorphic. This greatly improves the efficiency of marker development, especially in species where there are low levels of polymorphism, like tomato. When combined with the new sequencing technologies PolySSR will have a big impact on the development of polymorphic SSRs in any species.PolySSR and the polymorphic SSR marker database are available from http://www.bioinformatics.nl/tools/polyssr/.