A second generation framework for the analysis of microsatellites in expressed sequence tags and the development of EST-SSR markers for a conifer, Cryptomeria japonica.
ABSTRACT: BACKGROUND: Microsatellites or simple sequence repeats (SSRs) in expressed sequence tags (ESTs) are useful resources for genome analysis because of their abundance, functionality and polymorphism. The advent of commercial second generation sequencing machines has led to new strategies for developing EST-SSR markers, necessitating the development of a bioinformatic framework that can keep pace with the increasing quality and quantity of sequence data produced. We describe an open scheme for analyzing ESTs and developing EST-SSR markers from reads collected by Sanger sequencing and pyrosequencing of sugi (Cryptomeria japonica). RESULTS: We collected 141,097 sequence reads by Sanger sequencing and 1,333,444 by pyrosequencing. After trimming contaminant and low quality sequences, 118,319 Sanger and 1,201,150 pyrosequencing reads were passed to the MIRA assembler, generating 81,284 contigs that were analysed for SSRs. 4,059 SSRs were found in 3,694 (4.54%) contigs, giving an SSR frequency lower than that in seven other plant species with gene indices (5.4-21.9%). The average GC content of the SSR-containing contigs was 41.55%, compared to 40.23% for all contigs. Tri-SSRs were the most common SSRs; the most common motif was AT, which was found in 655 (46.3%) di-SSRs, followed by the AAG motif, found in 342 (25.9%) tri-SSRs. Most (72.8%) tri-SSRs were in coding regions, but 55.6% of the di-SSRs were in non-coding regions; the AT motif was most abundant in 3' untranslated regions. Gene ontology (GO) annotations showed that six GO terms were significantly overrepresented within SSR-containing contigs. Forty-four EST-SSR markers were developed from 192 primer pairs using two pipelines: read2Marker and the newly-developed CMiB, which combines several open tools.
Markers resulting from both pipelines showed no differences in PCR success rate and polymorphisms, but PCR success and polymorphism were significantly affected by the expected PCR product size and number of SSR repeats, respectively. EST-SSR markers exhibited less polymorphism than genomic SSRs. CONCLUSIONS: We have created a new open pipeline for developing EST-SSR markers and applied it in a comprehensive analysis of EST-SSRs and EST-SSR markers in C. japonica. The results will be useful in genomic analyses of conifers and other non-model species.
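The SSR-detection step that such pipelines perform on assembled contigs can be illustrated with a minimal sketch. This is not the read2Marker or CMiB implementation; the function name `find_ssrs` and the minimum repeat counts in `MIN_REPEATS` are hypothetical choices made only for illustration, and only perfect (uninterrupted) repeats are considered.

```python
import re

# Minimum repeat counts per motif length. These thresholds are illustrative
# assumptions, not the settings used in the study.
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return a sorted list of (start, motif, repeat_count) for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for size, min_rep in sorted(min_repeats.items()):
        # One motif of `size` bases followed by at least min_rep - 1 more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. AA, ATAT), so each repeat is reported once at its true size.
            if any(motif == motif[:k] * (size // k)
                   for k in range(1, size) if size % k == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // size))
    return sorted(hits)


if __name__ == "__main__":
    # Toy contig with one di-SSR and one tri-SSR.
    contig = "GGC" + "AT" * 7 + "CCT" + "AAG" * 5 + "TTC"
    for start, motif, count in find_ssrs(contig):
        print(start, motif, count)
```

Start positions are 0-based. Real tools typically also handle compound and imperfect repeats, which this sketch deliberately ignores.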
Project description:BACKGROUND: Epimedium sagittatum (Sieb. et Zucc.) Maxim, a traditional Chinese medicinal plant species, has been used extensively as genuine medicinal materials. Certain Epimedium species are endangered due to commercial overexploitation, while sustainable application, conservation genetics, systematics, and marker-assisted selection (MAS) of Epimedium remain little studied due to the lack of molecular markers. Here, we report a set of expressed sequence tags (ESTs) and simple sequence repeats (SSRs) identified in these ESTs for E. sagittatum. RESULTS: cDNAs of E. sagittatum are sequenced using 454 GS-FLX pyrosequencing technology. The raw reads are cleaned and assembled into a total of 76,459 consensus sequences comprising 17,231 contigs and 59,228 singlets. About 38.5% (29,466) of the consensus sequences significantly match the non-redundant protein database (E-value < 1e-10), 22,295 of which are further annotated using Gene Ontology (GO) terms. A total of 2,810 EST-SSRs are identified from the Epimedium EST dataset. Trinucleotide SSRs are the dominant repeat type (55.2%), followed by dinucleotide (30.4%), tetranucleotide (7.3%), hexanucleotide (4.9%), and pentanucleotide (2.2%) SSRs. The dominant repeat motif is AAG/CTT (23.6%) followed by AG/CT (19.3%), ACC/GGT (11.1%), AT/AT (7.5%), and AAC/GTT (5.9%). Thirty-two SSR-ESTs are randomly selected and primer pairs are synthesized for testing the transferability across 52 Epimedium species. Eighteen primer pairs (85.7%) could be successfully transferred to Epimedium species and sixteen of those show high genetic diversity with 0.35 of observed heterozygosity (Ho) and 0.65 of expected heterozygosity (He) and high number of alleles per locus (11.9). CONCLUSION: A large EST dataset with a total of 76,459 consensus sequences is generated, aiming to provide sequence information for deciphering secondary metabolism, especially the flavonoid pathway, in Epimedium.
A total of 2,810 EST-SSRs are identified from the EST dataset and approximately 1580 EST-SSR markers are transferable. E. sagittatum EST-SSR transferability to the major Epimedium germplasm is up to 85.7%. Therefore, this EST dataset and EST-SSRs will be a powerful resource for further studies such as taxonomy, molecular breeding, genetics, genomics, and secondary metabolism in Epimedium species.
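The observed (Ho) and expected (He) heterozygosity values quoted above are standard per-locus diversity statistics. The sketch below shows one common way to compute them, assuming diploid genotypes and the uncorrected estimator He = 1 - Σ p_i², where p_i is the frequency of allele i. The abstract does not state which estimator the authors used, and both the function name `heterozygosity` and the genotype data are hypothetical.

```python
from collections import Counter


def heterozygosity(genotypes):
    """Return (Ho, He) for one locus.

    genotypes: list of (allele_a, allele_b) tuples, one per diploid individual.
    He is the uncorrected estimate 1 - sum(p_i ** 2).
    """
    n = len(genotypes)
    # Ho: fraction of individuals carrying two different alleles.
    ho = sum(a != b for a, b in genotypes) / n
    # Allele frequencies over all 2n gene copies.
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = 2 * n
    he = 1.0 - sum((c / total) ** 2 for c in counts.values())
    return ho, he


if __name__ == "__main__":
    # Hypothetical allele sizes (bp) for four individuals at one SSR locus.
    sample = [(150, 150), (150, 154), (154, 158), (158, 158)]
    ho, he = heterozygosity(sample)
    print(f"Ho = {ho:.3f}, He = {he:.3f}")
```

Averaging these per-locus values across markers gives summary figures of the kind reported in the abstract.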
Project description:Simple Sequence Repeats (SSRs) developed from Expressed Sequence Tags (ESTs), known as EST-SSRs, are a widely used and potentially valuable source of gene-based markers owing to their high cross-taxon portability and their rapid, less expensive development. The EST sequence information in publicly available databases is increasing at a growing rate. Emerging computational approaches provide a better alternative to conventional methods for developing SSR markers from ESTs. In the present study, 12,851 EST sequences of Camellia sinensis, downloaded from the National Center for Biotechnology Information (NCBI), were mined for the development of microsatellites. After preprocessing and assembly of these sequences using various computational tools, 6148 non-redundant EST sequences (4779 singletons and 1369 contigs) were obtained. Out of a total of 3822.68 kb of sequence examined, 1636 (26.61%) EST sequences containing 2371 SSRs were detected, with a density of 1 SSR/1.61 kb, leading to the development of 245 primer pairs. These mined EST-SSR markers will help further the study of variability, mapping, and evolutionary relationships in Camellia sinensis. In addition, these SSRs can also be applied in various studies across species.
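The summary statistics in this abstract follow directly from the reported counts. As a quick check, the arithmetic can be reproduced as below; all figures are taken from the abstract, and the variable names are only illustrative.

```python
# Figures taken from the Camellia sinensis abstract above.
total_kb = 3822.68       # total sequence examined, in kb
n_singletons = 4779
n_contigs = 1369
n_ssr_ests = 1636        # ESTs containing at least one SSR
n_ssrs = 2371            # SSRs detected

# Non-redundant ESTs are the singletons plus the contigs.
n_nonredundant = n_singletons + n_contigs

# Density expressed as kb of sequence per SSR.
kb_per_ssr = total_kb / n_ssrs

# Share of non-redundant ESTs that contain an SSR.
pct_ssr_ests = 100 * n_ssr_ests / n_nonredundant

print(f"non-redundant ESTs: {n_nonredundant}")
print(f"1 SSR per {kb_per_ssr:.2f} kb")
print(f"{pct_ssr_ests:.2f}% of ESTs contain SSRs")
```

The results agree with the reported 6148 non-redundant sequences, 1 SSR/1.61 kb, and 26.61%.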
Project description:BACKGROUND: Pearl millet [Pennisetum glaucum (L.) R. Br.] is a staple food and fodder crop of marginal agricultural lands of sub-Saharan Africa and the Indian subcontinent. It is also a summer forage crop in the southern USA, Australia and Latin America, and is the preferred mulch in Brazilian no-till soybean production systems. Use of molecular marker technology for pearl millet genetic improvement has been limited. Progress is hampered by insufficient numbers of PCR-compatible co-dominant markers that can be used readily in applied breeding programmes. Therefore, we sought to develop additional SSR markers for the pearl millet research community. RESULTS: A set of new pearl millet SSR markers were developed using available sequence information from 3520 expressed sequence tags (ESTs). After clustering, unigene sequences (2175 singlets and 317 contigs) were searched for the presence of SSRs. We detected 164 sequences containing SSRs (at least 14 bases in length), with a density of one per 1.75 kb of EST sequence. Di-nucleotide repeats were the most abundant followed by tri-nucleotide repeats. Ninety primer pairs were designed and tested for their ability to detect polymorphism across a panel of 11 pairs of pearl millet mapping population parental lines. Clear amplification products were obtained for 58 primer pairs. Of these, 15 were monomorphic across the panel. A subset of 21 polymorphic EST-SSRs and 6 recently developed genomic SSR markers were mapped using existing mapping populations. Linkage map positions of these EST-SSR were compared by homology search with mapped rice genomic sequences on the basis of pearl millet-rice synteny. Most new EST-SSR markers mapped to distal regions of linkage groups, often to previous gaps in these linkage maps.
These new EST-SSRs are now used by ICRISAT in pearl millet diversity assessment and marker-aided breeding programs. CONCLUSION: This study has demonstrated the potential of EST-derived SSR primer pairs in pearl millet. As reported for other crops, EST-derived SSRs provide a cost-saving marker development option in pearl millet. Resources developed in this study have added a sizeable number of useful SSRs to the existing repertoire of circa 100 genomic SSRs that were previously available to pearl millet researchers.
Project description:BACKGROUND: Simple sequence repeat (SSR) markers are highly informative and widely used for genetic and breeding studies in several plant species. They are used for cultivar identification, variety protection, as anchor markers in genetic mapping, and in marker-assisted breeding. Currently, a limited number of SSR markers are publicly available for perennial ryegrass (Lolium perenne). We report on the exploitation of a comprehensive EST collection in L. perenne for SSR identification. The objectives of this study were 1) to analyse the frequency, type, and distribution of SSR motifs in ESTs derived from three genotypes of L. perenne, 2) to perform a comparative analysis of SSR motif polymorphisms between allelic sequences, 3) to conduct a comparative analysis of SSR motif polymorphisms between orthologous sequences of L. perenne, Festuca arundinacea, Brachypodium distachyon, and O. sativa, 4) to identify functionally associated EST-SSR markers for application in comparative genomics and breeding. RESULTS: From 25,744 ESTs, representing 8.53 megabases of nucleotide information from three genotypes of L. perenne, 1,458 ESTs (5.7%) contained one or more SSRs. Of these SSRs, 955 (3.7%) were non-redundant. Tri-nucleotide repeats were the most abundant type of repeats followed by di- and tetra-nucleotide repeats. The EST-SSRs from the three genotypes were analysed for allelic- and/or genotypic SSR motif polymorphisms. Most of the SSR motifs (97.7%) showed no polymorphisms, whereas 22 EST-SSRs showed allelic- and/or genotypic polymorphisms. All polymorphisms identified were changes in the number of repeat units. Comparative analysis of the L. perenne EST-SSRs with sequences of Festuca arundinacea, Brachypodium distachyon, and Oryza sativa identified 19 clusters of orthologous sequences between these four species. Analysis of the clusters showed that the SSR motif generally is conserved in the closely related species F. 
arundinacea, but often differs in length of the SSR motif. In contrast, SSR motifs are often lost in the more distantly related species B. distachyon and O. sativa. CONCLUSION: The results indicate that the L. perenne EST-SSR markers are a valuable resource for genetic mapping, as well as evaluation of co-location between QTLs and functionally associated markers.
Project description:Zingiber officinale is a model spice herb, well known for its medicinal value. It is primarily a vegetatively propagated commercial crop. However, considerable diversity in its morphology, fiber content and chemoprofiles has been reported. The present study explores the utility of EST-derived markers in studying genetic diversity in different accessions of Z. officinale and their cross transferability within the Zingiberaceae family. A total of 38,115 EST sequences were assembled to generate 7850 contigs and 10,762 singletons. SSRs were searched in the unigenes and 515 SSR-containing ESTs were identified with a frequency of 1 SSR per 25.21 kb of the genome. These ESTs were also annotated using BLAST2GO. Primers were designed for 349 EST-SSRs and 25 primer pairs were randomly picked for the EST-SSR study. Out of these, 16 primer pairs could be optimized for amplification in different accessions of Z. officinale as well as other species belonging to Zingiberaceae. GES454, GES466, GES480 and GES486 markers were found to exhibit 100% cross-transferability among different members of Zingiberaceae.
Project description:BACKGROUND: Currently there exists a limited availability of genetic marker resources in sweetpotato (Ipomoea batatas), which is hindering genetic research in this species. It is necessary to develop more molecular markers for potential use in sweetpotato genetic research. With the newly developed next generation sequencing technology, large amounts of transcribed sequences of sweetpotato have been generated and are available for identifying SSR markers by data mining. RESULTS: In this study, we investigated 181,615 ESTs for the identification and development of SSR markers. In total, 8,294 SSRs were identified from 7,163 SSR-containing unique ESTs. On average, one SSR was found per 7.1 kb of EST sequence with tri-nucleotide motifs (42.9%) being the most abundant followed by di- (41.2%), tetra- (9.2%), penta- (3.7%) and hexa-nucleotide (3.1%) repeat types. The top five motifs included AG/CT (26.9%), AAG/CTT (13.5%), AT/TA (10.6%), CCG/CGG (5.8%) and AAT/ATT (4.5%). After removing possible duplicates of published EST-SSRs of sweetpotato, a total of 7,958 non-repeat SSR motifs were identified. Based on these SSR-containing sequences, 1,060 pairs of high-quality SSR primers were designed and used for validation of the amplification and assessment of the polymorphism between two parents of one mapping population (E Shu 3 Hao and Guang 2k-30) and eight accessions of cultivated sweetpotatoes. The results showed that 816 primer pairs could yield reproducible and strong amplification products, of which 195 (23.9%) and 342 (41.9%) primer pairs exhibited polymorphism between E Shu 3 Hao and Guang 2k-30 and among the 8 cultivated sweetpotatoes, respectively. CONCLUSION: This study gives an insight into the frequency, type and distribution of sweetpotato EST-SSRs and demonstrates successful development of EST-SSR markers in cultivated sweetpotato.
These EST-SSR markers could enrich the current resource of molecular markers for the sweetpotato community and would be useful for qualitative and quantitative trait mapping, marker-assisted selection, evolution and genetic diversity studies in cultivated sweetpotato and related Ipomoea species.
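Motif classes such as AG/CT or AAG/CTT arise because a repeat read from either strand, or starting at a different base of the unit, is the same microsatellite. A minimal sketch of that grouping is shown below. The function names are hypothetical, and the convention of labelling each class by the alphabetically smallest rotation on each strand is an assumption that happens to reproduce labels like those in the abstract; it is not a description of the authors' software.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Return the reverse complement of a DNA motif."""
    return motif.translate(COMPLEMENT)[::-1]


def rotations(motif):
    """All cyclic shifts of a motif, e.g. AAG -> AAG, AGA, GAA."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def canonical_motif(motif):
    """Single representative: smallest rotation over both strands."""
    motif = motif.upper()
    return min(rotations(motif) + rotations(reverse_complement(motif)))


def motif_class(motif):
    """Label such as 'AAG/CTT': smallest rotation on each strand."""
    motif = motif.upper()
    forward = min(rotations(motif))
    reverse = min(rotations(reverse_complement(motif)))
    return "/".join(sorted({forward, reverse}))


if __name__ == "__main__":
    for m in ["GA", "TC", "CTT", "GAA", "GGC", "TTA"]:
        print(m, canonical_motif(m), motif_class(m))
```

Counting SSRs by `motif_class` rather than by raw motif avoids splitting one class across several table rows. Self-complementary repeats such as AT collapse to a single label under this convention.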
Project description:BACKGROUND: Limited DNA sequence and DNA marker resources have been developed for Iris (Iridaceae), a monocot genus of 200-300 species in the Asparagales, several of which are horticulturally important. We mined an I. brevicaulis-I. fulva EST database for simple sequence repeats (SSRs) and developed ortholog-specific EST-SSR markers for genetic mapping and other genotyping applications in Iris. Here, we describe the abundance and other characteristics of SSRs identified in the transcript assembly (EST database) and the cross-species utility and polymorphisms of I. brevicaulis-I. fulva EST-SSR markers among wild collected ecotypes and horticulturally important cultivars. RESULTS: Collectively, 6,530 ESTs were produced from normalized leaf and root cDNA libraries of I. brevicaulis (IB72) and I. fulva (IF174), and assembled into 4,917 unigenes (1,066 contigs and 3,851 singletons). We identified 1,447 SSRs in 1,162 unigenes and developed 526 EST-SSR markers, each tracing a different unigene. Three-fourths of the EST-SSR markers (399/526) amplified alleles from IB72 and IF174 and 84% (335/399) were polymorphic between IB25 and IF174, the parents of I. brevicaulis x I. fulva mapping populations. Forty EST-SSR markers were screened for polymorphisms among 39 ecotypes or cultivars of seven species - 100% amplified alleles from wild collected ecotypes of Louisiana Iris (I. brevicaulis, I. fulva, I. nelsonii, and I. hexagona), whereas 42-52% amplified alleles from cultivars of three horticulturally important species (I. pseudacorus, I. germanica, and I. sibirica). Ecotypes and cultivars were genetically diverse - the number of alleles/locus ranged from two to 18 and mean heterozygosity was 0.76. CONCLUSION: Nearly 400 ortholog-specific EST-SSR markers were developed for comparative genetic mapping and other genotyping applications in Iris, were highly polymorphic among ecotypes and cultivars, and have broad utility for genotyping applications within the genus.
Project description:BACKGROUND: Molecular breeding of pepper (Capsicum spp.) can be accelerated by developing DNA markers associated with transcriptomes in breeding germplasm. Before the advent of next generation sequencing (NGS) technologies, the majority of sequencing data were generated by the Sanger sequencing method. By leveraging Sanger EST data, we have generated a wealth of genetic information for pepper including thousands of SNPs and Single Position Polymorphic (SPP) markers. To complement and enhance these resources, we applied NGS to three pepper genotypes: Maor, Early Jalapeño and Criollo de Morelos-334 (CM334) to identify SNPs and SSRs in the assembly of these three genotypes. RESULTS: Two pepper transcriptome assemblies were developed with different purposes. The first reference sequence, assembled by CAP3 software, comprises 31,196 contigs from >125,000 Sanger-EST sequences that were mainly derived from a Korean F1-hybrid line, Bukang. Overlapping probes were designed for 30,815 unigenes to construct a pepper Affymetrix GeneChip® microarray for whole genome analyses. In addition, custom Python scripts were used to identify 4,236 SNPs in contigs of the assembly. A total of 2,489 simple sequence repeats (SSRs) were identified from the assembly, and primers were designed for the SSRs. Annotation of contigs using Blast2GO software resulted in information for 60% of the unigenes in the assembly. The second transcriptome assembly was constructed from more than 200 million Illumina Genome Analyzer II reads (80-120 nt) using a combination of Velvet, CLC workbench and CAP3 software packages. BWA, SAMtools and in-house Perl scripts were used to identify SNPs among three pepper genotypes. The SNPs were filtered to be at least 50 bp from any intron-exon junctions as well as flanking SNPs. More than 22,000 high-quality putative SNPs were identified. 
Using the MISA software, 10,398 SSR markers were also identified within the Illumina transcriptome assembly and primers were designed for the identified markers. The assembly was annotated by Blast2GO and 14,740 (12%) of annotated contigs were associated with functional proteins. CONCLUSIONS: Before the pepper genome sequence became available, assembling transcriptomes of this economically important crop was required to generate thousands of high-quality molecular markers that could be used in breeding programs. In order to have a better understanding of the assembled sequences and to identify candidate genes underlying QTLs, we annotated the contigs of the Sanger-EST and Illumina transcriptome assemblies. This and other information has been curated in a database dedicated to the pepper project.
Project description:BACKGROUND: Expressed Sequence Tags (ESTs) are a source of simple sequence repeats (SSRs) that can be used to develop molecular markers for genetic studies. The availability of ESTs for Quercus robur and Quercus petraea provided a unique opportunity to develop microsatellite markers to accelerate research aimed at studying adaptation of these long-lived species to their environment. As a first step toward the construction of a SSR-based linkage map of oak for quantitative trait locus (QTL) mapping, we describe the mining and survey of EST-SSRs as well as a fast and cost-effective approach (bin mapping) to assign these markers to an approximate map position. We also compared the level of polymorphism between genomic and EST-derived SSRs and address the transferability of EST-SSRs in Castanea sativa (chestnut). RESULTS: A catalogue of 103,000 Sanger ESTs was assembled into 28,024 unigenes, of which 18.6% presented one or more SSR motifs. More than 42% of these SSRs corresponded to trinucleotides. Primer pairs were designed for 748 putative unigenes. Overall 37.7% (283) were found to amplify a single polymorphic locus in a reference full-sib pedigree of Quercus robur. The usefulness of these loci for establishing a genetic map was assessed using a bin mapping approach. Bin maps were constructed for the male and female parental tree for which framework linkage maps based on AFLP markers were available. The bin set consisted of 14 highly informative offspring selected based on the number and position of crossover sites. The female and male maps comprised 44 and 37 bins, with an average bin length of 16.5 cM and 20.99 cM, respectively. A total of 256 EST-SSRs were assigned to bins and their map position was further validated by linkage mapping.
EST-SSRs were found to be less polymorphic than genomic SSRs, but their transferability rate to chestnut, a phylogenetically related species to oak, was higher. CONCLUSION: We have generated a bin map for oak comprising 256 EST-SSRs. This resource constitutes a first step toward the establishment of a gene-based map for this genus that will facilitate the dissection of QTLs affecting complex traits of ecological importance.
Project description:BACKGROUND: Lack of sufficient molecular markers hinders current genetic research in peanuts (Arachis hypogaea L.). It is necessary to develop more molecular markers for potential use in peanut genetic research. With the development of peanut EST projects, a vast amount of available EST sequence data has been generated. These data offered an opportunity to identify SSRs in ESTs by data mining. RESULTS: In this study, we investigated 24,238 ESTs for the identification and development of SSR markers. In total, 881 SSRs were identified from 780 SSR-containing unique ESTs. On average, one SSR was found per 7.3 kb of EST sequence with tri-nucleotide motifs (63.9%) being the most abundant followed by di- (32.7%), tetra- (1.7%), hexa- (1.0%) and penta-nucleotide (0.7%) repeat types. The top six motifs included AG/TC (27.7%), AAG/TTC (17.4%), AAT/TTA (11.9%), ACC/TGG (7.72%), ACT/TGA (7.26%) and AT/TA (6.3%). Based on the 780 SSR-containing ESTs, a total of 290 primer pairs were successfully designed and used for validation of the amplification and assessment of the polymorphism among 22 genotypes of cultivated peanuts and 16 accessions of wild species. The results showed that 251 primer pairs yielded amplification products, of which 26 and 221 primer pairs exhibited polymorphism among the cultivated and wild species examined, respectively. Two to four alleles were found in cultivated peanuts, while 3-8 alleles were present in wild species. The apparent broad polymorphism was further confirmed by cloning and sequencing of amplified alleles. Sequence analysis of selected amplified alleles revealed that allelic diversity could be attributed mainly to differences in repeat type and length in the microsatellite regions. In addition, a few single base mutations were observed in the microsatellite flanking regions.
CONCLUSION: This study gives an insight into the frequency, type and distribution of peanut EST-SSRs and demonstrates successful development of EST-SSR markers in cultivated peanut. These EST-SSR markers could enrich the current resource of molecular markers for the peanut community and would be useful for qualitative and quantitative trait mapping, marker-assisted selection, and genetic diversity studies in cultivated peanut as well as related Arachis species. All of the 251 working primer pairs with names, motifs, repeat types, primer sequences, and alleles tested in cultivated and wild species are listed in Additional File 1.