Bayesian model accounting for within-class biological variability in Serial Analysis of Gene Expression (SAGE).
ABSTRACT: BACKGROUND: An important challenge for transcript counting methods such as Serial Analysis of Gene Expression (SAGE), "Digital Northern" or Massively Parallel Signature Sequencing (MPSS) is to carry out statistical analyses that account for the within-class variability, i.e., variability due to the intrinsic biological differences among sampled individuals of the same class, and not only variability due to technical sampling error. RESULTS: We introduce a Bayesian model that accounts for the within-class variability by means of a mixture distribution. We show that the previously available approaches of aggregation in pools ("pseudo-libraries") and the Beta-Binomial model are particular cases of the mixture model. We illustrate our method with a brain tumor vs. normal comparison using SAGE data from public databases. We show examples of tags regarded as differentially expressed with high significance if the within-class variability is ignored, but clearly not so significant if one accounts for it. CONCLUSION: Using available information about biological replicates, one can transform a list of candidate transcripts showing differential expression into a more reliable one. Our method is freely available, under GPL/GNU copyleft, through a user-friendly web-based on-line tool or as R language scripts at the supplemental web site.
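The effect described above, where a tag looks highly significant under pooling but less so once within-class variability is allowed for, can be illustrated with a small sketch. It compares the upper-tail probability of one tag count under a pooled Binomial model and under a Beta-Binomial model with the same mean. The function names, library size, and parameter values are assumptions chosen for illustration; this is not the authors' R implementation or their full mixture model.

```python
from math import lgamma, exp, log

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_choose(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def binomial_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p), with 0 < p < 1."""
    return exp(log_choose(n, k) + k * log(p) + (n - k) * log(1 - p))

def beta_binomial_pmf(k, n, a, b):
    """P(K = k) when p ~ Beta(a, b) and K | p ~ Binomial(n, p)."""
    return exp(log_choose(n, k) + log_beta(k + a, n - k + b) - log_beta(a, b))

def upper_tail(pmf, k_obs, n, *params):
    """P(K >= k_obs) under the given probability mass function."""
    return sum(pmf(k, n, *params) for k in range(k_obs, n + 1))

if __name__ == "__main__":
    n, k_obs = 1000, 25            # hypothetical library size and tag count
    # Both models have mean abundance 0.01; Beta(2, 198) adds
    # between-individual spread around that mean.
    print("Binomial tail:     ", upper_tail(binomial_pmf, k_obs, n, 0.01))
    print("Beta-Binomial tail:", upper_tail(beta_binomial_pmf, k_obs, n, 2.0, 198.0))
```

With these illustrative values the Beta-Binomial tail is the larger of the two, because the extra variance makes a count of 25 less surprising.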
Project description:Serial Analysis of Gene Expression (SAGE) and Massively Parallel Signature Sequencing (MPSS) are powerful techniques for gene expression analysis. A crucial step in analyzing SAGE and MPSS data is the assignment of experimentally obtained tags to a known transcript. However, tag to transcript assignment is not a straightforward process since alternative tags for a given transcript can also be experimentally obtained. Here, we have evaluated the impact of Single Nucleotide Polymorphisms (SNPs) on the generation of alternative SAGE and MPSS tags. This was achieved through the construction of a reference database of SNP-associated alternative tags, which has been integrated with SAGE Genie. A total of 2020 SNP-associated alternative tags were catalogued in our reference database and at least one SNP-associated alternative tag was observed for approximately 8.6% of all known human genes. A significant fraction (61.9%) of these alternative tags matched a list of experimentally obtained tags, validating their existence. In addition, the origin of four out of five SNP-associated alternative MPSS tags was experimentally confirmed through the use of the GLGI-MPSS protocol (Generation of Long cDNA fragments for Gene Identification). The availability of our SNP-associated alternative tag database will certainly improve the interpretation of SAGE and MPSS experiments.
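How a SNP can give rise to an alternative tag can be sketched in a few lines. The example assumes the classic SAGE convention of a 10 bp tag taken immediately downstream of the 3'-most CATG (NlaIII) site. The transcript sequence, SNP positions, and function names are hypothetical, and the sketch does not reproduce the pipeline used to build the reference database.

```python
ANCHOR = "CATG"   # assumed anchoring-enzyme site (NlaIII)
TAG_LEN = 10      # assumed short-SAGE tag length

def extract_tag(transcript, anchor=ANCHOR, tag_len=TAG_LEN):
    """Return the tag downstream of the 3'-most anchor site, or None."""
    seq = transcript.upper()
    pos = seq.rfind(anchor)
    if pos == -1:
        return None                      # no anchor site in this transcript
    start = pos + len(anchor)
    if start + tag_len > len(seq):
        return None                      # too few bases after the site
    return seq[start:start + tag_len]

def apply_snp(transcript, position, alt_base):
    """Substitute alt_base at a 0-based position."""
    seq = transcript.upper()
    return seq[:position] + alt_base.upper() + seq[position + 1:]

def snp_alternative_tag(transcript, position, alt_base):
    """Return the tag produced by the SNP allele if it differs, else None."""
    ref_tag = extract_tag(transcript)
    alt_tag = extract_tag(apply_snp(transcript, position, alt_base))
    return alt_tag if alt_tag != ref_tag else None

# Hypothetical transcript with two CATG sites.
TRANSCRIPT = "GG" + "CATG" + "AAACCCGGGTTTA" + "CATG" + "ACGTACGTAG" + "CTTTTT"

if __name__ == "__main__":
    print(extract_tag(TRANSCRIPT))                    # reference tag
    print(snp_alternative_tag(TRANSCRIPT, 25, "T"))   # SNP inside the tag
    print(snp_alternative_tag(TRANSCRIPT, 22, "A"))   # SNP destroys the 3'-most site
```

The two SNPs show two different mechanisms: one changes a base inside the tag, the other removes the 3'-most anchor site so that the tag is taken from the next site upstream.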
Project description:The Mouse SAGE Site is a web-based database of all available public libraries generated by the Serial Analysis of Gene Expression (SAGE) from various mouse tissues and cell lines. The database contains mouse SAGE libraries organized in a uniform way and provides web-based tools for browsing, comparing and searching SAGE data with reliable tag-to-gene identification. A modified approach based on the SAGEmap database is used for reliable tag identification. The Mouse SAGE Site is maintained on an ongoing basis at the Institute of Molecular Genetics, Academy of Sciences of the Czech Republic and is accessible at the internet address http://mouse.biomed.cas.cz/sage/.
Project description:BACKGROUND: Rice blast, caused by the fungal pathogen Magnaporthe grisea, is a devastating disease causing tremendous yield loss in rice production. The public availability of the complete genome sequence of M. grisea provides ample opportunities to understand the molecular mechanism of its pathogenesis on rice plants at the transcriptome level. To identify all the expressed genes encoded in the fungal genome, we have analyzed the mycelium and appressorium transcriptomes using massively parallel signature sequencing (MPSS), robust-long serial analysis of gene expression (RL-SAGE) and oligoarray methods. RESULTS: The MPSS analyses identified 12,531 and 12,927 distinct significant tags from mycelia and appressoria, respectively, while the RL-SAGE analysis identified 16,580 distinct significant tags from the mycelial library. When matching these 12,531 mycelial and 12,927 appressorial significant tags to the annotated CDS, 500 bp upstream and 500 bp downstream of CDS, 6,735 unique genes in mycelia and 7,686 unique genes in appressoria were identified. A total of 7,135 mycelium-specific and 7,531 appressorium-specific significant MPSS tags were identified, which correspond to 2,088 and 1,784 annotated genes, respectively, when matching to the same set of reference sequences. Nearly 85% of the significant MPSS tags from mycelia and appressoria and 65% of the significant tags from the RL-SAGE mycelium library matched to the M. grisea genome. MPSS and RL-SAGE methods supported the expression of more than 9,000 genes, representing over 80% of the predicted genes in M. grisea. About 40% of the MPSS tags and 55% of the RL-SAGE tags represent novel transcripts since they had no matches in the existing M. grisea EST collections. Over 19% of the annotated genes were found to produce both sense and antisense tags in the protein-coding region.
The oligoarray analysis identified the expression of 3,793 mycelium-specific and 4,652 appressorium-specific genes. A total of 2,430 mycelial genes and 1,886 appressorial genes were identified by both MPSS and oligoarray. CONCLUSION: The comprehensive and deep transcriptome analysis by MPSS and RL-SAGE methods identified many novel sense and antisense transcripts in the M. grisea genome at two important growth stages. The differentially expressed transcripts that were identified, especially those specifically expressed in appressoria, represent a genomic resource useful for gaining a better understanding of the molecular basis of M. grisea pathogenicity. Further analysis of the novel antisense transcripts will provide new insights into the regulation and function of these genes in fungal growth, development and pathogenesis in the host plants.
Project description:BACKGROUND: Serial Analysis of Gene Expression (SAGE) is a method of large-scale gene expression analysis that has the potential to generate the full list of mRNAs present within a cell population at a given time, together with their frequencies. An essential step in SAGE library analysis is the unambiguous assignment of each 14 bp tag to the transcript from which it was derived. This process, called tag-to-gene mapping, is a step in the analysis of SAGE libraries that still needs improvement. Indeed, the existing web sites providing correspondence between tags and transcripts do not cover all species for which numerous ESTs and cDNAs have already been sequenced. RESULTS: For this reason, we designed and implemented a freely available tool called Identitag for tag identification that can be used in any species for which transcript sequences are available. Identitag is based on a relational database structure in order to allow rapid and easy storage and updating of data and, most importantly, to allow identification parameters to be defined precisely. This structure can be viewed as three interconnected modules: the first stores virtual tags extracted from a given list of transcript sequences, the second stores experimental tags observed in SAGE experiments, and the third allows the annotation of the transcript sequences used for virtual tag extraction. It therefore connects an observed tag to a virtual tag and to the sequence it comes from, and then to its functional annotation when available. Databases made from different species can be connected according to orthology relationships, thus allowing the comparison of SAGE libraries between species. We successfully used Identitag to identify tags from our chicken SAGE libraries and for an interspecies comparison of chicken and human SAGE tags. Identitag sources are freely available at the http://pbil.univ-lyon1.fr/software/identitag/ web site.
CONCLUSIONS: Identitag is a flexible and powerful tool for tag identification in any single species and for interspecies comparison of SAGE libraries. It opens the way to comparative transcriptomic analysis, an emerging branch of biology.
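The three-module structure described above (virtual tags, experimental tags, transcript annotation) can be sketched with an in-memory SQLite database. All table names, column names, and rows below are hypothetical and chosen only for illustration; they are not taken from the Identitag sources.

```python
import sqlite3

# Hypothetical schema: one table per module described in the abstract.
SCHEMA = """
CREATE TABLE transcript (
    transcript_id INTEGER PRIMARY KEY,
    accession     TEXT NOT NULL,
    annotation    TEXT
);
CREATE TABLE virtual_tag (
    tag           TEXT NOT NULL,
    transcript_id INTEGER NOT NULL REFERENCES transcript(transcript_id)
);
CREATE TABLE experimental_tag (
    library   TEXT NOT NULL,
    tag       TEXT NOT NULL,
    tag_count INTEGER NOT NULL
);
"""

def build_database():
    """Create an empty in-memory database with the three modules."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

def identify_tags(conn, library):
    """Link each observed tag to its virtual tag, transcript and annotation."""
    query = """
        SELECT e.tag, e.tag_count, t.accession, t.annotation
        FROM experimental_tag AS e
        LEFT JOIN virtual_tag AS v ON v.tag = e.tag
        LEFT JOIN transcript  AS t ON t.transcript_id = v.transcript_id
        WHERE e.library = ?
        ORDER BY e.tag
    """
    return conn.execute(query, (library,)).fetchall()

def demo():
    """Load a few made-up rows and identify the tags of one library."""
    conn = build_database()
    conn.execute(
        "INSERT INTO transcript VALUES (1, 'ACC0001', 'hypothetical annotation')"
    )
    conn.execute("INSERT INTO virtual_tag VALUES ('ACGTACGTAG', 1)")
    conn.executemany(
        "INSERT INTO experimental_tag VALUES (?, ?, ?)",
        [("lib1", "ACGTACGTAG", 12), ("lib1", "AAACCCGGGT", 3)],
    )
    return identify_tags(conn, "lib1")

if __name__ == "__main__":
    for row in demo():
        print(row)
```

The left joins keep experimental tags that match no virtual tag, so unidentified tags remain visible in the result with empty annotation fields.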
Project description:SAGE and MPSS libraries were produced from the same RNA sample extracted from an activated CD4+ T cell clone in order to compare the ability of these techniques to identify the full range of genes expressed in a single cell type. Keywords: Technical comparison of tag-based technologies SAGE and MPSS Overall design: One very large LongSAGE library (~500,000 tags) and three separate MPSS libraries were produced from a single RNA sample. Tags were linked to the human genome sequence and to the Ensembl database of known human genes in order to determine how many transcripts had been identified in the cell by each technique. Despite its much smaller library size, SAGE identified many more transcripts in the sample than MPSS. Because SAGE libraries may include many erroneous tags, we also considered only tags from known genes; SAGE still identified more transcripts in the sample than all three MPSS libraries combined.
Project description:BACKGROUND:Deep transcriptome analysis will underpin a large fraction of post-genomic biology. 'Closed' technologies, such as microarray analysis, only detect the set of transcripts chosen for analysis, whereas 'open' e.g. tag-based technologies are capable of identifying all possible transcripts, including those that were previously uncharacterized. Although new technologies are now emerging, at present the major resources for open-type analysis are the many publicly available SAGE (serial analysis of gene expression) and MPSS (massively parallel signature sequencing) libraries. These technologies have never been compared for their utility in the context of deep transcriptome mining. RESULTS:We used a single LongSAGE library of 503,431 tags and a "classic" MPSS library of 1,744,173 tags, both prepared from the same T cell-derived RNA sample, to compare the ability of each method to probe, at considerable depth, a human cellular transcriptome. We show that even though LongSAGE is more error-prone than MPSS, our LongSAGE library nevertheless generated 6.3-fold more genome-matching (and therefore likely error-free) tags than the MPSS library. An analysis of a set of 8,132 known genes detectable by both methods, and for which there is no ambiguity about tag matching, shows that MPSS detects only half (54%) the number of transcripts identified by SAGE (3,617 versus 1,955). Analysis of two additional MPSS libraries shows that each library samples a different subset of transcripts, and that in combination the three MPSS libraries (4,274,992 tags in total) still only detect 73% of the genes identified in our test set using SAGE. The fraction of transcripts detected by MPSS is likely to be even lower for uncharacterized transcripts, which tend to be more weakly expressed. The source of the loss of complexity in MPSS libraries compared to SAGE is unclear, but its effects become more severe with each sequencing cycle (i.e. as MPSS tag length increases). 
CONCLUSION:We show that MPSS libraries are significantly less complex than much smaller SAGE libraries, revealing a serious bias in the generation of MPSS data unlikely to have been circumvented by later technological improvements. Our results emphasize the need for the rigorous testing of new expression profiling technologies.
Project description:GermSAGE is a comprehensive web-based database generated by Serial Analysis of Gene Expression (SAGE) representing major stages in mouse male germ cell development, with 150,000 sequence tags in each SAGE library. A total of 452,095 tags derived from type A spermatogonia (Spga), pachytene spermatocytes (Spcy) and round spermatids (Sptd) were included. GermSAGE provides web-based tools for browsing, comparing and searching male germ cell transcriptome data at different stages with customizable searching parameters. The data can be visualized in a tabulated format or further analyzed by aligning with various annotations available in the UCSC genome browser. This flexible platform will be useful for gaining better understanding of the genetic networks that regulate spermatogonial cell renewal and differentiation, and will allow novel gene discovery. GermSAGE is freely available at http://germsage.nichd.nih.gov/
Project description:BACKGROUND: Oligoarrays have become an accessible technique for exploring the transcriptome, but it is presently unclear how absolute transcript data from this technique compare to the data achieved with tag-based quantitative techniques, such as massively parallel signature sequencing (MPSS) and serial analysis of gene expression (SAGE). By use of the TransCount method we calculated absolute transcript concentrations from spotted oligoarray intensities, enabling direct comparisons with tag counts obtained with MPSS and SAGE. The tag counts were converted to number of transcripts per cell by assuming that the sum of all transcripts in a single cell was 5 × 10^5. Our aim was to investigate whether the less resource-demanding and more widespread oligoarray technique could provide data that were correlated to and had the same absolute scale as those obtained with MPSS and SAGE. RESULTS: A total of 1,777 unique transcripts were detected in common by the three technologies and served as the basis for our analyses. The correlations involving the oligoarray data were not weaker than, but similar to, the correlation between the MPSS and SAGE data, both when the entire concentration range was considered and at high concentrations. The data sets were more strongly correlated at high transcript concentrations than at low concentrations. On an absolute scale, the number of transcripts per cell and gene was generally higher based on oligoarrays than on MPSS and SAGE, and ranged from 1.6 to 9,705 for the 1,777 overlapping genes. The MPSS data were on the same scale as the SAGE data, ranging from 0.5 to 3,180 (MPSS) and 9 to 1,268 (SAGE) transcripts per cell and gene. The sum of all transcripts per cell for these genes was 3.8 × 10^5 (oligoarrays), 1.1 × 10^5 (MPSS) and 7.6 × 10^4 (SAGE), whereas the corresponding sum for all detected transcripts was 1.1 × 10^6 (oligoarrays), 2.8 × 10^5 (MPSS) and 3.8 × 10^5 (SAGE).
CONCLUSION: The oligoarrays and TransCount provide quantitative transcript concentrations that are correlated to MPSS and SAGE data, but the absolute scale of the measurements differs across the technologies. The discrepancy raises the question of whether the sum of all transcripts within a single cell might be higher than the figure of 5 × 10^5 suggested in the literature and used to convert tag counts to transcripts per cell. If so, this may explain the apparently higher transcript detection efficiency of the oligoarrays, and has to be clarified before absolute transcript concentrations can be interchanged across the technologies. The ability to obtain transcript concentrations from oligoarrays opens up the possibility of efficient generation of universal transcript databases with low resource demands.
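The conversion from tag counts to transcripts per cell mentioned above amounts to rescaling each count by the library size and the assumed total of 5 × 10^5 transcripts per cell. A minimal sketch follows; the function name and the example counts are hypothetical, and the scaling is the simple proportional one implied by the abstract rather than a reproduction of the authors' exact procedure.

```python
TOTAL_TRANSCRIPTS_PER_CELL = 5e5   # value assumed in the abstract

def transcripts_per_cell(tag_counts, total_per_cell=TOTAL_TRANSCRIPTS_PER_CELL):
    """Rescale raw tag counts so that they sum to total_per_cell.

    tag_counts maps each tag (or gene) to its raw count in one library.
    """
    library_size = sum(tag_counts.values())
    if library_size <= 0:
        raise ValueError("library must contain at least one tag")
    return {
        tag: count * total_per_cell / library_size
        for tag, count in tag_counts.items()
    }

if __name__ == "__main__":
    # Hypothetical toy library of four tags in total.
    print(transcripts_per_cell({"tagA": 1, "tagB": 3}))
```

Because the result scales linearly with the assumed total, a larger true transcript number per cell would raise every MPSS- and SAGE-derived estimate by the same factor, which is the possibility raised in the conclusion.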
Project description:BACKGROUND: In testing for differential gene expression involving multiple serial analysis of gene expression (SAGE) libraries, it is critical to account for both between- and within-library variation. Several methods have been proposed, including the t test, tw test, and an overdispersed logistic regression approach. The merits of these tests, however, have not been fully evaluated, and it remains open whether further improvements can be made. RESULTS: In this article, we introduce an overdispersed log-linear model approach to analyzing SAGE; we evaluate and compare its performance with three other tests: the two-sample t test, the tw test and a test based on overdispersed logistic linear regression. Analysis of simulated and real datasets shows that both the log-linear and logistic overdispersion methods generally perform better than the t and tw tests; the log-linear method is further found to perform better than the logistic method, showing equal or higher statistical power over a range of parameter values and with different data distributions. CONCLUSION: Overdispersed log-linear models provide an attractive and reliable framework for analyzing SAGE experiments involving multiple libraries. For convenience, an implementation of this method is available through a user-friendly web interface at http://www.cbcb.duke.edu/sage.
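To make the notion of between-library variation concrete, the sketch below computes a simple Pearson dispersion estimate for one tag across several libraries of the same class. Values well above 1 suggest more variation than Poisson sampling alone would produce. This is a generic diagnostic shown under stated assumptions, not the overdispersed log-linear model of the article; the function name and example counts are hypothetical.

```python
def pearson_dispersion(counts, library_sizes):
    """Pearson chi-square divided by its degrees of freedom for one tag.

    counts[i] is the tag count in library i and library_sizes[i] is the
    total number of tags in that library. Expected counts assume one
    common tag proportion across all libraries (pure sampling error).
    """
    if len(counts) != len(library_sizes) or len(counts) < 2:
        raise ValueError("need matching counts and sizes for >= 2 libraries")
    total_count = sum(counts)
    total_size = sum(library_sizes)
    if total_count == 0:
        return 0.0                      # tag never observed: nothing to assess
    chi_square = 0.0
    for observed, size in zip(counts, library_sizes):
        expected = size * total_count / total_size
        chi_square += (observed - expected) ** 2 / expected
    return chi_square / (len(counts) - 1)

if __name__ == "__main__":
    # Counts proportional to library size: no extra variation.
    print(pearson_dispersion([10, 20, 30], [1000, 2000, 3000]))
    # Very different counts in equally sized libraries: overdispersed.
    print(pearson_dispersion([5, 50], [1000, 1000]))
```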
Project description:A plethora of research has focused on human embryonic stem cells (hESCs) ever since they were first reported, mainly because of their distinct features of self-renewal and pluripotency. Probing of the hESC transcriptome using global expression profiling tools such as DNA microarray, Serial Analysis of Gene Expression (SAGE), Massively Parallel Signature Sequencing (MPSS) and Expressed Sequence Tag (EST) analysis has contributed significantly to our current understanding of hESCs. In fact, a large number of markers to assess the pluripotent and differentiation status of hESCs have been presented by such studies. However, to date a robust set of markers for assessing the true state of hESCs has not been clearly identified. In this paper, we report the generation of long SAGE libraries for partially differentiated and differentiated HES3 along with a deeper profiling of our previously reported HES3 undifferentiated library. Clustering analysis of these libraries in concert with long and short SAGE libraries available in public databases, using Hierarchical Cluster Analysis (HCA) as well as a Poisson-based approach, helped in the identification of expression patterns distinct from those reported for the well-established pluripotency/differentiation markers. Almost all the previous studies reporting a robust set of markers have focused on a gene-by-gene comparison, listing the upregulated and downregulated genes, rather than looking at the expression patterns to establish a list of markers. In this analysis, however, we report a new set of markers and add confidence to some previously reported markers based on their novel expression patterns as identified by SAGE analysis, and confirm them by real-time PCR analysis.
For the real-time PCR confirmation, instead of taking two extreme data sets such as undifferentiated cells and a late-stage embryoid body, we profiled a time series of embryoid body stages so that we could identify those genes which show a dramatic increase or decrease upon differentiation and hence would serve as more reliable markers. The hESC line HES3 from ES Cell International, Singapore (http://www.escellinternational.com) was cultured on mouse embryonic fibroblast feeders (MEFs) as described previously. Selection of cells for library construction was done by micro-dissection following our established protocols. Spontaneous differentiation was induced by prolonged culture without changing the feeder layer. Briefly, the HES3 cell line was grown to 18 passages (P18) on MEFs, after which no sub-culturing was performed. The medium was changed daily for 25 days and the differentiated cell population was harvested at the end of the culture period by micro-dissection. Cell differentiation was confirmed by immunocytochemistry for markers of the three germ layers, as well as by RT-PCR for the differentiation markers. For construction of the partially differentiated library, colonies which had started differentiation, as suggested by morphological changes towards the centre and periphery of the colonies, were used. Entire HES3 colonies (24P) which had started differentiation were harvested by micro-dissection and used for library construction. MmeI was used as the tagging enzyme and libraries were constructed using the LS-SAGE kit from Invitrogen (http://www.invitrogen.com). Cloning of concatemerized ditags and sequencing of tags were done as outlined earlier. The 10 bp SAGE tags were extracted using Microsoft Excel for direct comparisons between libraries.
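The final step above, reducing long tags to 10 bp so that long and short SAGE libraries can be compared directly, was done in Microsoft Excel in the study. The same idea can be sketched in Python as follows. It assumes each tag is stored without the anchoring-enzyme site, so that the first 10 bases correspond to the short tag. The function name and example tags are hypothetical.

```python
from collections import Counter

def truncate_tags(tag_counts, length=10):
    """Collapse tags to their first `length` bases and sum the counts.

    tag_counts maps tag sequences to counts. Long tags that share the
    same leading bases are merged into a single short tag.
    """
    merged = Counter()
    for tag, count in tag_counts.items():
        if len(tag) < length:
            raise ValueError(f"tag {tag!r} is shorter than {length} bases")
        merged[tag[:length].upper()] += count
    return dict(merged)

if __name__ == "__main__":
    # Hypothetical long-tag library with two tags sharing a 10 bp prefix.
    long_library = {
        "ACGTACGTAGCCCCCCC": 4,
        "ACGTACGTAGTTTTTTT": 2,
        "TTTTTTTTTTAAAAAAA": 1,
    }
    print(truncate_tags(long_library))
```

Merging counts in this way loses the extra specificity of the long tags, which is the price of making the libraries directly comparable.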