Integrative analysis of environmental sequences using MEGAN4.
ABSTRACT: A major challenge in the analysis of environmental sequences is data integration. The question is how to analyze different types of data in a unified approach, addressing both the taxonomic and functional aspects. To facilitate such analyses, we have substantially extended MEGAN, a widely used taxonomic analysis program. The new program, MEGAN4, provides an integrated approach to the taxonomic and functional analysis of metagenomic, metatranscriptomic, metaproteomic, and rRNA data. While taxonomic analysis is performed based on the NCBI taxonomy, functional analysis is performed using the SEED classification of subsystems and functional roles or the KEGG classification of pathways and enzymes. A number of examples illustrate how such analyses can be performed, and show that one can also import and compare classification results obtained using others' tools. MEGAN4 is freely available for academic purposes, and installers for all three major operating systems can be downloaded from www-ab.informatik.uni-tuebingen.de/software/megan.
Project description:BACKGROUND: Metagenomics is a rapidly growing field of research aimed at studying assemblages of uncultured organisms using various sequencing technologies, with the hope of understanding the true diversity of microbes, their functions, cooperation and evolution. There are two main approaches to metagenomics: amplicon sequencing, which involves PCR-targeted sequencing of a specific locus, often 16S rRNA, and random shotgun sequencing. Several tools or packages have been developed for analyzing communities using 16S rRNA sequences. Similarly, a number of tools exist for analyzing randomly sequenced DNA reads. RESULTS: We describe an extension of the metagenome analysis tool MEGAN, which allows one to analyze 16S sequences. For the analysis all 16S sequences are blasted against the SILVA database. The result output is imported into MEGAN, using a synonym file that maps the SILVA accession numbers onto the NCBI taxonomy. CONCLUSIONS: Environmental samples are often studied using both targeted 16S rRNA sequencing and random shotgun sequencing. Hence tools are needed that allow one to analyze both types of data together, and one such tool is MEGAN. The ideas presented in this paper are implemented in MEGAN 4, which is available from: http://www-ab.informatik.uni-tuebingen.de/software/megan.
Project description:Metaproteomics enables the investigation of the protein repertoire expressed by complex microbial communities. However, to unleash its full potential, refinements in bioinformatic approaches for data analysis are still needed. In this context, sequence databases selection represents a major challenge. This work assessed the impact of different databases in metaproteomic investigations by using a mock microbial mixture including nine diverse bacterial and eukaryotic species, which was subjected to shotgun metaproteomic analysis. Then, both the microbial mixture and the single microorganisms were subjected to next generation sequencing to obtain experimental metagenomic- and genomic-derived databases, which were used along with public databases (namely, NCBI, UniProtKB/SwissProt and UniProtKB/TrEMBL, parsed at different taxonomic levels) to analyze the metaproteomic dataset. First, a quantitative comparison in terms of number and overlap of peptide identifications was carried out among all databases. As a result, only 35% of peptides were common to all database classes; moreover, genus/species-specific databases provided up to 17% more identifications compared to databases with generic taxonomy, while the metagenomic database enabled a slight increment in respect to public databases. Then, database behavior in terms of false discovery rate and peptide degeneracy was critically evaluated. Public databases with generic taxonomy exhibited a markedly different trend compared to the counterparts. Finally, the reliability of taxonomic attribution according to the lowest common ancestor approach (using MEGAN and Unipept software) was assessed. The level of misassignments varied among the different databases, and specific thresholds based on the number of taxon-specific peptides were established to minimize false positives. This study confirms that database selection has a significant impact in metaproteomics, and provides critical indications for improving depth and reliability of metaproteomic results. Specifically, the use of iterative searches and of suitable filters for taxonomic assignments is proposed with the aim of increasing coverage and trustworthiness of metaproteomic data.
Project description:There is increasing interest in employing shotgun sequencing, rather than amplicon sequencing, to analyze microbiome samples. Typical projects may involve hundreds of samples and billions of sequencing reads. The comparison of such samples against a protein reference database generates billions of alignments and the analysis of such data is computationally challenging. To address this, we have substantially rewritten and extended our widely-used microbiome analysis tool MEGAN so as to facilitate the interactive analysis of the taxonomic and functional content of very large microbiome datasets. Other new features include a functional classifier called InterPro2GO, gene-centric read assembly, principal coordinate analysis of taxonomy and function, and support for metadata. The new program is called MEGAN Community Edition (CE) and is open source. By integrating MEGAN CE with our high-throughput DNA-to-protein alignment tool DIAMOND and by providing a new program MeganServer that allows access to metagenome analysis files hosted on a server, we provide a straightforward, yet powerful and complete pipeline for the analysis of metagenome shotgun sequences. We illustrate how to perform a full-scale computational analysis of a metagenomic sequencing project, involving 12 samples and 800 million reads, in less than three days on a single server. All source code is available here: https://github.com/danielhuson/megan-ce.
Project description:UNLABELLED: We present MetaRoute, an efficient search algorithm based on atom mapping rules and path weighting schemes that returns relevant or textbook-like routes between a source and a product metabolite within seconds for genome-scale networks. Its speed allows the algorithm to be used interactively through a web interface to visualize relevant routes and local networks for one or multiple organisms based on data from KEGG. AVAILABILITY: http://www-bs.informatik.uni-tuebingen.de/Services/MetaRoute. SUPPLEMENTARY INFORMATION: Supplementary details are available at http://www-bs.informatik.uni-tuebingen.de/Services/MetaRoute.
Project description:Given the absence of universal marker genes in the viral kingdom, researchers typically use BLAST (with stringent E-values) for taxonomic classification of viral metagenomic sequences. Since majority of metagenomic sequences originate from hitherto unknown viral groups, using stringent e-values results in most sequences remaining unclassified. Furthermore, using less stringent e-values results in a high number of incorrect taxonomic assignments. The SOrt-ITEMS algorithm provides an approach to address the above issues. Based on alignment parameters, SOrt-ITEMS follows an elaborate work-flow for assigning reads originating from hitherto unknown archaeal/bacterial genomes. In SOrt-ITEMS, alignment parameter thresholds were generated by observing patterns of sequence divergence within and across various taxonomic groups belonging to bacterial and archaeal kingdoms. However, many taxonomic groups within the viral kingdom lack a typical Linnean-like taxonomic hierarchy. In this paper, we present ProViDE (Program for Viral Diversity Estimation), an algorithm that uses a customized set of alignment parameter thresholds, specifically suited for viral metagenomic sequences. These thresholds capture the pattern of sequence divergence and the non-uniform taxonomic hierarchy observed within/across various taxonomic groups of the viral kingdom. Validation results indicate that the percentage of 'correct' assignments by ProViDE is around 1.7 to 3 times higher than that by the widely used similarity based method MEGAN. The misclassification rate of ProViDE is around 3 to 19% (as compared to 5 to 42% by MEGAN) indicating significantly better assignment accuracy. ProViDE software and a supplementary file (containing supplementary figures and tables referred to in this article) is available for download from http://metagenomics.atc.tcs.com/binning/ProViDE/
Project description:High throughput sequencing of 16S rRNA gene leads us into a deeper understanding on bacterial diversity for complex environmental samples, but introduces blurring due to the relatively low taxonomic capability of short read. For wastewater treatment plant, only those functional bacterial genera categorized as nutrient remediators, bulk/foaming species, and potential pathogens are significant to biological wastewater treatment and environmental impacts. Precise taxonomic assignment of these bacteria at least at genus level is important for microbial ecological research and routine wastewater treatment monitoring. Therefore, the focus of this study was to evaluate the taxonomic precisions of different ribosomal RNA (rRNA) gene hypervariable regions generated from a mix activated sludge sample. In addition, three commonly used classification methods including RDP Classifier, BLAST-based best-hit annotation, and the lowest common ancestor annotation by MEGAN were evaluated by comparing their consistency. Under an unsupervised way, analysis of consistency among different classification methods suggests there are no hypervariable regions with good taxonomic coverage for all genera. Taxonomic assignment based on certain regions of the 16S rRNA genes, e.g. the V1&V2 regions - provide fairly consistent taxonomic assignment for a relatively wide range of genera. Hence, it is recommended to use these regions for studying functional groups in activated sludge. Moreover, the inconsistency among methods also demonstrated that a specific method might not be suitable for identification of some bacterial genera using certain 16S rRNA gene regions. As a general rule, drawing conclusions based only on one sequencing region and one classification method should be avoided due to the potential false negative results.
Project description:Sequencing of taxonomic or phylogenetic markers is becoming a fast and efficient method for studying environmental microbial communities. This has resulted in a steadily growing collection of marker sequences, most notably of the small-subunit (SSU) ribosomal RNA gene, and an increased understanding of microbial phylogeny, diversity and community composition patterns. However, to utilize these large datasets together with new sequencing technologies, a reliable and flexible system for taxonomic classification is critical. We developed CREST (Classification Resources for Environmental Sequence Tags), a set of resources and tools for generating and utilizing custom taxonomies and reference datasets for classification of environmental sequences. CREST uses an alignment-based classification method with the lowest common ancestor algorithm. It also uses explicit rank similarity criteria to reduce false positives and identify novel taxa. We implemented this method in a web server, a command line tool and the graphical user interfaced program MEGAN. Further, we provide the SSU rRNA reference database and taxonomy SilvaMod, derived from the publicly available SILVA SSURef, for classification of sequences from bacteria, archaea and eukaryotes. Using cross-validation and environmental datasets, we compared the performance of CREST and SilvaMod to the RDP Classifier. We also utilized Greengenes as a reference database, both with CREST and the RDP Classifier. These analyses indicate that CREST performs better than alignment-free methods with higher recall rate (sensitivity) as well as precision, and with the ability to accurately identify most sequences from novel taxa. Classification using SilvaMod performed better than with Greengenes, particularly when applied to environmental sequences. CREST is freely available under a GNU General Public License (v3) from http://apps.cbu.uib.no/crest and http://lcaclassifier.googlecode.com.
Project description:Computational analysis of metagenomes requires the taxonomical assignment of the genome contigs assembled from DNA reads of environmental samples. Because of the diverse nature of microbiomes, the length of the assemblies obtained can vary between a few hundred bp to a few hundred Kbp. Current taxonomic classification algorithms provide accurate classification for long contigs or for short fragments from organisms that have close relatives with annotated genomes. These are significant limitations for metagenome analysis because of the complexity of microbiomes and the paucity of existing annotated genomes.We propose a robust taxonomic classification method, RAIphy, that uses a novel sequence similarity metric with iterative refinement of taxonomic models and functions effectively without these limitations. We have tested RAIphy with synthetic metagenomics data ranging between 100 bp to 50 Kbp. Within a sequence read range of 100 bp-1000 bp, the sensitivity of RAIphy ranges between 38%-81% outperforming the currently popular composition-based methods for reads in this range. Comparison with computationally more intensive sequence similarity methods shows that RAIphy performs competitively while being significantly faster. The sensitivity-specificity characteristics for relatively longer contigs were compared with the PhyloPythia and TACOA algorithms. RAIphy performs better than these algorithms at varying clade-levels. For an acid mine drainage (AMD) metagenome, RAIphy was able to taxonomically bin the sequence read set more accurately than the currently available methods, Phymm and MEGAN, and more accurately in two out of three tests than the much more computationally intensive method, PhymmBL.With the introduction of the relative abundance index metric and an iterative classification method, we propose a taxonomic classification algorithm that performs competitively for a large range of DNA contig lengths assembled from metagenome data. Because of its speed, simplicity, and accuracy RAIphy can be successfully used in the binning process for a broad range of metagenomic data obtained from environmental samples.
Project description:The etiology of dental caries is multifactorial, but frequent consumption of free sugars, notably sucrose, appears to be a major factor driving the supragingival microbiota in the direction of dysbiosis. Recent 16S rRNA-based studies indicated that caries-associated communities were less diverse than healthy supragingival plaque but still displayed considerable taxonomic diversity between individuals. Metagenomic studies likewise have found that healthy oral sites from different people were broadly similar with respect to gene function, even though there was an extensive individual variation in their taxonomic profiles. That pattern may also extend to dysbiotic communities. In that case, shifts in community-wide protein relative abundance might provide better biomarkers of dysbiosis that can be achieved through taxonomy alone.In this study, we used a paired oral microcosm biofilm model of dental caries to investigate differences in community composition and protein relative abundance in the presence and absence of sucrose. This approach provided large quantities of protein, which facilitated deep metaproteomic analysis. Community composition was evaluated using 16S rRNA sequencing and metaproteomic approaches. Although taxonomic diversity was reduced by sucrose pulsing, considerable inter-subject variation in community composition remained. By contrast, functional analysis using the SEED ontology found that sucrose induced changes in protein relative abundance patterns for pathways involving glycolysis, lactate production, aciduricity, and ammonia/glutamate metabolism that were conserved across taxonomically diverse dysbiotic oral microcosm biofilm communities.Our findings support the concept of using function-based changes in protein relative abundance as indicators of dysbiosis. Our microcosm model cannot replicate all aspects of the oral environment, but the deep level of metaproteomic analysis it allows makes it suitable for discovering which proteins are most consistently abundant during dysbiosis. It then may be possible to define biomarkers that could be used to detect at-risk tooth surfaces before the development of overt carious lesions.
Project description:Nuclear large subunit ribosomal DNA is widely used in fungal phylogenetics and to an increasing extent also amplicon-based environmental sequencing. The relatively short reads produced by next-generation sequencing, however, makes primer choice and sequence error important variables for obtaining accurate taxonomic classifications. In this simulation study we tested the performance of three classification methods: 1) a similarity-based method (BLAST + Metagenomic Analyzer, MEGAN); 2) a composition-based method (Ribosomal Database Project naïve bayesian classifier, NBC); and, 3) a phylogeny-based method (Statistical Assignment Package, SAP). We also tested the effects of sequence length, primer choice, and sequence error on classification accuracy and perceived community composition. Using a leave-one-out cross validation approach, results for classifications to the genus rank were as follows: BLAST + MEGAN had the lowest error rate and was particularly robust to sequence error; SAP accuracy was highest when long LSU query sequences were classified; and, NBC runs significantly faster than the other tested methods. All methods performed poorly with the shortest 50-100 bp sequences. Increasing simulated sequence error reduced classification accuracy. Community shifts were detected due to sequence error and primer selection even though there was no change in the underlying community composition. Short read datasets from individual primers, as well as pooled datasets, appear to only approximate the true community composition. We hope this work informs investigators of some of the factors that affect the quality and interpretation of their environmental gene surveys.