Strain/species identification in metagenomes using genome-specific markers.
ABSTRACT: Shotgun metagenome sequencing has become a fast, cheap and high-throughput technology for characterizing microbial communities in complex environments and human body sites. However, accurate identification of microorganisms at the strain/species level remains extremely challenging. We present a novel k-mer-based approach, termed GSMer, that identifies genome-specific markers (GSMs) from currently sequenced microbial genomes, which are then used for strain/species-level identification in metagenomes. Using 5390 sequenced microbial genomes, 8 770 321 50-mer strain-specific and 11 736 360 species-specific GSMs were identified for 4088 strains and 2005 species (4933 strains), respectively. The GSMs were first evaluated against mock community metagenomes, recently sequenced genomes and real metagenomes from different body sites, suggesting that the identified GSMs were specific to their target genomes. Sensitivity evaluation against synthetic metagenomes with different coverage suggested that 50 GSMs per strain were sufficient to identify most microbial strains with ≥0.25× coverage, and that 10% of selected GSMs in a database should be detected for a confident positive call. Application of GSMs identified 45 and 74 microbial strains/species significantly associated with type 2 diabetes patients and obese/lean individuals from corresponding gastrointestinal tract metagenomes, respectively. Our results agreed with previous studies but provided strain-level information. The approach can be directly applied to identify microbial strains/species from raw metagenomes, without the effort of complex data pre-processing.
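The core idea of genome-specific k-mer markers can be illustrated with a toy sketch. It assumes exact k-mer matching on the forward strand only and treats a marker as "genome-specific" when it occurs in exactly one input genome. The function names (`genome_specific_kmers`, `detected_fraction`, `call_present`) are hypothetical and are not taken from GSMer, whose pipeline may apply further filtering that the abstract does not describe. The 10% detection threshold mirrors the figure quoted in the abstract.

```python
from collections import defaultdict

def kmers(seq, k):
    """Return the set of all k-mers in seq (forward strand only, for brevity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def genome_specific_kmers(genomes, k):
    """Map each genome name to the k-mers found in that genome and in no other."""
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for km in kmers(seq, k):
            owners[km].add(name)
    specific = {name: set() for name in genomes}
    for km, names in owners.items():
        if len(names) == 1:
            specific[next(iter(names))].add(km)
    return specific

def detected_fraction(markers, reads, k):
    """Fraction of a genome's markers seen at least once in the reads."""
    if not markers:
        return 0.0
    seen = set()
    for read in reads:
        seen |= kmers(read, k) & markers
    return len(seen) / len(markers)

def call_present(markers, reads, k, min_fraction=0.10):
    """Call a genome present when at least min_fraction of its markers are detected."""
    return detected_fraction(markers, reads, k) >= min_fraction

# Toy example with k=4 (the paper uses 50-mers on real genomes).
genomes = {"A": "ACGTACGTTT", "B": "ACGTACGGGG"}
markers = genome_specific_kmers(genomes, k=4)
reads = ["TACGTTT"]
print(call_present(markers["A"], reads, k=4))  # markers of A are covered by the read
print(call_present(markers["B"], reads, k=4))  # no marker of B appears in the read
```

In practice, holding every k-mer of thousands of genomes in a Python dictionary would not scale; the sketch is meant only to show the set logic behind marker selection and thresholded detection.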
Project description:Specific identification of microorganisms in the environment is important but challenging, especially at the species/strain level. Here, we have developed a novel k-mer-based approach to select strain/species-specific probes for microbial identification with diagnostic microarrays. Application of this approach to human microbiome genomes showed that multiple (≥10 probes per strain) strain-specific 50-mer oligonucleotide probes could be designed for 2,012 of 3,421 bacterial strains of the human microbiome, and species-specific probes could be designed for most of the other strains. The method can also be used to select strain/species-specific probes for sequenced genomes from any environment, such as soil and water.
Project description:High-throughput short-read metagenomics has enabled large-scale species-level analysis and functional characterization of microbial communities. Microbiomes often contain multiple strains of the same species, and different strains have been shown to have important differences in their functional roles. Recent advances in long-read-based methods have enabled accurate assembly of bacterial genomes from complex microbiomes and opened an as-yet-unrealized opportunity to resolve strains. Here we present Strainberry, a metagenome assembly pipeline that performs strain separation in single-sample low-complexity metagenomes and that relies uniquely on long-read data. We benchmarked Strainberry on mock communities, for which it produces strain-resolved assemblies with near-complete reference coverage and 99.9% base accuracy. We also applied Strainberry to real datasets, for which it improved assemblies, generating 20-118% more genomic material than conventional metagenome assemblies on individual strain genomes. We show that Strainberry is also able to refine microbial diversity in a complex microbiome, with complete separation of strain genomes. We anticipate this work to be a starting point for further methodological improvements in strain-resolved metagenome assembly in environments of higher complexity.
Project description:Microbial genomes are available at an ever-increasing pace, as cultivation and sequencing become cheaper and obtaining metagenome-assembled genomes (MAGs) becomes more effective. Phylogenetic placement methods to contextualize hundreds of thousands of genomes must thus be efficiently scalable and sensitive from closely related strains to divergent phyla. We present PhyloPhlAn 3.0, an accurate, rapid, and easy-to-use method for large-scale microbial genome characterization and phylogenetic analysis at multiple levels of resolution. PhyloPhlAn 3.0 can assign genomes from isolate sequencing or MAGs to species-level genome bins built from >230,000 publicly available sequences. For individual clades of interest, it reconstructs strain-level phylogenies from among the closest species using clade-specific maximally informative markers. At the other extreme of resolution, it scales to large phylogenies comprising >17,000 microbial species. Examples including Staphylococcus aureus isolates, gut metagenomes, and meta-analyses demonstrate the ability of PhyloPhlAn 3.0 to support genomic and metagenomic analyses.
Project description:Analyses of metagenomic datasets that are sequenced to a depth of billions or trillions of bases can uncover hundreds of microbial genomes, but naive assembly of these data is computationally intensive, requiring hundreds of gigabytes to terabytes of RAM. We present latent strain analysis (LSA), a scalable, de novo pre-assembly method that separates reads into biologically informed partitions and thereby enables assembly of individual genomes. LSA is implemented with a streaming calculation of unobserved variables that we call eigengenomes. Eigengenomes reflect covariance in the abundance of short, fixed-length sequences, or k-mers. As the abundance of each genome in a sample is reflected in the abundance of each k-mer in that genome, eigengenome analysis can be used to partition reads from different genomes. This partitioning can be done in fixed memory using tens of gigabytes of RAM, which makes assembly and downstream analyses of terabytes of data feasible on commodity hardware. Using LSA, we assemble partial and near-complete genomes of bacterial taxa present at relative abundances as low as 0.00001%. We also show that LSA is sensitive enough to separate reads from several strains of the same species.
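The notion that covariance in k-mer abundance across samples can separate genomes can be illustrated with a small sketch. This is not the LSA implementation: it does not stream or hash k-mers, and it recovers only the leading component by plain power iteration, whereas the abstract describes a streaming calculation over many eigengenomes. The abundance matrix, the function names, and the sign-based partition are all assumptions made for illustration.

```python
import math

def center_rows(matrix):
    """Subtract each row's mean so rows describe variation across samples."""
    return [[x - sum(row) / len(row) for x in row] for row in matrix]

def leading_component(matrix, iterations=200):
    """Power iteration for the leading eigenvector of X X^T (X = row-centered matrix).

    Rows are k-mers, columns are samples; the result has one loading per k-mer.
    """
    x = center_rows(matrix)
    n_rows, n_cols = len(x), len(x[0])
    v = [1.0] + [0.0] * (n_rows - 1)  # fixed start vector keeps the sketch deterministic
    for _ in range(iterations):
        # u = X^T v (one value per sample), then w = X u (one value per k-mer)
        u = [sum(x[i][j] * v[i] for i in range(n_rows)) for j in range(n_cols)]
        w = [sum(x[i][j] * u[j] for j in range(n_cols)) for i in range(n_rows)]
        norm = math.sqrt(sum(val * val for val in w))
        if norm == 0.0:
            return v
        v = [val / norm for val in w]
    return v

def partition_by_sign(kmer_names, loadings):
    """Split k-mers into two groups by the sign of their loading."""
    pos = {n for n, l in zip(kmer_names, loadings) if l >= 0}
    neg = {n for n, l in zip(kmer_names, loadings) if l < 0}
    return pos, neg

# Hypothetical abundances of four k-mers across three samples.
# a1/a2 follow one genome's abundance profile, b1/b2 follow another's.
names = ["a1", "a2", "b1", "b2"]
abundance = [
    [10, 1, 5],
    [20, 2, 10],
    [1, 10, 5],
    [2, 20, 10],
]
groups = partition_by_sign(names, leading_component(abundance))
print(groups)
```

K-mers whose abundances rise and fall together across samples receive loadings of the same sign, so the two genomes fall into different groups. Reads could then be assigned to the group holding most of their k-mers, which is the partitioning step the abstract refers to.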
Project description:The metaMicrobesOnline database (freely available at http://meta.MicrobesOnline.org) offers phylogenetic analysis of genes from microbial genomes and metagenomes. Gene trees are constructed for canonical gene families such as COG and Pfam. Such gene trees allow for rapid homologue analysis and subfamily comparison of genes from multiple metagenomes and comparisons with genes from microbial isolates. Additionally, the genome browser permits genome context comparisons, which may be used to determine the closest sequenced genome or suggest functionally associated genes. Lastly, the domain browser permits rapid comparison of protein domain organization within genes of interest from metagenomes and complete microbial genomes.
Project description:Recent big data analyses have illuminated marine microbial diversity from a global perspective, focusing on planktonic microorganisms. Here, we analyze 2.5 terabases of newly sequenced datasets and the Tara Oceans metagenomes to study the diversity of biofilm-forming marine microorganisms. We identify more than 7,300 biofilm-forming 'species' that are undetected in seawater analyses, increasing the known microbial diversity in the oceans by more than 20%, and provide evidence for differentiation across oceanic niches. Generation of a gene distribution profile reveals a functional core across the biofilms, comprised of genes from a variety of microbial phyla that may play roles in stress responses and microbe-microbe interactions. Analysis of 479 genomes reconstructed from the biofilm metagenomes reveals novel biosynthetic gene clusters and CRISPR-Cas systems. Our data highlight the previously underestimated ocean microbial diversity, and allow mining novel microbial lineages and gene resources.
Project description:Reconstructing microbial genomes from metagenomic short-read data can be challenging due to the unknown and uneven complexity of microbial communities. This complexity encompasses highly diverse populations, which often includes strain variants. Reconstructing high-quality genomes is a crucial part of the metagenomic workflow, as subsequent ecological and metabolic inferences depend on their accuracy, quality, and completeness. In contrast to microbial communities in other ecosystems, there has been no systematic assessment of genome-centric metagenomic workflows for drinking water microbiomes. In this study, we assessed the performance of a combination of assembly and binning strategies for time series drinking water metagenomes that were collected over 6 months. The goal of this study was to identify the combination of assembly and binning approaches that result in high-quality and -quantity metagenome-assembled genomes (MAGs), representing most of the sequenced metagenome. Our findings suggest that the metaSPAdes coassembly strategies had the best performance, as they resulted in larger and less fragmented assemblies, with at least 85% of the sequence data mapping to contigs greater than 1 kbp. Furthermore, a combination of metaSPAdes coassembly strategies and MetaBAT2 produced the highest number of medium-quality MAGs while capturing at least 70% of the metagenomes based on read recruitment. Utilizing different assembly/binning approaches also assists in the reconstruction of unique MAGs from closely related species that would have otherwise collapsed into a single MAG using a single workflow. Overall, our study suggests that leveraging multiple binning approaches with different metaSPAdes coassembly strategies may be required to maximize the recovery of good-quality MAGs. 
IMPORTANCE: Drinking water contains phylogenetically diverse groups of bacteria, archaea, and eukarya that affect the esthetic quality of water, water infrastructure, and public health. Taxonomic, metabolic, and ecological inferences of the drinking water microbiome depend on the accuracy, quality, and completeness of genomes that are reconstructed through the application of genome-resolved metagenomics. Using time series metagenomic data, we present reproducible genome-centric metagenomic workflows that result in high-quality and -quantity genomes, which more accurately represent the sequenced drinking water microbiome. These genome-centric metagenomic workflows will allow for improved taxonomic and functional potential analysis that offers enhanced insights into the stability and dynamics of drinking water microbial communities.
Project description:Frankia strains induce the formation of nitrogen-fixing nodules on roots of actinorhizal plants. Phylogenetically, Frankia strains can be grouped in four clusters. The earliest divergent cluster, cluster-2, has a particularly wide host range. The analysis of cluster-2 strains has been hampered by the fact that with two exceptions, they could never be cultured. In this study, 12 Frankia-enriched metagenomes of Frankia cluster-2 strains or strain assemblages were sequenced based on seven inoculum sources. Sequences obtained via DNA isolated from whole nodules were compared with those of DNA isolated from fractionated preparations enhanced in the Frankia symbiotic structures. The results show that cluster-2 inocula represent groups of strains, and that strains not represented in symbiotic structures, that is, unable to perform symbiotic nitrogen fixation, may still be able to colonize nodules. Transposase gene abundance was compared in the different Frankia-enriched metagenomes with the result that North American strains contain more transposase genes than Eurasian strains. An analysis of the evolution and distribution of the host plants indicated that bursts of transposition may have coincided with niche competition with other cluster-2 Frankia strains. The first genome of an inoculum from the Southern Hemisphere, obtained from nodules of Coriaria papuana in Papua New Guinea, represents a novel species, postulated as Candidatus Frankia meridionalis. All Frankia-enriched metagenomes obtained in this study contained homologs of the canonical nod genes nodABC; the North American genomes also contained the sulfotransferase gene nodH, while the genome from the Southern Hemisphere only contained nodC and a truncated copy of nodB.
Project description:Sequencing technologies are generating enormous amounts of read data; however, assembly of genomes and metagenomes remains among the most challenging tasks. In this paper we study the comparison of genomes and metagenomes based only on read data, using word-count statistics called alignment-free, which do not require reference genomes or assemblies. Quality scores produced by sequencing platforms are fundamental for various analyses; moreover, future-generation sequencing platforms will produce longer reads but with an error rate of around 15%. In this context it will be fundamental to exploit quality-value information within the framework of alignment-free measures. In this paper we present a family of alignment-free measures, called d^q-type, that are based on k-mer counts and quality values. These statistics can be used to compare genomes and metagenomes based on their read sets. Results show that the evolutionary relationship of genomes can be reconstructed based on the direct comparison of their read sets. The use of quality values on average improves the classification accuracy, and its contribution increases when the reads are noisier. Also, the comparison of metagenomic microbial communities can be performed efficiently. Similar metagenomes are quickly detected, just by processing their read data, without the need for costly alignments.
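One way to combine k-mer counts with quality values is sketched below. It is an assumption for illustration, not the definition of the measures in the paper: each k-mer occurrence is weighted by the probability that all of its bases were called correctly, derived from Phred scores, and two read sets are then compared with a plain inner product of the weighted counts (in the spirit of a D2-style statistic). All function names are hypothetical.

```python
from collections import Counter

def phred_to_correct_prob(q):
    """Probability that a base call with Phred score q is correct."""
    return 1.0 - 10 ** (-q / 10.0)

def weighted_kmer_counts(reads, k):
    """Quality-weighted k-mer counts.

    reads: iterable of (sequence, list of per-base Phred scores).
    Each occurrence contributes the product of its bases' correctness
    probabilities instead of 1 (an assumed weighting scheme).
    """
    counts = Counter()
    for seq, quals in reads:
        probs = [phred_to_correct_prob(q) for q in quals]
        for i in range(len(seq) - k + 1):
            weight = 1.0
            for p in probs[i:i + k]:
                weight *= p
            counts[seq[i:i + k]] += weight
    return counts

def d2_similarity(counts_a, counts_b):
    """Inner product of two weighted count vectors (larger means more similar)."""
    return sum(v * counts_b.get(km, 0.0) for km, v in counts_a.items())

# Toy read sets: same sequence, different base qualities.
high_quality = [("ACGT", [40, 40, 40, 40])]
low_quality = [("ACGT", [10, 10, 10, 10])]
ref = weighted_kmer_counts(high_quality, k=2)
print(d2_similarity(ref, weighted_kmer_counts(high_quality, k=2)))
print(d2_similarity(ref, weighted_kmer_counts(low_quality, k=2)))
```

Low-quality occurrences contribute less to the counts, so noisy reads pull the similarity score less than confident ones. This matches the abstract's observation that quality values matter more as reads become noisier.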
Project description:AlkB and CYP153 are important alkane hydroxylases responsible for aerobic alkane degradation in bioremediation of oil-polluted environments and microbial enhanced oil recovery. Since their distribution in nature is not clear, we surveyed the 3,979 microbial genomes and 137 metagenomes sequenced thus far from terrestrial, freshwater, and marine environments. Hundreds of diverse alkB and CYP153 genes, including many novel ones, were found in bacterial genomes, whereas none were found in archaeal genomes. Moreover, these genes were detected with different distributional patterns in the terrestrial, freshwater, and marine metagenomes. Hints of horizontal gene transfer, gene duplication, and gene fusion were found, which together are likely responsible for diversifying the alkB and CYP153 genes to adapt to the ubiquitous distribution of different alkanes in nature. In addition, different distributions of these genes between bacterial genomes and metagenomes suggested potentially important roles for unknown or less common alkane degraders in nature.