Project description:Genome sequencing is rapidly being adopted in reference labs and hospitals for bacterial outbreak investigation and diagnostics where time is critical. Seven-gene multi-locus sequence typing is a standard tool for broadly classifying samples into sequence types (STs), which in many cases allows a sample to be ruled out of an outbreak or general characteristics of a bacterial strain to be inferred. Long-read sequencing technologies, such as those from Oxford Nanopore, can produce read data within minutes of an experiment starting, unlike short-read sequencing technologies, which require many hours or days. However, the error rates of raw uncorrected long-read data are very high. We present Krocus, which can predict an ST directly from uncorrected long reads and which was designed to consume read data as it is produced, providing results in minutes. It is the only tool that can do this from uncorrected long reads. We tested Krocus on over 700 isolates sequenced using long-read sequencing technologies from Pacific Biosciences and Oxford Nanopore. It provides STs for isolates on average within 90 s, with a sensitivity of 94% and specificity of 97% on real sample data, directly from uncorrected raw sequence reads. The software is written in Python and is available under the open source license GNU GPL version 3.
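To make the idea of a sequence type concrete, the following is a minimal sketch of the final lookup step that any seven-gene MLST scheme implies: once an allele number has been called for each of the seven loci, the ordered allele profile is looked up in a profile table to obtain the ST. This is not Krocus's read-level algorithm, which the abstract does not describe; the locus names, allele numbers, profile table, and the function `lookup_st` are all hypothetical placeholders chosen for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical seven-locus scheme: locus names are illustrative placeholders,
# not taken from any real MLST database.
LOCI = ("geneA", "geneB", "geneC", "geneD", "geneE", "geneF", "geneG")

# Hypothetical profile table mapping an allele-number tuple
# (ordered as in LOCI) to a sequence type.
PROFILES: Dict[Tuple[int, ...], int] = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (1, 2, 1, 1, 3, 1, 1): 2,
    (4, 2, 1, 5, 3, 1, 2): 3,
}


def lookup_st(alleles: Dict[str, int]) -> Optional[int]:
    """Return the ST for a complete allele call.

    Returns None when a locus has not been called yet or when the
    profile is not present in the table.
    """
    if any(locus not in alleles for locus in LOCI):
        return None
    key = tuple(alleles[locus] for locus in LOCI)
    return PROFILES.get(key)
```

In a streaming setting such as the one described above, a lookup of this kind could simply be repeated as allele calls are revised with incoming reads; returning `None` for an incomplete profile keeps the caller from reporting an ST before all seven loci are covered.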
Project description:Metagenomics projects collect DNA from uncharacterized environments that may contain thousands of species per sample. One main challenge facing metagenomic analysis is phylogenetic classification of raw sequence reads into groups representing the same or similar taxa, a prerequisite for genome assembly and for analyzing the biological diversity of a sample. New sequencing technologies have made metagenomics easier, by making sequencing faster, and more difficult, by producing shorter reads than previous technologies. Classifying sequences from reads as short as 100 base pairs has until now been relatively inaccurate, requiring researchers to use older, long-read technologies. We present Phymm, a classifier for metagenomic data, that has been trained on 539 complete, curated genomes and can accurately classify reads as short as 100 base pairs, a substantial improvement over previous composition-based classification methods. We also describe how combining Phymm with sequence alignment algorithms improves accuracy.
Project description:Shorea balangeran Burk, locally known as balangeran, has been widely used as a recommended species for tropical peat swamp forest restoration because of its capability to grow in both waterlogged and dry areas. However, information on the genetic basis of its adaptation to varying ecological conditions is limited, and no transcriptome study has been reported in this context. Here we report two sets of transcriptome data from leaf and basal stem samples taken from seedlings growing in potted media containing peat and mineral soil. The raw reads are stored in the DDBJ platform under accession number DRA008633.
Project description:Soil metagenomics has been touted as the "grand challenge" for metagenomics, as the high microbial diversity and spatial heterogeneity of soils make them unamenable to current assembly platforms. Here, we aimed to improve soil metagenomic sequence assembly by applying the Moleculo synthetic long-read sequencing technology. In total, we obtained 267 Gbp of raw sequence data from a native prairie soil; these data included 109.7 Gbp of short-read data (~100 bp) from the Joint Genome Institute (JGI), an additional 87.7 Gbp of rapid-mode read data (~250 bp), plus 69.6 Gbp (>1.5 kbp) from Moleculo sequencing. The Moleculo data alone yielded over 5,600 reads of >10 kbp in length, and over 95% of the unassembled reads mapped to contigs of >1.5 kbp. Hybrid assembly of all data resulted in more than 10,000 contigs over 10 kbp in length. We mapped three replicate metatranscriptomes derived from the same parent soil to the Moleculo subassembly and found that 95% of the predicted genes, based on their assignments to Enzyme Commission (EC) numbers, were expressed. The Moleculo subassembly also enabled binning of >100 microbial genome bins. We obtained via direct binning the first complete genome, that of "Candidatus Pseudomonas sp. strain JKJ-1" from a native soil metagenome. By mapping metatranscriptome sequence reads back to the bins, we found that several bins corresponding to low-relative-abundance Acidobacteria were highly transcriptionally active, whereas bins corresponding to high-relative-abundance Verrucomicrobia were not. These results demonstrate that Moleculo sequencing provides a significant advance for resolving complex soil microbial communities. IMPORTANCE Soil microorganisms carry out key processes for life on our planet, including cycling of carbon and other nutrients and supporting growth of plants.
However, there is poor molecular-level understanding of their functional roles in ecosystem stability and responses to environmental perturbations. This knowledge gap is largely due to the difficulty in culturing the majority of soil microbes. Thus, use of culture-independent approaches, such as metagenomics, promises the direct assessment of the functional potential of soil microbiomes. Soil is, however, a challenge for metagenomic assembly due to its high microbial diversity and variable evenness, resulting in low coverage and uneven sampling of microbial genomes. Despite increasingly large soil metagenome data volumes (>200 Gbp), the majority of the data do not assemble. Here, we used the cutting-edge approach of synthetic long-read sequencing technology (Moleculo) to assemble soil metagenome sequence data into long contigs and used the assemblies for binning of genomes.
Project description:Here, we report the complete genome sequence of Pseudomonas chilensis strain ABC1, which was isolated from a soil interstitial water sample collected at the University Adolfo Ibañez, Valparaiso, Chile. We assembled PacBio reads into a single closed contig with 209× mean coverage, yielding a 4,035,896-bp sequence with 62% GC content and 3,555 predicted genes.
Project description:High-throughput, culture-independent surveys of bacterial and archaeal communities in soil have illuminated the importance of both edaphic and biotic influences on microbial diversity, yet few studies compare the relative importance of these factors. Here, we employ multiplexed pyrosequencing of the 16S rRNA gene to examine soil- and cactus-associated rhizosphere microbial communities of the Sonoran Desert and the artificial desert biome of the Biosphere2 research facility. The results of our replicate sampling approach show that microbial communities are shaped primarily by soil characteristics associated with geographic locations, while rhizosphere associations are secondary factors. We found little difference between rhizosphere communities of the ecologically similar saguaro (Carnegiea gigantea) and cardón (Pachycereus pringlei) cacti. Both rhizosphere and soil communities were dominated by the disproportionately abundant Crenarchaeota class Thermoprotei, which comprised 18.7% of 183,320 total pyrosequencing reads from a comparatively small number (1,337 or 3.7%) of the 36,162 total operational taxonomic units (OTUs). OTUs common to both soil and rhizosphere samples comprised the bulk of raw sequence reads, suggesting that the shared community of soil and rhizosphere microbes constitute common and abundant taxa, particularly in the bacterial phyla Proteobacteria, Actinobacteria, Planctomycetes, Firmicutes, Bacteroidetes, Chloroflexi, and Acidobacteria. The vast majority of OTUs, however, were rare and unique to either soil or rhizosphere communities and differed among locations dozens of kilometers apart. Several soil properties, particularly soil pH and carbon content, were significantly correlated with community diversity measurements. Our results highlight the importance of culture-independent approaches in surveying microbial communities of extreme environments.
Project description:Purpose: To understand the functional significance of the sperm transcriptome in stallion fertility, this study aimed to generate a detailed body of knowledge about the sperm RNA profile that defines a normal fertile stallion. Methods: The 50 bp single-end ABI SOLiD raw reads were directly aligned with the horse reference sequence EquCab2 using ABI aligner software (NovoalignCS version 1.00.09, novocraft.com), which uses multiple indexes in the reference genome, identifies candidate alignment locations for each primary read, and allows completion of the alignment. Results: Next generation sequencing (NGS) of total RNA from the sperm of two reproductively normal stallions generated about 70 million raw reads and more than 3 Gb of sequence per sample; over half of these aligned with the EquCab2 reference genome. Altogether, 19,257 sequence tags with average coverage ≥1 (normalized number of transcripts) were mapped in the horse genome. Conclusion: The sequence of the stallion sperm transcriptome is an important foundation for the discovery of transcripts of known and novel genes, and non-coding RNAs, thus improving the annotation of the horse genome sequence draft and providing markers for evaluating stallion fertility. Reproductively fertile Stallion sperm transcriptome as revealed by RNA sequencing
Project description:The Mexican axolotl (Ambystoma mexicanum) is a critically endangered species and a fruitful amphibian model for regenerative biology. Despite a growing body of research on the cellular and molecular biology of axolotl limb regeneration, microbiological aspects of this process remain poorly understood. Here, we describe a bacterial 16S rRNA amplicon dataset derived from axolotl limb tissue samples taken in the course of limb regeneration. The raw data were obtained by sequencing the V3-V4 region of the 16S rRNA gene and comprised 14,569,756 paired-end raw reads generated from 21 samples. Initial data analysis using the DADA2 pipeline resulted in an amplicon sequence variant (ASV) table containing a total of ca. 5.9 million chimera-removed, high-quality reads and a median of 296,971 reads per sample. The data constitute a useful resource for research on the microbiological aspects of axolotl limb regeneration and will also broadly facilitate comparative studies in the developmental and conservation biology of this critically endangered species.
Project description:The alignment of DNA sequences to proteins, allowing for frameshifts, is a classic method in sequence analysis. It can help identify pseudogenes (which accumulate mutations), analyze raw DNA and RNA sequence data (which may have frameshift sequencing errors), investigate ribosomal frameshifts, etc. Often, however, only ad hoc approximations or simulations are available to provide the statistical significance of a frameshift alignment score. We describe a method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics. (BLAST presently does not permit its alignments to include frameshifts.) We also illustrate the continuing usefulness of frameshift alignment with two 'post-genomic' applications: (i) when finding pseudogenes within the human genome, frameshift alignments show that most anciently conserved non-coding human elements are recent pseudogenes with conserved ancestral genes; and (ii) when analyzing metagenomic DNA reads from polluted soil, frameshift alignments show that most alignable metagenomic reads contain frameshifts, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results.
Project description:BACKGROUND: Next generation sequencing (NGS) offers a rapid and comprehensive method of screening for mutations associated with retinitis pigmentosa and related disorders. However, certain sequence alterations such as large insertions or deletions may remain undetected using standard NGS pipelines. One such mutation is a recently identified Alu insertion into the Male Germ Cell-Associated Kinase (MAK) gene, which is missed by standard NGS-based variant callers. Here, we developed an in silico method of searching NGS raw sequence reads to detect this mutation, without the need to recalculate sequence alignments or to screen every sample by PCR. METHODS: The Linux program grep was used to search for a 23 bp "probe" sequence containing the known junction sequence of the insert. A corresponding search was performed with the wildtype sequence. The matching reads were counted and further compared to the known sequences of the full wildtype and mutant genomic loci. (See https://github.com/MEEIBioinformaticsCenter/grepsearch.) RESULTS: In a test sample set consisting of eleven previously published homozygous mutants, detection of the MAK-Alu insertion was validated with 100% sensitivity and specificity. As a discovery cohort, raw NGS reads from 1,847 samples (including custom and whole exome selective capture) were searched in ~1 hour on a local computer cluster, yielding an additional five samples with MAK-Alu insertions and solving two previously unsolved pedigrees. Of these, one patient was homozygous for the insertion, one was compound heterozygous with a missense change on the other allele (c.46G>A; p.Gly16Arg), and three were heterozygous carriers. CONCLUSIONS: The MAK-Alu grep program proved to be a rapid and effective method of finding a known, disease-causing Alu insertion in a large cohort of patients with NGS data. This simple approach avoids wet-lab assays or computationally expensive algorithms, and could also be used for other known disease-causing insertions and deletions.
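The probe search described in the METHODS can be sketched in a few lines of Python as a stand-in for grep: count the raw reads that contain a short junction probe, and run the same count with the wildtype probe for comparison. This is only an illustration under stated assumptions, not the authors' program. The function names are hypothetical, any probe passed in is a placeholder rather than the real MAK-Alu junction sequence, and also searching the reverse complement is an assumption made here, since the abstract does not say whether both strands were searched.

```python
from typing import Iterable, Iterator

# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def iter_fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def count_probe_hits(reads: Iterable[str], probe: str) -> int:
    """Count reads containing the probe or its reverse complement.

    Matching is exact and case-insensitive, like a fixed-string grep;
    checking the reverse complement is an assumption of this sketch.
    """
    forward = probe.upper()
    reverse = reverse_complement(forward)
    hits = 0
    for read in reads:
        read = read.upper()
        if forward in read or reverse in read:
            hits += 1
    return hits
```

Comparing the hit count for a mutant-junction probe with the hit count for the corresponding wildtype probe in the same sample gives a rough indication of zygosity, in the spirit of the mutant and wildtype searches described above. Because matching is exact, a sequencing error inside the probe region will cause a read to be missed.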