Browse
Submit Data
Databases
API
Help

Dataset Information

0 Views

0 Connections

0 Citations

0 Reanalyses

0 Downloads

Omics score: 0

STAT: a fast, scalable, MinHash-based k-mer tool to assess Sequence Read Archive next-generation sequence submissions.

ABSTRACT: Sequence Read Archive submissions to the National Center for Biotechnology Information often lack useful metadata, which limits the utility of these submissions. We describe the Sequence Taxonomic Analysis Tool (STAT), a scalable k-mer-based tool for fast assessment of taxonomic diversity intrinsic to submissions, independent of metadata. We show that our MinHash-based k-mer tool is accurate and scalable, offering reliable criteria for efficient selection of data for further analysis by the scientific community, at once validating submissions while also augmenting sample metadata with reliable, searchable, taxonomic terms.

SUBMITTER: Katz KS

PROVIDER: S-EPMC8450716 | biostudies-literature | 2021 Sep

REPOSITORIES: biostudies-literature

ACCESS DATA

Json Xml

Publications

STAT: a fast, scalable, MinHash-based k-mer tool to assess Sequence Read Archive next-generation sequence submissions.

Katz Kenneth S KS Shutov Oleg O Lapoint Richard R Kimelman Michael M Brister J Rodney JR O'Sullivan Christopher C

Genome biology 20210920 1

Sequence Read Archive submissions to the National Center for Biotechnology Information often lack useful metadata, which limits the utility of these submissions. We describe the Sequence Taxonomic Analysis Tool (STAT), a scalable k-mer-based tool for fast assessment of taxonomic diversity intrinsic to submissions, independent of metadata. We show that our MinHash-based k-mer tool is accurate and scalable, offering reliable criteria for efficient selection of data for further analysis by the scie ...[more]

PMID: 34544477

Similar Datasets

Extracting allelic read counts from 250,000 human sequencing runs in Sequence Read Archive.

Project description:The Sequence Read Archive (SRA) contains over one million publicly available sequencing runs from various studies using a variety of sequencing library strategies. These data inherently contain information about underlying genomic sequence variants which we exploit to extract allelic read counts on an unprecedented scale. We reprocessed over 250,000 human sequencing runs (>1000 TB data worth of raw sequence data) into a single unified dataset of allelic read counts for nearly 300,000 variants of biomedical relevance curated by NCBI dbSNP, where germline variants were detected in a median of 912 sequencing runs, and somatic variants were detected in a median of 4,876 sequencing runs, suggesting that this dataset facilitates identification of sequencing runs that harbor variants of interest. Allelic read counts obtained using a targeted alignment were very similar to read counts obtained from whole-genome alignment. Analyzing allelic read count data for matched DNA and RNA samples from tumors, we find that RNA-seq can also recover variants identified by Whole Exome Sequencing (WXS), suggesting that reprocessed allelic read counts can support variant detection across different library strategies in SRA. This study provides a rich database of known human variants across SRA samples that can support future meta-analyses of human sequence variation.

| S-EPMC6415672 | biostudies-literature

VARUS: sampling complementary RNA reads from the sequence read archive.

Project description:BACKGROUND:Vast amounts of next generation sequencing RNA data has been deposited in archives, accompanying very diverse original studies. The data is readily available also for other purposes such as genome annotation or transcriptome assembly. However, selecting a subset of available experiments, sequencing runs and reads for this purpose is a nontrivial task and complicated by the inhomogeneity of the data. RESULTS:This article presents the software VARUS that selects, downloads and aligns reads from NCBI's Sequence Read Archive, given only the species' binomial name and genome. VARUS automatically chooses runs from among all archived runs to randomly select subsets of reads. The objective of its online algorithm is to cover a large number of transcripts adequately when network bandwidth and computing resources are limited. For most tested species VARUS achieved both a higher sensitivity and specificity with a lower number of downloaded reads than when runs were manually selected. At the example of twelve eukaryotic genomes, we show that RNA-Seq that was sampled with VARUS is well-suited for fully-automatic genome annotation with BRAKER. CONCLUSIONS:With VARUS, genome annotation can be automatized to the extent that not even the selection and quality control of RNA-Seq has to be done manually. This introduces the possibility to have fully automatized genome annotation loops over potentially many species without incurring a loss of accuracy over a manually supervised annotation process.

| S-EPMC6842140 | biostudies-literature

MetaSRA: normalized human sample-specific metadata for the Sequence Read Archive.

Project description:MotivationThe NCBI's Sequence Read Archive (SRA) promises great biological insight if one could analyze the data in the aggregate; however, the data remain largely underutilized, in part, due to the poor structure of the metadata associated with each sample. The rules governing submissions to the SRA do not dictate a standardized set of terms that should be used to describe the biological samples from which the sequencing data are derived. As a result, the metadata include many synonyms, spelling variants and references to outside sources of information. Furthermore, manual annotation of the data remains intractable due to the large number of samples in the archive. For these reasons, it has been difficult to perform large-scale analyses that study the relationships between biomolecular processes and phenotype across diverse diseases, tissues and cell types present in the SRA.ResultsWe present MetaSRA, a database of normalized SRA human sample-specific metadata following a schema inspired by the metadata organization of the ENCODE project. This schema involves mapping samples to terms in biomedical ontologies, labeling each sample with a sample-type category, and extracting real-valued properties. We automated these tasks via a novel computational pipeline.Availability and implementationThe MetaSRA is available at metasra.biostat.wisc.edu via both a searchable web interface and bulk downloads. Software implementing our computational pipeline is available at http://github.com/deweylab/metasra-pipeline.Contactcdewey@biostat.wisc.edu.Supplementary informationSupplementary data are available at Bioinformatics online.

| S-EPMC5870770 | biostudies-literature

Major submissions tool developments at the European Nucleotide Archive.

Project description:The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena), Europe's primary nucleotide sequence resource, captures and presents globally comprehensive nucleic acid sequence and associated information. Covering the spectrum from raw data to assembled and functionally annotated genomes, the ENA has witnessed a dramatic growth resulting from advances in sequencing technology and ever broadening application of the methodology. During 2011, we have continued to operate and extend the broad range of ENA services. In particular, we have released major new functionality in our interactive web submission system, Webin, through developments in template-based submissions for annotated sequences and support for raw next-generation sequence read submissions.

| S-EPMC3245037 | biostudies-literature

Remapping the SRA: Drosophila melanogaster RNA-Seq data from the Sequence Read Archive

Project description:The sequence read archive (SRA) contains over 52 terabases or 482 billion reads from Drosophila melanogaster (as of June 2018). These data are massively underused by the community and include 14,423 RNA-Seq samples, that is roughly 7 times the size of modENCODE. Currently the major challenge is finding high quality datasets that are suitable for inclusion in new studies. To help the community overcome this hurdle, we re-processed all D. melanogaster RNA-Seq SRA experiments (SRXs) using an identical workflow. This workflow uses a data driven approach to identify technical metadata (i.e., strandedness and layout) for each sample in order to optimize mapping parameters. The workflow generates QC metrics, coverage tracks based on the dm6 assembly, and calculates gene level, junction level, and intergenic counts against FlyBase r6.11. This resource will allow any researcher to visualize browser tracks for any publicly available dataset, quickly identify high quality data sets for use in their own research, and download identically processed counts tables. There is a treasure trove of underused data sitting in the SRA and this work addresses the first challenge to make data integration a common laboratory practice.

2018-07-18 | GSE117217 | GEO

Jupyter notebook-based tools for building structured datasets from the Sequence Read Archive.

Project description:The Sequence Read Archive (SRA) is a large public repository that stores raw next-generation sequencing data from thousands of diverse scientific investigations. Despite its promise, reuse and re-analysis of SRA data has been challenged by the heterogeneity and poor quality of the metadata that describe its biological samples. Recently, the MetaSRA project standardized these metadata by annotating each sample with terms from biomedical ontologies. In this work, we present a pair of Jupyter notebook-based tools that utilize the MetaSRA for building structured datasets from the SRA in order to facilitate secondary analyses of the SRA's human RNA-seq data. The first tool, called the Case-Control Finder, finds suitable case and control samples for a given disease or condition where the cases and controls are matched by tissue or cell type. The second tool, called the Series Finder, finds ordered sets of samples for the purpose of addressing biological questions pertaining to changes over a numerical property such as time. These tools were the result of a three-day-long NCBI Codeathon in March 2019 held at the University of North Carolina at Chapel Hill.

| S-EPMC7445559 | biostudies-literature

<i>QuickDeconvolution</i>: fast and scalable deconvolution of linked-read sequencing data.

Project description:MotivationRecently introduced, linked-read technologies, such as the 10× chromium system, use microfluidics to tag multiple short reads from the same long fragment (50-200 kb) with a small sequence, called a barcode. They are inexpensive and easy to prepare, combining the accuracy of short-read sequencing with the long-range information of barcodes. The same barcode can be used for several different fragments, which complicates the analyses.ResultsWe present QuickDeconvolution (QD), a new software for deconvolving a set of reads sharing a barcode, i.e. separating the reads from the different fragments. QD only takes sequencing data as input, without the need for a reference genome. We show that QD outperforms existing software in terms of accuracy, speed and scalability, making it capable of deconvolving previously inaccessible data sets. In particular, we demonstrate here the first example in the literature of a successfully deconvoluted animal sequencing dataset, a 33-Gb Drosophila melanogaster dataset. We show that the taxonomic assignment of linked reads can be improved by deconvoluting reads with QD before taxonomic classification.Availability and implementationCode and instructions are available on https://github.com/RolandFaure/QuickDeconvolution.Supplementary informationSupplementary data are available at Bioinformatics Advances online.

| S-EPMC9710601 | biostudies-literature

Canu: scalable and accurate long-read assembly via adaptive <i>k</i>-mer weighting and repeat separation.

Project description:Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.

| S-EPMC5411767 | biostudies-literature

Scalable phylogenetic profiling using MinHash uncovers likely eukaryotic sexual reproduction genes.

Project description:Phylogenetic profiling is a computational method to predict genes involved in the same biological process by identifying protein families which tend to be jointly lost or retained across the tree of life. Phylogenetic profiling has customarily been more widely used with prokaryotes than eukaryotes, because the method is thought to require many diverse genomes. There are now many eukaryotic genomes available, but these are considerably larger, and typical phylogenetic profiling methods require at least quadratic time as a function of the number of genes. We introduce a fast, scalable phylogenetic profiling approach entitled HogProf, which leverages hierarchical orthologous groups for the construction of large profiles and locality-sensitive hashing for efficient retrieval of similar profiles. We show that the approach outperforms Enhanced Phylogenetic Tree, a phylogeny-based method, and use the tool to reconstruct networks and query for interactors of the kinetochore complex as well as conserved proteins involved in sexual reproduction: Hap2, Spo11 and Gex1. HogProf enables large-scale phylogenetic profiling across the three domains of life, and will be useful to predict biological pathways among the hundreds of thousands of eukaryotic species that will become available in the coming few years. HogProf is available at https://github.com/DessimozLab/HogProf.

| S-EPMC7423146 | biostudies-literature

Mash: fast genome and metagenome distance estimation using MinHash.

Project description:Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU h; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition. Mash is freely released under a BSD license ( https://github.com/marbl/mash ).

| S-EPMC4915045 | biostudies-literature

OmicsDI is part of the ELIXIR infrastructure

OmicsDI is an Elixir interoperability service. Learn more ›

Tweets

OmicsDI Databases

PRIDE
PeptideAtlas
MassIVE
JPOST Repository
Physiome Model Repository

EGA
EVA
ENA
LINCS
PAXDB
Cell Collective

MetaboLights
Metabolomics Workbench
MetabolomeExpress
GNPS
BioModels
FAIRDOMHub

ArrayExpress
dbGaP
ExpressionAtlas
GEO
NODE

Information

Databases
Help
API
Contact us
Code on GitHub
Terms of use
Submit Data