Project description:The purpose of this work was to describe a computational and analytical methodology for profiling small RNA by high-throughput sequencing. The datasets here were used to develop synthetic oligoribonucleotides as spike-in standards.
2009-06-03 | GSE14695 | GEO
Project description:Synthetic Spike-In sequencing using R2C2
Project description:While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a training data set, which is typically either from a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 units, and by as much as 13 units at CpG sites. In addition, since reads mapping to the genome are not used for recalibration, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has fewer dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration. Four human RNA samples with equimolar ERCC spike-in standards were sequenced on Illumina.
Two human brain/liver/muscle RNA mixtures with dynamic range of ERCC spike-in standards were sequenced on SOLiD.
Project description:The purpose of this work was to describe a computational and analytical methodology for profiling small RNA by high-throughput sequencing. The datasets here were used to develop synthetic oligoribonucleotides as spike-in standards. We assessed the use of synthetic oligoribonucleotide standards as spike-in controls. These standards can be used to set an objective standard against which to compare samples. Standards were added to the total RNA (100 µg) in the following amounts: Std2 (TATATGCAAGTCCGGCCATAC) 0.01 pmol, Std3 (TAGCTAACGCATATCCGCATC) 0.1 pmol, Std6 (TGAAGCTGACATCGGTCATCC) 1.0 pmol.
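As an illustration of how three spike-ins at 10-fold steps can serve as an objective reference, the following minimal sketch fits a log-log standard curve from read counts to the amounts listed above and inverts it to estimate an unknown amount. The spike-in amounts come from the description; the read counts, function names, and the choice of a simple least-squares fit are assumptions made for illustration and are not the authors' method.

```python
import math

# Spike-in amounts from the description above (pmol added to the total RNA).
SPIKE_PMOL = {"Std2": 0.01, "Std3": 0.1, "Std6": 1.0}

AVOGADRO = 6.02214076e23  # molecules per mole


def pmol_to_molecules(pmol):
    """Convert picomoles to a number of molecules."""
    return pmol * 1e-12 * AVOGADRO


def fit_log_log(counts, amounts):
    """Least-squares fit of log10(count) = slope * log10(amount) + intercept."""
    names = sorted(amounts)
    xs = [math.log10(amounts[name]) for name in names]
    ys = [math.log10(counts[name]) for name in names]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_pmol(count, slope, intercept):
    """Invert the fitted curve to estimate the amount behind a read count."""
    return 10 ** ((math.log10(count) - intercept) / slope)


# HYPOTHETICAL read counts, invented for illustration only.
example_counts = {"Std2": 120, "Std3": 1200, "Std6": 12000}
slope, intercept = fit_log_log(example_counts, SPIKE_PMOL)
```

A slope close to 1 on the log-log scale would indicate that read counts scale proportionally with input amount across the 100-fold range covered by the three standards; `estimate_pmol` can then place a sample's read count on the same scale.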
Project description:The phi X 174 bacteriophage was first sequenced in 1977, and has since become the most widely used standard in molecular biology and next-generation sequencing. However, with the advent of affordable DNA synthesis and de novo gene design, we considered whether we could engineer a synthetic genome, termed SynX, specifically tailored for use as a universal molecular standard. The SynX genome encodes 21 synthetic genes that can be in vitro transcribed to generate matched mRNA controls, and in vitro translated to generate matched protein controls. This enables the use of SynX as a matched control to compare across genomic, transcriptomic and proteomic experiments. The synthetic genes provide qualitative controls that measure sequencing accuracy across k-mers, GC-rich and repeat sequences, as well as act as quantitative controls that measure sensitivity and quantitative accuracy. We show how the SynX genome can measure DNA sequencing, evaluate gene expression in RNA sequencing experiments, or quantify proteins in mass spectrometry. Unlike previous spike-in controls, the SynX DNA, RNA and protein controls can be independently and sustainably prepared by recipient laboratories using common molecular biology techniques, and widely shared as a universal molecular standard.
Project description:While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a training data set, which is typically either from a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 units, and by as much as 13 units at CpG sites. In addition, since reads mapping to the genome are not used for recalibration, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has fewer dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration.
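The idea behind spike-in recalibration can be sketched in a few lines: because the spike-in reference sequence is known, every mismatch in a read aligned to it can be counted as a sequencing error, and the empirical Phred quality for a context (here, a dinucleotide) is Q = -10 log10(error rate). The sketch below is not GATK's implementation; the function names, the input tuple layout, the add-one smoothing, and the example data are assumptions made for illustration.

```python
import math
from collections import defaultdict


def phred(error_probability):
    """Phred-scaled quality: Q = -10 * log10(P_error)."""
    return -10.0 * math.log10(error_probability)


def empirical_quality_by_dinucleotide(observations):
    """Tabulate reported versus empirical quality per dinucleotide context.

    `observations` is an iterable of (dinucleotide, reported_q, is_mismatch)
    tuples taken from reads aligned to spike-in sequences.
    Returns {dinucleotide: (mean_reported_q, empirical_q)}.
    """
    # Per context: [bases seen, mismatches seen, sum of reported Q]
    totals = defaultdict(lambda: [0, 0, 0.0])
    for dinuc, reported_q, is_mismatch in observations:
        entry = totals[dinuc]
        entry[0] += 1
        entry[1] += int(is_mismatch)
        entry[2] += reported_q
    table = {}
    for dinuc, (n_bases, mismatches, q_sum) in totals.items():
        # Add-one smoothing (an assumption here) keeps Q finite at zero mismatches.
        error_rate = (mismatches + 1) / (n_bases + 2)
        table[dinuc] = (q_sum / n_bases, phred(error_rate))
    return table


# HYPOTHETICAL data: 998 "GG" bases reported at Q30, 9 of them mismatched.
example = [("GG", 30, i < 9) for i in range(998)]
table = empirical_quality_by_dinucleotide(example)
mean_reported_q, empirical_q = table["GG"]
```

In this invented example the reported quality (Q30) exceeds the empirical quality (about Q20), which is the kind of overestimation the description reports for certain dinucleotides on Illumina runs.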
Project description:We use targeted bisulfite PCR and next-generation 454 sequencing of multiple amplicons to analyze the association of cis-regulated allele-specific methylation (ASM) with multiple complex disease-associated variants in a population of 82 individuals. We detect ASM at four variants implicated in complex phenotypes such as ulcerative colitis and AIDS progression disease (rs10491434), Celiac disease (rs2762051), Crohn’s disease, IgA nephropathy and early-onset inflammatory bowel disease (rs713875) and height (rs6569648). 82 samples analysed
Project description:A highly complex set of 264 molecular spikes, based on 11 unique spike sequences spanning different lengths (570 to 3070 nts) and GC contents (40-60%), was designed. To allow precise evaluation of quantification over different expression levels, transcript lengths and GC contents, barcodes of 7 nucleotides in 2-fold abundance steps were cloned into each spike sequence (12 steps in duplicate; 24 barcodes per sequence), creating a standard curve for each spike sequence. To determine the molecular abundance of each of the 264 molecular spike-ins (i.e., the ‘ground truth’), we performed exhaustive sequencing across the spike barcodes and spUMIs and determined the total complexity in the pool to be 76 million unique molecules.
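The arithmetic of this design can be checked with a short sketch: 11 spike sequences, each carrying 12 two-fold abundance steps in duplicate, give 24 barcodes per sequence and 264 spikes in total, with a 2^11 = 2048-fold range between the lowest and highest step. The field names and the convention that step 0 is the lowest abundance are assumptions made for illustration; the actual barcode sequences and measured abundances are not reproduced here.

```python
N_SEQUENCES = 11   # unique spike sequences
N_STEPS = 12       # 2-fold abundance steps per sequence
N_REPLICATES = 2   # each step is present in duplicate


def build_design():
    """Enumerate every (spike, step, replicate) with its relative abundance."""
    design = []
    for spike in range(1, N_SEQUENCES + 1):
        for step in range(N_STEPS):
            for replicate in range(1, N_REPLICATES + 1):
                design.append({
                    "spike": spike,
                    "step": step,
                    "replicate": replicate,
                    # Relative to the lowest step (step 0, assumed convention).
                    "relative_abundance": 2 ** step,
                })
    return design


def expected_fractions(design, spike):
    """Expected share of one spike's molecules carried by each of its barcodes."""
    rows = [row for row in design if row["spike"] == spike]
    total = sum(row["relative_abundance"] for row in rows)
    return [row["relative_abundance"] / total for row in rows]


design = build_design()
barcodes_per_sequence = N_STEPS * N_REPLICATES
dynamic_range = 2 ** (N_STEPS - 1)
fractions = expected_fractions(design, spike=1)
```

Comparing observed barcode fractions within a spike against `fractions` is one simple way to read the per-sequence standard curve, assuming the nominal 2-fold steps hold; the description notes that the true abundances were instead determined empirically by exhaustive sequencing.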
Project description:The spike protein of SARS-CoV-2, the virus responsible for the global pandemic of COVID-19, is an abundant, heavily glycosylated surface protein that plays a key role in receptor binding and host cell fusion, and is the focus of all current vaccine development efforts. Variants of concern are now circulating worldwide that exhibit mutations in the spike protein. Protein sequence and glycosylation variations of the spike may affect viral fitness, antigenicity, and immune evasion. Global surveillance of the virus currently involves genome sequencing, but tracking emerging variants should include quantitative measurement of changes in site-specific glycosylation as well. In this work, we used data-dependent acquisition (DDA) and data-independent acquisition (DIA) mass spectrometry to quantitatively characterize the five N-linked glycosylation sites of the glycoprotein standard alpha-1-acid glycoprotein (AGP), as well as the 22 sites of SARS-CoV-2 spike protein. We found that DIA compared favorably to DDA in sensitivity, resulting in more assignments of low-abundance glycopeptides. However, the reproducibility across replicates of DIA-identified glycopeptides was lower than that of DDA, possibly because low-abundance glycopeptides are difficult to assign confidently. The differences in the data acquired between the two methods suggest that DIA outperforms DDA in terms of glycoprotein coverage but that overall performance is a balance of sensitivity, selectivity, and statistical confidence in glycoproteomics. We assert that these analytical and bioinformatics methods for assigning and quantifying glycoforms would benefit both the tracking of viral variants and vaccine development.