Dataset Information

Comparison of systematic sequencing errors using spike-in standards

ABSTRACT: While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a training data set, which is typically either from a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 units, and by as much as 13 units at CpG sites. In addition, since reads mapping to the genome are not used for recalibration, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has less dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration.

ORGANISM(S): Homo sapiens

PROVIDER: GSE36217 | GEO | 2012/03/03

SECONDARY ACCESSION(S): PRJNA153117

REPOSITORIES: GEO

ACCESS DATA

Dataset's files

Source:

			Action	DRS
		Other

Items per page:

1 - 1 of 1

Similar Datasets

Project description:While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a training data set, which is typically either from a part of the data set being M-bM-^@M-^\recalibratedM-bM-^@M-^] (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 units, and by as much as 13 units M-BM- at CpG sites. In addition, since reads mapping to the genome are not used for recalibration, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has less dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration. Four human RNA samples with equimolar ERCC spike-in standards were sequenced on Illumina. Two human brain/liver/muscle RNA mixtures with dynamic range of ERCC spike-in standards were sequenced on SOLiD.

Project description:Next generation sequencing platforms have become essential tools for understanding DNA in a wide range of contexts. Their success heavily relies on the accuracy, sensitivity and specificity of methods used to discern differences between the reference genome and genomes under investigation. Here we compare the relative performances of five popular single nucleotide variant callers with and without their associated recommended hard filtering criteria. We compare: FreeBayes; the Genome Analysis Toolkit’s Haplotype Caller and Unified Genotyper; SAMtools; and VarScan. We tailor this comparison to suit smaller projects with modest sample numbers (n = 10) and coverage (~10X) to fill a current gap in the literature. Other comparison studies are generally applicable only to larger projects in model species, where there is access to large amounts of sequencing data and curated callsets for base and variant quality score recalibration. We estimated the accuracy, sensitivity and specificity of each pipeline according to the genotype concordance rate and number with the “truth” dataset for 10 canine samples. The truth dataset was defined as genotypes obtained from the CanineHD BeadChip array. Whole genome sequencing data was performed on the Illumina HiSeq2000 or HiSeq2500 platform as 100-101 base pair, paired end reads to an average sample coverage of 10.3X. Apart from GATK Haplotype Caller, applying recommended hard filters did not improve the performance of genotyping concordance at the tested levels of minimum coverage. The default VarScan pipeline with no additional filters applied (VarScan uses SAMtools mpileup, without base alignment quality computation) generally outperformed other callers in terms of accuracy, sensitivity and specificity. The results of this study demonstrate that hard filtering of variant calls from low-powered genome studies can impair accuracy, sensitivity and specificity of callsets and provides some benchmark performance metrics on a range of low coverage levels.

Dataset Information

Comparison of systematic sequencing errors using spike-in standards

Dataset's files

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets