Dataset Information

Analysis of error profiles in deep next-generation sequencing data.

ABSTRACT:

Background

Sequencing errors are key confounding factors for detecting low-frequency genetic variants that are important for cancer molecular diagnosis, treatment, and surveillance using deep next-generation sequencing (NGS). However, there is a lack of comprehensive understanding of errors introduced at various steps of a conventional NGS workflow, such as sample handling, library preparation, PCR enrichment, and sequencing. In this study, we use current NGS technology to systematically investigate these questions.

Results

By evaluating read-specific error distributions, we discover that the substitution error rate can be computationally suppressed to 10^-5 to 10^-4, which is 10- to 100-fold lower than the rate generally considered achievable (10^-3) in the current literature. We then quantify substitution errors attributable to sample handling, library preparation, enrichment PCR, and sequencing by using multiple deep sequencing datasets. We find that error rates differ by nucleotide substitution type, ranging from 10^-5 for A>C/T>G, C>A/G>T, and C>G/G>C changes to 10^-4 for A>G/T>C changes. Furthermore, C>T/G>A errors exhibit strong sequence context dependency, sample-specific effects dominate elevated C>A/G>T errors, and target-enrichment PCR leads to an approximately 6-fold increase in the overall error rate. We also find that more than 70% of hotspot variants can be detected at frequencies of 0.1% to 0.01% with the current NGS technology by applying in silico error suppression.
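The substitution classes named above pair each change with its reverse complement (for example, A>C with T>G), because the two are indistinguishable once both strands are sequenced. The following is a minimal sketch of how per-class error rates could be tallied from mismatch counts. It is not the authors' pipeline: the function names `collapse` and `class_error_rates`, the input layout, and the choice of denominator (all sequenced bases at reference positions carrying either base of the pair) are assumptions made for illustration.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def collapse(ref, alt):
    """Return the strand-symmetric class label, e.g. ('T', 'G') -> 'A>C/T>G'."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    # Report every class from the A or C side so each pair has a single label.
    if ref in ("A", "C"):
        fwd = (ref, alt)
    else:
        fwd = (COMPLEMENT[ref], COMPLEMENT[alt])
    rev = (COMPLEMENT[fwd[0]], COMPLEMENT[fwd[1]])
    return f"{fwd[0]}>{fwd[1]}/{rev[0]}>{rev[1]}"


def class_error_rates(mismatches, depth):
    """Compute the error rate per collapsed substitution class.

    mismatches: {(ref, alt): number of mismatching base calls}  (hypothetical layout)
    depth:      {ref: total base calls at positions with that reference base}
    """
    errors = Counter()
    for (ref, alt), count in mismatches.items():
        errors[collapse(ref, alt)] += count
    rates = {}
    for label, count in errors.items():
        ref = label[0]  # 'A' or 'C' by construction
        # Assumed denominator: calls at both reference bases of the pair.
        rates[label] = count / (depth[ref] + depth[COMPLEMENT[ref]])
    return rates


if __name__ == "__main__":
    # Made-up counts, chosen only to show the arithmetic.
    toy_mismatches = {("A", "G"): 60, ("T", "C"): 40, ("C", "A"): 5, ("G", "T"): 5}
    toy_depth = {"A": 500_000, "C": 500_000, "G": 500_000, "T": 500_000}
    for label, rate in sorted(class_error_rates(toy_mismatches, toy_depth).items()):
        print(label, rate)
```

The toy counts are invented, not data from the study. Collapsing to six classes halves the number of categories and pools the evidence from both strands, which matters when individual counts are small.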

Conclusions

We present the first comprehensive analysis of sequencing error sources in conventional NGS workflows. The error profiles revealed by our study highlight new directions for further improving NGS analysis accuracy both experimentally and computationally, ultimately enhancing the precision of deep sequencing.

SUBMITTER: Ma X

PROVIDER: S-EPMC6417284 | biostudies-literature | 2019 Mar

REPOSITORIES: biostudies-literature

Publications

Analysis of error profiles in deep next-generation sequencing data.

Ma X, Shao Y, Tian L, Flasch DA, Mulder HL, Edmonson MN, Liu Y, Chen X, Newman S, Nakitandwe J, Li Y, Li B, Shen S, Wang Z, Shurtleff S, Robison LL, Levy S, Easton J, Zhang J

Genome Biology, 2019 Mar 14; 20(1)

PMID: 30867008

Similar Datasets

Project description:

Background: Clinical implementation of Next-Generation Sequencing (NGS) is challenged by poor control for stochastic sampling, library preparation biases and qualitative sequencing error. To address these challenges we developed and tested two hypotheses.

Methods: Hypothesis 1: Analytical variation in quantification is predicted by stochastic sampling effects at input of a) amplifiable nucleic acid target molecules into the library preparation, b) amplicons from library into sequencer, or c) both. We derived equations using Monte Carlo simulation to predict assay coefficient of variation (CV) based on these three working models and tested them against NGS data from specimens with well characterized molecule inputs and sequence counts prepared using competitive multiplex-PCR amplicon-based NGS library preparation method comprising synthetic internal standards (IS). Hypothesis 2: Frequencies of technically-derived qualitative sequencing errors (i.e., base substitution, insertion and deletion) observed at each base position in each target native template (NT) are concordant with those observed in respective competitive synthetic IS present in the same reaction. We measured error frequencies at each base position within amplicons from each of 30 target NT, then tested whether they correspond to those within the 30 respective IS.

Results: For hypothesis 1, the Monte Carlo model derived from both sampling events best predicted CV and explained 74% of observed assay variance. For hypothesis 2, observed frequency and type of sequence variation at each base position within each IS was concordant with that observed in respective NTs (R^2 = 0.93).

Conclusion: In targeted NGS, synthetic competitive IS control for stochastic sampling at input of both target into library preparation and of target library product into sequencer, and control for qualitative errors generated during library preparation and sequencing. These controls enable accurate clinical diagnostic reporting of confidence limits and limit of detection for copy number measurement, and of frequency for each actionable mutation.
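To illustrate the kind of two-stage sampling model that Hypothesis 1 describes, here is a minimal Monte Carlo sketch. It is not the model from that study. It assumes, purely for illustration, that the number of target molecules entering library preparation is Poisson distributed and that the read count from the sequencer is Poisson distributed with mean proportional to the molecules captured. The names `simulate_cv`, `analytic_cv`, `mean_molecules`, and `reads_per_molecule` are hypothetical.

```python
import math
import random
import statistics


def poisson(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_cv(mean_molecules, reads_per_molecule, trials=5000, seed=0):
    """Estimate the read-count CV when both sampling stages are stochastic."""
    rng = random.Random(seed)
    reads = []
    for _ in range(trials):
        # Stage 1 (assumed Poisson): molecules sampled into library preparation.
        molecules = poisson(rng, mean_molecules)
        # Stage 2 (assumed Poisson): reads sampled from the library into the sequencer.
        reads.append(poisson(rng, reads_per_molecule * molecules))
    return statistics.pstdev(reads) / statistics.mean(reads)


def analytic_cv(mean_molecules, reads_per_molecule):
    """Closed-form CV for the same assumed model, from the law of total variance."""
    return math.sqrt(1 / mean_molecules + 1 / (mean_molecules * reads_per_molecule))


if __name__ == "__main__":
    print("simulated CV:", simulate_cv(20, 5))
    print("analytic CV: ", analytic_cv(20, 5))
```

Under these assumptions the squared CV is the sum of one term per sampling stage, 1/mean_molecules + 1/(mean_molecules * reads_per_molecule), so with 20 expected molecules and 5 reads per molecule the analytic CV is sqrt(0.06), about 0.245. The simulated value should land close to it. Dropping either stage from the simulation removes the corresponding term, which is one way to compare the three working models.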
