ABSTRACT: Motivation
Reliable estimation of the mean fragment length for next-generation short-read sequencing data is an important step in next-generation sequencing analysis pipelines, most notably because of its impact on the accuracy of the enriched regions identified by peak-calling algorithms. Although many peak-calling algorithms include a fragment-length estimation subroutine, the problem has not been adequately solved, as demonstrated by the variability of the estimates returned by different algorithms.
Results
In this article, we investigate the use of strand cross-correlation to estimate mean fragment length of single-end data and show that traditional estimation approaches have mixed reliability. We observe that the mappability of different parts of the genome can introduce an artificial bias into cross-correlation computations, resulting in incorrect fragment-length estimates. We propose a new approach, called mappability-sensitive cross-correlation (MaSC), which removes this bias and allows for accurate and reliable fragment-length estimation. We analyze the computational complexity of this approach, and evaluate its performance on a test suite of NGS datasets, demonstrating its superiority to traditional cross-correlation analysis.
Availability
An open-source Perl implementation of our approach is available at http://www.perkinslab.ca/Software.html.
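The strand cross-correlation idea described under Results can be illustrated with a minimal Python sketch. This is not the authors' Perl implementation. The function names (`cross_correlation`, `estimate_fragment_length`), the per-position read-start count lists, and the single boolean `mappable` track are all assumptions made for illustration. The naive estimate picks the shift that maximizes the Pearson correlation between forward-strand and reverse-strand read-start counts. The mappability-sensitive variant, as assumed here, restricts each correlation to positions that are mappable at both ends of the shift.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cross_correlation(fwd, rev, shift, mappable=None):
    """Correlate forward counts at i with reverse counts at i + shift.

    fwd, rev : per-position read-start counts on each strand (hypothetical input)
    mappable : optional list of booleans; when given, only positions where both
               i and i + shift are mappable contribute (a simplifying assumption:
               real mappability depends on read length and strand).
    """
    positions = range(len(fwd) - shift)
    if mappable is not None:
        positions = [i for i in positions if mappable[i] and mappable[i + shift]]
    xs = [fwd[i] for i in positions]
    ys = [rev[i + shift] for i in positions]
    return pearson(xs, ys)


def estimate_fragment_length(fwd, rev, max_shift, mappable=None):
    """Return the shift in 0..max_shift with the highest cross-correlation."""
    shifts = range(max_shift + 1)
    return max(shifts, key=lambda d: cross_correlation(fwd, rev, d, mappable))


# Toy data: every forward read start has a reverse partner 5 positions downstream.
n = 60
fwd = [0] * n
rev = [0] * n
for start in (3, 10, 11, 25, 40, 41, 42):
    fwd[start] += 1
    rev[start + 5] += 1

print(estimate_fragment_length(fwd, rev, max_shift=15))  # prints 5
```

Passing a `mappable` list leaves the interface unchanged and only narrows the set of positions used for each shift. The exact offset between the best shift and the fragment length depends on the coordinate convention used for read starts, which this sketch does not settle.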
SUBMITTER: Ramachandran P
PROVIDER: S-EPMC3570216 | biostudies-literature | 2013 Feb
REPOSITORIES: biostudies-literature

Parameswaran Ramachandran, Gareth A. Palidwor, Christopher J. Porter, Theodore J. Perkins
Bioinformatics (Oxford, England), 2013-01-07, issue 4