
Dataset Information


Genotype calling from next-generation sequencing data using haplotype information of reads.


ABSTRACT: Low-coverage sequencing provides an economical strategy for whole genome sequencing. When sequencing a set of individuals, genotype calling can be challenging because of the low sequencing coverage. Linkage disequilibrium (LD)-based refinement of genotype calls is essential to improve accuracy. Current LD-based methods use read counts or genotype likelihoods at individual potential polymorphic sites (PPSs). Reads that span multiple PPSs (jumping reads) can provide additional haplotype information that current methods overlook. In this article, we introduce a new Hidden Markov Model (HMM)-based method that takes into account jumping-read information across adjacent PPSs, and we implement it in the HapSeq program. Our method extends the HMM in Thunder and explicitly models jumping-read information as emission probabilities conditional on the states of adjacent PPSs. Our simulation results show that, compared to Thunder, HapSeq reduces the genotyping error rate by 30%, from 0.86% to 0.60%. The results from the 1000 Genomes Project show that HapSeq reduces the genotyping error rate by 12% and 9%, from 2.24% and 2.76% to 1.97% and 2.50%, for individuals with European and African ancestry, respectively. We expect that our program can improve the genotyping quality of the large number of ongoing and planned whole genome sequencing projects.

Contact: dzhi@ms.soph.uab.edu; kzhang@ms.soph.uab.edu

The software package HapSeq and its manual can be found and downloaded at www.ssg.uab.edu/hapseq/. Supplementary data are available at Bioinformatics online.
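The idea that a jumping read carries haplotype (phase) information can be illustrated with a toy calculation. The sketch below is not the HapSeq or Thunder model; it is a minimal example under assumed conditions: two adjacent biallelic PPSs (alleles coded 0/1), a diploid individual, a read drawn from either haplotype with probability 1/2, and an independent per-base error rate `err`. The function names `joint_read_prob` and `per_site_read_prob` are hypothetical and chosen only for this illustration.

```python
def base_prob(observed, true_allele, err):
    """Probability of observing a base given the true allele (assumed symmetric error)."""
    return 1.0 - err if observed == true_allele else err


def joint_read_prob(read, hap_pair, err=0.01):
    """P(read | haplotype pair) when the read is treated as one unit spanning both sites.

    The read is assumed to come from either haplotype with probability 1/2,
    so its alleles at the two sites are linked through that haplotype.
    """
    total = 0.0
    for hap in hap_pair:
        p = 0.5
        for obs, allele in zip(read, hap):
            p *= base_prob(obs, allele, err)
        total += p
    return total


def per_site_read_prob(read, hap_pair, err=0.01):
    """P(read | haplotype pair) when each site is scored independently.

    This mimics using only per-site information: the link between the
    two observed alleles is ignored.
    """
    p = 1.0
    for site, obs in enumerate(read):
        p *= sum(0.5 * base_prob(obs, hap[site], err) for hap in hap_pair)
    return p


# Two phasings of the same double-heterozygous genotype.
cis = ((0, 0), (1, 1))
trans = ((0, 1), (1, 0))
read = (0, 0)  # one read covering both sites, showing allele 0 at each

for name, haps in (("cis", cis), ("trans", trans)):
    print(f"{name}: joint={joint_read_prob(read, haps):.4f} "
          f"per-site={per_site_read_prob(read, haps):.4f}")
```

With these assumed settings, the per-site score is about 0.25 for both phasings, so it cannot tell them apart, while the joint score is about 0.4901 for the cis phasing and about 0.0099 for the trans phasing. That difference is the kind of extra information the abstract attributes to jumping reads.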

SUBMITTER: Zhi D 

PROVIDER: S-EPMC3493122 | biostudies-literature | 2012 Apr

REPOSITORIES: biostudies-literature


Publications

Genotype calling from next-generation sequencing data using haplotype information of reads.

Zhi D, Wu J, Liu N, Zhang K

Bioinformatics (Oxford, England), 2012 Jan 27, issue 7



Similar Datasets

| S-EPMC5582667 | biostudies-literature
| S-EPMC3777110 | biostudies-literature
| S-EPMC3907006 | biostudies-literature
| S-EPMC2971572 | biostudies-literature
| S-EPMC5907718 | biostudies-literature
| S-EPMC3404070 | biostudies-literature
| S-EPMC5564424 | biostudies-literature
| S-EPMC5324109 | biostudies-literature
| S-EPMC3791270 | biostudies-literature