Inferring short tandem repeat variation from paired-end short reads.
ABSTRACT: Advances in high-throughput sequencing offer an unprecedented opportunity to study genetic variation, but this opportunity is challenged by the difficulty of resolving variant calls in repetitive DNA regions. We present a Bayesian method to estimate repeat-length variation from paired-end sequence read data. The method makes variant calls based on deviations in sequence fragment sizes, allowing the analysis of repeats at lengths relevant to a range of phenotypes. We demonstrate the method's ability to detect and quantify changes in repeat length from short-read genomic sequence data across genotypes. We use the method to estimate repeat variation among 12 strains of Arabidopsis thaliana and demonstrate experimentally that it compares favourably against existing methods. Using this method, we have identified repeats across the genome that are likely to be polymorphic. In addition, our predicted polymorphic repeats include the only known repeat expansion in A. thaliana, suggesting an ability to discover potentially unstable repeats.
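The fragment-size idea can be illustrated with a toy Bayesian update: paired-end fragments spanning a repeat that has expanded by delta bp relative to the reference map with an apparently shorter insert size. Assuming a Gaussian insert-size distribution (mean mu, standard deviation sigma) and a flat prior over candidate length changes, a minimal sketch follows; this is an illustration of the principle, not the paper's actual model, and all names are hypothetical:

```python
import math

def repeat_change_posterior(observed_sizes, mu, sigma, deltas):
    """Posterior (under a flat prior) over candidate repeat-length changes.

    Assumption: a fragment spanning a repeat expanded by `delta` bp maps
    with apparent insert size ~ Normal(mu - delta, sigma), because the
    extra repeat copies are absent from the reference.
    """
    def loglik(delta):
        return sum(-0.5 * ((s - (mu - delta)) / sigma) ** 2
                   for s in observed_sizes)

    logs = [loglik(d) for d in deltas]
    m = max(logs)                      # subtract max for numerical stability
    weights = [math.exp(l - m) for l in logs]
    z = sum(weights)
    return {d: w / z for d, w in zip(deltas, weights)}
```

For example, spanning fragments observed near 290 bp in a 300 +/- 10 bp library would concentrate the posterior on a ~10 bp expansion.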
Project description:BACKGROUND:Copy number variation (CNV) is thought to actively contribute to adaptive evolution of plant species. While many computational algorithms are available to detect copy number variation from whole-genome sequencing datasets, the typical complexity of plant data likely introduces false positive calls. RESULTS:To enable reliable and comprehensive detection of CNV in plant genomes, we developed Hecaton, a novel computational workflow tailored to plants that integrates calls from multiple state-of-the-art algorithms through a machine-learning approach. In this paper, we demonstrate that Hecaton outperforms current methods when applied to short-read sequencing data of Arabidopsis thaliana, rice, maize, and tomato. Moreover, it correctly detects dispersed duplications, a type of CNV commonly found in plant species, in contrast to several state-of-the-art tools that erroneously represent this type of CNV as overlapping deletions and tandem duplications. Finally, Hecaton scales well in terms of memory usage and running time when applied to short-read datasets of domesticated and wild tomato accessions. CONCLUSIONS:Hecaton provides a robust method to detect CNV in plants. We expect it to be of immediate interest to both applied and fundamental research on the relationship between genotype and phenotype in plants.
Project description:BACKGROUND: Recent studies on genome assembly from short-read sequencing data have reported the limitation of this technology in reconstructing the entire genome, even at very high depth of coverage. We investigated this limitation from the perspective of information theory, evaluating the effect of repeats on short-read genome assembly using idealized (error-free) reads of different lengths. METHODOLOGY/PRINCIPAL FINDINGS: We define a metric H(k) as the entropy of sequencing reads at read length k and use the relative loss of entropy ΔH(k) to measure the impact of repeats on reconstruction of the whole genome from sequences of length k. In our experiments, we found that entropy loss correlates well with de novo assembly coverage of a genome, and a score of ΔH(k)>1% indicates a severe loss in genome reconstruction fidelity. The minimal read lengths required to achieve ΔH(k)<1% differ across organisms and are independent of genome size. For example, to meet the threshold of ΔH(k)<1%, a read length of 60 bp is needed for sequencing the human genome (3.2×10^9 bp) and 320 bp for the fruit fly genome (1.8×10^8 bp). We also calculated ΔH(k) scores for 2725 prokaryotic chromosomes and plasmids at several read lengths. Our results indicate that the levels of repeats in different genomes are diverse and that the entropy of sequencing reads provides a measurement of repeat structure. CONCLUSIONS/SIGNIFICANCE: The proposed entropy-based measurement, which can be calculated in seconds to minutes in most cases, provides a rapid quantitative evaluation of the limitation of idealized short-read genome sequencing. Moreover, the calculation can be parallelized to scale up to large eukaryotic genomes. This approach may be useful for tuning sequencing parameters to achieve better genome assemblies when a closely related genome is already available.
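The entropy metric can be sketched in a few lines. Assuming H(k) is the Shannon entropy of the distribution of error-free length-k reads drawn from every genome position, and ΔH(k) is the loss relative to a repeat-free genome (where every read is unique and entropy is log2 of the number of positions) — an interpretation of the abstract's definitions, not the paper's exact formulation:

```python
from collections import Counter
from math import log2

def read_entropy(genome: str, k: int) -> float:
    """Shannon entropy (bits) of the distribution of length-k reads
    drawn uniformly from every position of an error-free genome."""
    n = len(genome) - k + 1
    counts = Counter(genome[i:i + k] for i in range(n))
    return -sum((c / n) * log2(c / n) for c in counts.values())

def relative_entropy_loss(genome: str, k: int) -> float:
    """Relative loss versus a repeat-free genome, in which all length-k
    reads are distinct and the entropy is log2(number of positions)."""
    h_max = log2(len(genome) - k + 1)
    return (h_max - read_entropy(genome, k)) / h_max
```

A genome whose length-k reads are all distinct gives a loss of 0, while a highly repetitive genome gives a loss well above the 1% threshold quoted above.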
Project description:As one of the most studied genome rearrangements, tandem repeats have a considerable impact on the genetic background of inherited diseases. Many methods designed for tandem repeat detection on reference sequences obtain high-quality results. However, in a de novo context, where no reference sequence is available, tandem repeat detection remains a difficult problem. The short reads obtained with second-generation sequencing methods are not long enough to span regions that contain long repeats. This length limitation was tackled by the long reads obtained with third-generation sequencing platforms such as Pacific Biosciences technologies. Nevertheless, the gain in read length came with a significant increase in error rate, and a main objective of current studies on long reads is to handle error rates of up to 16%. In this paper we present MixTaR, the first de novo method for tandem repeat detection that combines the high quality of short reads with the great length of long reads. Our hybrid algorithm uses the set of short reads for tandem repeat pattern detection based on a de Bruijn graph. These patterns are then validated using the long reads, and the tandem repeat sequences are constructed using local greedy assemblies. MixTaR is tested with both simulated and real reads from complex organisms. For a complete analysis of its robustness to errors, we use short and long reads with different error rates. The results are then analysed in terms of the number of tandem repeats detected and the length of their patterns. Our method shows high precision and sensitivity: with low false positive rates even for highly erroneous reads, MixTaR is able to detect accurate tandem repeats with pattern lengths varying over a wide interval.
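The short-read stage rests on a classic observation: a tandem repeat with pattern P traces a cycle through the de Bruijn graph of the reads. A minimal sketch of that idea (a toy illustration of the principle, not MixTaR's algorithm) builds the graph from k-mers and reports short cycles as candidate patterns:

```python
def debruijn_cycles(reads, k, max_len=10):
    """Build a de Bruijn graph of (k-1)-mers from short reads and report
    short cycles; each cycle spells out a candidate tandem-repeat pattern.
    Cycles are canonicalized to their lexicographically smallest rotation
    so the same repeat is reported once."""
    edges = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.setdefault(kmer[:-1], set()).add(kmer[1:])

    cycles = set()

    def walk(start, node, path):
        if len(path) > max_len:
            return
        for nxt in edges.get(node, ()):
            if nxt == start:
                pattern = "".join(p[0] for p in path)
                rotations = {pattern[i:] + pattern[:i]
                             for i in range(len(pattern))}
                cycles.add(min(rotations))
            elif nxt not in path:
                walk(start, nxt, path + [nxt])

    for node in edges:
        walk(node, node, [node])
    return cycles
```

On reads covering a pure (ACG)n repeat with k=4, the only cycle found spells the pattern ACG.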
Project description:Repeat expansions are responsible for over 40 monogenic disorders, and undoubtedly more pathogenic repeat expansions remain to be discovered. Existing methods for detecting repeat expansions in short-read sequencing data require predefined repeat catalogs. Recent discoveries emphasize the need for methods that do not require pre-specified candidate repeats. To address this need, we introduce ExpansionHunter Denovo, an efficient catalog-free method for genome-wide repeat expansion detection. Analysis of real and simulated data shows that our method can identify large expansions of 41 out of 44 pathogenic repeats, including nine recently reported non-reference repeat expansions not discoverable via existing methods.
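A catalog-free signal for large expansions comes from "in-repeat reads": reads made up entirely of tandem copies of a short motif, which accumulate when an expansion exceeds the read length. The detection step can be sketched as follows (an illustration of the idea, not ExpansionHunter Denovo's actual pipeline; the function name is hypothetical):

```python
def smallest_motif(read, max_motif=6):
    """Return the shortest motif (up to max_motif bp) whose tandem copies
    spell out the entire read, or None if the read is not pure repeat.
    Reads with a motif are 'in-repeat reads'; a genome-wide excess of
    them anchored near one locus signals a large repeat expansion."""
    for m in range(1, max_motif + 1):
        motif = read[:m]
        # Tile the motif to the read's length and compare.
        if (motif * (len(read) // m + 1))[:len(read)] == read:
            return motif
    return None
```

For instance, a read consisting only of CAG copies is flagged with motif CAG, while a mixed-sequence read returns None.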
Project description:BACKGROUND: Despite the short length of their reads, micro-read sequencing technologies have shown their usefulness for de novo sequencing. However, especially in eukaryotic genomes, complex repeat patterns are an obstacle to large assemblies. PRINCIPAL FINDINGS: We present a novel heuristic algorithm, Pebble, which uses paired-end read information to resolve repeats and scaffold contigs to produce large-scale assemblies. In simulations, we can achieve weighted median scaffold lengths (N50) of above 1 Mbp in bacteria and above 100 kbp in more complex organisms. Using real datasets we obtained a 96 kbp N50 in Pseudomonas syringae and a unique 147 kbp scaffold of a ferret BAC clone. We also present an efficient algorithm called Rock Band for the resolution of repeats in the case of mixed-length assemblies, where different sequencing platforms are combined to obtain a cost-effective assembly. CONCLUSIONS: These algorithms extend the utility of short-read-only assemblies to large complex genomes. They have been implemented and made available within the open-source Velvet short-read de novo assembler.
Project description:Background: Short tandem repeats are an important source of genetic variation. They are highly mutable, and repeat expansions are associated with dozens of human disorders, such as Huntington's disease and the spinocerebellar ataxias. Technical advances in sequencing technology have made it possible to analyse these repeats at large scale; however, accurate genotyping is still a challenging task. We compared four short tandem repeat genotyping tools on whole exome sequencing data to determine their genotyping performance and limits, which will aid other researchers in choosing a suitable tool and parameters for analysis. Methods: The analysis was performed on the Simons Simplex Collection dataset, where we used a novel method of evaluation with accuracy determined by the rate of homozygous calls on the X chromosome of male samples. In total we analysed 433 samples and around a million genotypes to evaluate the tools on whole exome sequencing data. Results: We found a relatively good performance for all tools when genotyping repeats of 3-6 bp in length, which could be improved with coverage and quality score filtering. However, genotyping homopolymers was challenging for all tools, and a high error rate was present across different thresholds of coverage and quality scores. Interestingly, dinucleotide repeats displayed a high error rate as well, which was found to be mainly caused by AC/TG repeats. Overall, LobSTR was able to make the most calls and was also the fastest tool, while RepeatSeq and HipSTR exhibited the lowest heterozygous error rate at low coverage. Conclusions: All tools have different strengths and weaknesses, and the choice may depend on the application. In this analysis we demonstrated the effect of using different filtering parameters and offered recommendations based on the trade-off between the best genotyping accuracy and the highest number of calls.
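The evaluation trick above exploits hemizygosity: males carry one X chromosome, so any heterozygous STR call on a male X is an error. A minimal sketch of that accuracy estimate (the function name and genotype encoding are illustrative):

```python
def male_x_het_error_rate(calls):
    """Estimate STR genotyping error from male X-chromosome calls.

    calls: list of (allele1, allele2) genotypes, e.g. repeat counts,
    from the X chromosome of male samples. Males are hemizygous, so a
    heterozygous call is necessarily wrong; the heterozygous fraction
    therefore lower-bounds the genotyping error rate."""
    heterozygous = sum(1 for a, b in calls if a != b)
    return heterozygous / len(calls)
```

Filtering the input calls by coverage or quality score before computing this rate reproduces the kind of trade-off analysis described above.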
Project description:BACKGROUND:Repeat elements are important components of most eukaryotic genomes. Most existing tools for repeat analysis rely either on high-quality reference genomes or on existing repeat libraries. Thus, it is still challenging to perform repeat analysis for species with highly repetitive or complex genomes, which often lack good reference genomes or annotated repeat libraries. Recently we developed a computational method called REPdenovo that constructs consensus repeat sequences directly from short sequence reads and outperforms an existing tool called RepARK. One major issue with REPdenovo is that it does not perform well for repeats with relatively high divergence rates or low copy numbers. In this paper, we present an improved approach for constructing consensus repeats directly from short reads. Compared with the original REPdenovo, the improved approach uses more repeat-related k-mers and improves repeat assembly quality using a consensus-based k-mer processing method. RESULTS:We compare the performance of the new method with REPdenovo and RepARK on human, Arabidopsis thaliana, and Drosophila melanogaster short-read sequencing data. The new method fully constructs more Repbase repeats than the original REPdenovo and RepARK, especially repeats with higher divergence rates and lower copy numbers. We also apply the new method to hummingbird data, which has no known repeat library, and it constructs many repeat elements that can be validated using PacBio long reads. CONCLUSION:We propose an improved method for reconstructing repeat elements directly from short sequence reads. The results show that our new method can assemble more complete repeats than REPdenovo (and also RepARK). Our new approach has been implemented as part of the REPdenovo software package, which is available for download at https://github.com/Reedwarbler/REPdenovo.
Project description:BACKGROUND: High-throughput short-read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep-coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short-read sequencing applications, including de novo genome sequencing, genome resequencing, and digital gene expression analysis. Short-read error detection is typically carried out by counting the observed frequencies of k-mers in reads and validating those with frequencies exceeding a threshold. In the case of genomes with high repeat content, an erroneous k-mer may be frequently observed if it has few nucleotide differences from valid k-mers with multiple occurrences in the genome. Error detection and correction have mostly been applied to genomes with low repeat content, and this remains a challenging problem for genomes with high repeat content. RESULTS: We develop a statistical model and a computational method for error detection and correction in the presence of genomic repeats. We propose a method to infer genomic frequencies of k-mers from their observed frequencies by analyzing the misread relationships among observed k-mers. We also propose a method to estimate the threshold used for validating k-mers whose estimated genomic frequency exceeds it. We demonstrate that superior error detection is achieved using these methods. Furthermore, we break away from the common assumption of uniformly distributed errors within a read and provide a framework to model position-dependent error occurrence frequencies common to many short-read platforms. Lastly, we achieve better error correction in genomes with high repeat content. AVAILABILITY: The software is implemented in C++ and is freely available under the GNU GPL3 license and Boost Software V1.0 license at "http://aluru-sun.ece.iastate.edu/doku.php?id=redeem".
CONCLUSIONS: We introduce a statistical framework to model sequencing errors in next-generation reads, which led to promising results in detecting and correcting errors for genomes with high repeat content.
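The baseline scheme that the statistical model above refines — count observed k-mer frequencies and treat rare k-mers as likely errors — can be sketched in a few lines (a toy illustration; the function name and fixed threshold are illustrative, whereas the paper estimates the threshold from the data):

```python
from collections import Counter

def flag_error_kmers(reads, k, threshold):
    """Count k-mer frequencies across reads and flag k-mers observed
    fewer than `threshold` times as likely sequencing errors. This is
    the basic frequency-threshold scheme; it breaks down when repeats
    inflate the counts of near-miss erroneous k-mers."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer for kmer, c in counts.items() if c < threshold}

# Three identical reads plus one with a substitution at the last base.
reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT", "ACGTACGT"]
```

Here the k-mer introduced by the substitution appears only once and is flagged, while k-mers supported by the correct reads pass the threshold.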
Project description:Sickle cell disease is a common hemolytic disorder with a broad range of complications, including vaso-occlusive episodes, acute chest syndrome (ACS), pain, and stroke. Heme oxygenase-1 (gene HMOX1; protein HO-1) is the inducible, rate-limiting enzyme in the catabolism of heme and might attenuate the severity of outcomes from vaso-occlusive and hemolytic crises. A (GT)n dinucleotide repeat located in the promoter region of the HMOX1 gene is highly polymorphic, with long repeat lengths linked to decreased activity and inducibility. We examined this polymorphism to test the hypothesis that short alleles are associated with a decreased risk of adverse outcomes (hospitalization for pain or ACS) among a cohort of 942 children with sickle cell disease. Allele lengths varied from 13 to 45 repeats and showed a trimodal distribution. Compared with children with longer allele lengths, children with 2 shorter alleles (4%; ≤25 repeats) had lower rates of hospitalization for ACS (incidence rate ratio 0.28; 95% confidence interval, 0.10-0.81), after adjusting for sex, age, asthma, percentage of fetal hemoglobin, and α-globin gene deletion. No relationship was identified between allele lengths and pain rate. We provide evidence that genetic variation in HMOX1 is associated with decreased rates of hospitalization for ACS, but not pain. This study is registered at www.clinicaltrials.gov as #NCT00072761.
Project description:Short tandem repeats (STRs) are hyper-mutable sequences in the human genome. They are often used in forensics and population genetics and are also the underlying cause of many genetic diseases. There are challenges associated with accurately determining the length polymorphism of STR loci in the genome by next-generation sequencing (NGS). In particular, accurate detection of pathological STR expansion is limited by the sequence read length during whole-genome analysis. We developed TREDPARSE, a software package that incorporates various cues from read alignment and paired-end distance distribution, as well as a sequence stutter model, in a probabilistic framework to infer repeat sizes for genetic loci, and we used this software to infer repeat sizes for 30 known disease loci. Using simulated data, we show that TREDPARSE outperforms other available software. We sampled the full genome sequences of 12,632 individuals to an average read depth of approximately 30× to 40× with Illumina HiSeq X. We identified 138 individuals with risk alleles at 15 STR disease loci. We validated a representative subset of the samples (n = 19) by Sanger and Oxford Nanopore sequencing. Additionally, we validated the STR calls against known allele sizes in a set of GeT-RM reference cell-line materials (n = 6). Several STR loci that are entirely guanine or cytosine (G or C) have insufficient read evidence for inference and therefore could not be assayed precisely by TREDPARSE. TREDPARSE extends the limit of STR size detection beyond the physical sequence read length. This extension is critical because many disease risk cutoffs are close to or beyond the short sequence read length of 100 to 150 bases.