SVachra: a tool to identify genomic structural variation in mate pair sequencing data containing inward and outward facing reads.
ABSTRACT: Characterization of genomic structural variation (SV) is essential to expanding the research and clinical applications of genome sequencing. Reliance upon short DNA fragment paired end sequencing has yielded a wealth of single nucleotide variants and internal sequencing read insertions-deletions, at the cost of limited SV detection. Multi-kilobase DNA fragment mate pair sequencing has supplemented the void in SV detection, but introduced new analytic challenges requiring SV detection tools specifically designed for mate pair sequencing data. Here, we introduce SVachra - Structural Variation Assessment of CHRomosomal Aberrations, a breakpoint calling program that identifies large insertions-deletions, inversions, inter- and intra-chromosomal translocations utilizing both inward and outward facing read types generated by mate pair sequencing.We demonstrate SVachra's utility by executing the program on large-insert (Illumina Nextera) mate pair sequencing data from the personal genome of a single subject (HS1011). An additional data set of long-read (Pacific BioSciences RSII) was also generated to validate SV calls from SVachra and other comparison SV calling programs. SVachra exhibited the highest validation rate and reported the widest distribution of SV types and size ranges when compared to other SV callers.SVachra is a highly specific breakpoint calling program that exhibits a more unbiased SV detection methodology than other callers.
Project description:The identification of genomic rearrangements with high sensitivity and specificity using massively parallel sequencing remains a major challenge, particularly in precision medicine and cancer research. Here, we describe a new method for detecting rearrangements, GRIDSS (Genome Rearrangement IDentification Software Suite). GRIDSS is a multithreaded structural variant (SV) caller that performs efficient genome-wide break-end assembly prior to variant calling using a novel positional de Bruijn graph-based assembler. By combining assembly, split read, and read pair evidence using a probabilistic scoring, GRIDSS achieves high sensitivity and specificity on simulated, cell line, and patient tumor data, recently winning SV subchallenge #5 of the ICGC-TCGA DREAM8.5 Somatic Mutation Calling Challenge. On human cell line data, GRIDSS halves the false discovery rate compared to other recent methods while matching or exceeding their sensitivity. GRIDSS identifies nontemplate sequence insertions, microhomologies, and large imperfect homologies, estimates a quality score for each breakpoint, stratifies calls into high or low confidence, and supports multisample analysis.
Project description:Whole genome sequencing of paired-end reads can be applied to characterize the landscape of large somatic rearrangements of cancer genomes. Several methods for detecting structural variants with whole genome sequencing data have been developed. So far, none of these methods has combined information about abnormally mapped read pairs connecting rearranged regions and associated global copy number changes automatically inferred from the same sequencing data file. Our aim was to create a computational method that could use both types of information, i.e. normal and abnormal reads, and demonstrate that by doing so we can highly improve both sensitivity and specificity rates of structural variant prediction.We developed a computational method, SV-Bay, to detect structural variants from whole genome sequencing mate-pair or paired-end data using a probabilistic Bayesian approach. This approach takes into account depth of coverage by normal reads and abnormalities in read pair mappings. To estimate the model likelihood, SV-Bay considers GC-content and read mappability of the genome, thus making important corrections to the expected read count. For the detection of somatic variants, SV-Bay makes use of a matched normal sample when it is available. We validated SV-Bay on simulated datasets and an experimental mate-pair dataset for the CLB-GA neuroblastoma cell line. The comparison of SV-Bay with several other methods for structural variant detection demonstrated that SV-Bay has better prediction accuracy both in terms of sensitivity and false-positive detection rate.https://github.com/InstitutCurie/SV-Bayvalentina.email@example.comSupplementary data are available at Bioinformatics online.
Project description:BACKGROUND:Structural variants (SVs) in human genomes are implicated in a variety of human diseases. Long-read sequencing delivers much longer read lengths than short-read sequencing and may greatly improve SV detection. However, due to the relatively high cost of long-read sequencing, it is unclear what coverage is needed and how to optimally use the aligners and SV callers. RESULTS:In this study, we developed NextSV, a meta-caller to perform SV calling from low coverage long-read sequencing data. NextSV integrates three aligners and three SV callers and generates two integrated call sets (sensitive/stringent) for different analysis purposes. We evaluated SV calling performance of NextSV under different PacBio coverages on two personal genomes, NA12878 and HX1. Our results showed that, compared with running any single SV caller, NextSV stringent call set had higher precision and balanced accuracy (F1 score) while NextSV sensitive call set had a higher recall. At 10X coverage, the recall of NextSV sensitive call set was 93.5 to 94.1% for deletions and 87.9 to 93.2% for insertions, indicating that ~10X coverage might be an optimal coverage to use in practice, considering the balance between the sequencing costs and the recall rates. We further evaluated the Mendelian errors on an Ashkenazi Jewish trio dataset. CONCLUSIONS:Our results provide useful guidelines for SV detection from low coverage whole-genome PacBio data and we expect that NextSV will facilitate the analysis of SVs on long-read sequencing data.
Project description:Recent studies have demonstrated the genetic significance of insertions, deletions, and other more complex structural variants (SVs) in the human population. With the development of the next-generation sequencing technologies, high-throughput surveys of SVs on the whole-genome level have become possible. Here we present split-read identification, calibrated (SRiC), a sequence-based method for SV detection.We start by mapping each read to the reference genome in standard fashion using gapped alignment. Then to identify SVs, we score each of the many initial mappings with an assessment strategy designed to take into account both sequencing and alignment errors (e.g. scoring more highly events gapped in the center of a read). All current SV calling methods have multilevel biases in their identifications due to both experimental and computational limitations (e.g. calling more deletions than insertions). A key aspect of our approach is that we calibrate all our calls against synthetic data sets generated from simulations of high-throughput sequencing (with realistic error models). This allows us to calculate sensitivity and the positive predictive value under different parameter-value scenarios and for different classes of events (e.g. long deletions vs. short insertions). We run our calculations on representative data from the 1000 Genomes Project. Coupling the observed numbers of events on chromosome 1 with the calibrations gleaned from the simulations (for different length events) allows us to construct a relatively unbiased estimate for the total number of SVs in the human genome across a wide range of length scales. We estimate in particular that an individual genome contains ~670,000 indels/SVs.Compared with the existing read-depth and read-pair approaches for SV identification, our method can pinpoint the exact breakpoints of SV events, reveal the actual sequence content of insertions, and cover the whole size spectrum for deletions. Moreover, with the advent of the third-generation sequencing technologies that produce longer reads, we expect our method to be even more useful.
Project description:BACKGROUND:Accurate catalogs of structural variants (SVs) in mammalian genomes are necessary to elucidate the potential mechanisms that drive SV formation and to assess their functional impact. Next generation sequencing methods for SV detection are an advance on array-based methods, but are almost exclusively limited to four basic types: deletions, insertions, inversions and copy number gains. RESULTS:By visual inspection of 100 Mbp of genome to which next generation sequence data from 17 inbred mouse strains had been aligned, we identify and interpret 21 paired-end mapping patterns, which we validate by PCR. These paired-end mapping patterns reveal a greater diversity and complexity in SVs than previously recognized. In addition, Sanger-based sequence analysis of 4,176 breakpoints at 261 SV sites reveal additional complexity at approximately a quarter of structural variants analyzed. We find micro-deletions and micro-insertions at SV breakpoints, ranging from 1 to 107 bp, and SNPs that extend breakpoint micro-homology and may catalyze SV formation. CONCLUSIONS:An integrative approach using experimental analyses to train computational SV calling is essential for the accurate resolution of the architecture of SVs. We find considerable complexity in SV formation; about a quarter of SVs in the mouse are composed of a complex mixture of deletion, insertion, inversion and copy number gain. Computational methods can be adapted to identify most paired-end mapping patterns.
Project description:Copy number variation (CNV) is a common form of structural variation detected in human genomes, occurring as both constitutional and somatic events. Cytogenetic techniques like chromosomal microarray (CMA) are widely used in analyzing CNVs. However, CMA techniques cannot resolve the full nature of these structural variations (i.e. the orientation and location of associated breakpoint junctions) and must be combined with other cytogenetic techniques, such as karyotyping or FISH, to do so. This makes the development of a next-generation sequencing (NGS) approach capable of resolving both CNVs and breakpoint junctions desirable. Mate-pair sequencing (MPseq) is a NGS technology designed to find large structural rearrangements across the entire genome. Here we present an algorithm capable of performing copy number analysis from mate-pair sequencing data. The algorithm uses a step-wise procedure involving normalization, segmentation, and classification of the sequencing data. The segmentation technique combines both read depth and discordant mate-pair reads to increase the sensitivity and resolution of CNV calls. The method is particularly suited to MPseq, which is designed to detect breakpoint junctions at high resolution. This allows for the classification step to accurately calculate copy number levels at the relatively low read depth of MPseq. Here we compare results for a series of hematological cancer samples that were tested with CMA and MPseq. We demonstrate comparable sensitivity to the state-of-the-art CMA technology, with the benefit of improved breakpoint resolution. The algorithm provides a powerful analytical tool for the analysis of MPseq results in cancer.
Project description:Recently, microarrays have replaced karyotyping as a first tier test in patients with idiopathic intellectual disability and/or multiple congenital abnormalities (ID/MCA) in many laboratories. Although in about 14-18% of such patients, DNA copy-number variants (CNVs) with clinical significance can be detected, microarrays have the disadvantage of missing balanced rearrangements, as well as providing no information about the genomic architecture of structural variants (SVs) like duplications and complex rearrangements. Such information could possibly lead to a better interpretation of the clinical significance of the SV. In this study, the clinical use of mate pair next-generation sequencing was evaluated for the detection and further characterization of structural variants within the genomes of 50 ID/MCA patients. Thirty of these patients carried a chromosomal aberration that was previously detected by array CGH or karyotyping and suspected to be pathogenic. In the remaining 20 patients no causal SVs were found and only benign aberrations were detected by conventional techniques. Combined cluster and coverage analysis of the mate pair data allowed precise breakpoint detection and further refinement of previously identified balanced and (complex) unbalanced aberrations, pinpointing the causal gene for some patients. We conclude that mate pair sequencing is a powerful technology that can provide rapid and unequivocal characterization of unbalanced and balanced SVs in patient genomes and can be essential for the clinical interpretation of some SVs.
Project description:We present a pipeline to perform integrative analysis of mate-pair (MP) and paired-end (PE) genomic DNA sequencing data. Our pipeline detects structural variations (SVs) by taking aligned sequencing read pairs as input and classifying these reads into properly paired and discordantly paired categories based on their orientation and inferred insert sizes. Recurrent SV was identified from the discordant read pairs. Our pipeline takes into account genomic annotation and genome repetitive element information to increase detection specificity. Application of our pipeline to whole-genome MP and PE sequencing data from three multiple myeloma cell lines (KMS11, MM.1S, and RPMI8226) recovered known SVs, such as heterozygous TRAF3 deletion, as well as a novel experimentally validated SPI1 - ZNF287 inter-chromosomal rearrangement in the RPMI8226 cell line.
Project description:SUMMARY: We present SVDetect, a program designed to identify genomic structural variations from paired-end and mate-pair next-generation sequencing data produced by the Illumina GA and ABI SOLiD platforms. Applying both sliding-window and clustering strategies, we use anomalously mapped read pairs provided by current short read aligners to localize genomic rearrangements and classify them according to their type, e.g. large insertions-deletions, inversions, duplications and balanced or unbalanced inter-chromosomal translocations. SVDetect outputs predicted structural variants in various file formats for appropriate graphical visualization. AVAILABILITY: Source code and sample data are available at http://svdetect.sourceforge.net/
Project description:New mutations leading to structural variation (SV) in genomes-in the form of mobile element insertions, large deletions, gene duplications, and other chromosomal rearrangements-can play a key role in microbial evolution. Yet, SV is considerably more difficult to predict from short-read genome resequencing data than single-nucleotide substitutions and indels (SN), so it is not yet routinely identified in studies that profile population-level genetic diversity over time in evolution experiments. We implemented an algorithm for detecting polymorphic SV as part of the breseq computational pipeline. This procedure examines split-read alignments, in which the two ends of a single sequencing read match disjoint locations in the reference genome, in order to detect structural variants and estimate their frequencies within a sample. We tested our algorithm using simulated Escherichia coli data and then applied it to 500- and 1000-generation population samples from the Lenski E. coli long-term evolution experiment (LTEE). Knowledge of genes that are targets of selection in the LTEE and mutations present in previously analyzed clonal isolates allowed us to evaluate the accuracy of our procedure. Overall, SV accounted for ~25% of the genetic diversity found in these samples. By profiling rare SV, we were able to identify many cases where alternative mutations in key genes transiently competed within a single population. We also found, unexpectedly, that mutations in two genes that rose to prominence at these early time points always went extinct in the long term. Because it is not limited by the base-calling error rate of the sequencing technology, our approach for identifying rare SV in whole-population samples may have a lower detection limit than similar predictions of SNs in these data sets. We anticipate that this functionality of breseq will be useful for providing a more complete picture of genome dynamics during evolution experiments with haploid microorganisms.