Accurate and exact CNV identification from targeted high-throughput sequence data.
ABSTRACT: Massively parallel sequencing of barcoded DNA samples significantly increases screening efficiency for clinically important genes. Short read aligners are well suited to single nucleotide and indel detection. However, methods for CNV detection from targeted enrichment are lacking. We present a method combining coverage with map information for the identification of deletions and duplications in targeted sequence data. Sequencing data is first scanned for gains and losses using a comparison of normalized coverage data between samples. CNV calls are confirmed by testing for a signature of sequences that span the CNV breakpoint. With our method, CNVs can be identified regardless of whether breakpoints are within regions targeted for sequencing. For CNVs where at least one breakpoint is within targeted sequence, exact CNV breakpoints can be identified. In a test data set of 96 subjects sequenced across ~1 Mb genomic sequence using multiplexing technology, our method detected mutations as small as 31 bp, predicted quantitative copy count, and had a low false-positive rate. Application of this method allows for identification of gains and losses in targeted sequence data, providing comprehensive mutation screening when combined with a short read aligner.
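The first step described above, scanning for gains and losses by comparing normalized coverage between samples, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the normalization by total sample depth, the use of the cohort median as reference, and the 0.75/1.25 thresholds are all assumptions chosen for illustration.

```python
from statistics import median

def normalize(depths):
    """Scale one sample's per-target read depths by that sample's total depth."""
    total = sum(depths)
    return [d / total for d in depths]

def copy_ratios(samples, test_id):
    """Per-target ratio of the test sample's normalized depth to the
    median normalized depth of all other samples (assumed reference)."""
    norm = {sid: normalize(d) for sid, d in samples.items()}
    ratios = []
    for i in range(len(norm[test_id])):
        reference = median(norm[sid][i] for sid in norm if sid != test_id)
        ratios.append(norm[test_id][i] / reference)
    return ratios

def call_targets(ratios, loss=0.75, gain=1.25):
    """Label each target; the thresholds are illustrative, not from the abstract."""
    return ["loss" if r < loss else "gain" if r > gain else "neutral"
            for r in ratios]

# Hypothetical read depths for six targets in four multiplexed samples.
samples = {
    "s1": [100, 200, 150, 120, 180, 160],
    "s2": [110, 220, 165, 132, 198, 176],
    "s3": [90, 180, 135, 108, 162, 144],
    "s4": [100, 100, 150, 120, 180, 160],  # half the expected depth at target 1
}
ratios = copy_ratios(samples, "s4")
calls = call_targets(ratios)
print(calls)
```

In this toy cohort only the second target of sample s4 falls below the loss threshold. A ratio-based screen of this kind only flags candidates; the abstract describes a separate confirmation step based on reads spanning the breakpoint, which is not sketched here.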
Project description:Precisely characterizing the breakpoints of copy number variants (CNVs) is crucial for assessing their functional impact. However, fewer than 10% of known germline CNVs have been mapped to the single-nucleotide level. We characterized the sequence breakpoints from a dataset of all CNVs detected in three unrelated individuals in previous array-based CNV discovery experiments. We used targeted hybridization-based DNA capture and 454 sequencing to sequence 324 CNV breakpoints, including 315 deletions. We observed two major breakpoint signatures: 70% of the deletion breakpoints have 1-30 bp of microhomology, whereas 33% of deletion breakpoints contain 1-367 bp of inserted sequence. The co-occurrence of microhomology and inserted sequence is low (10%), suggesting that there are at least two different mutational mechanisms. Approximately 5% of the breakpoints represent more complex rearrangements, including local microinversions, suggesting a replication-based strand switching mechanism. Despite a rich literature on DNA repair processes, reconstruction of the molecular events generating each of these mutations is not yet possible.
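To make the notion of breakpoint microhomology concrete, the following sketch counts the bases that are identical on both sides of a deletion junction, so that the deletion could be shifted by that many positions without changing the resulting sequence. The function name, the 0-based half-open coordinates, and the example sequences are assumptions for illustration; the study's own pipeline is not described at this level of detail.

```python
def microhomology_length(ref, start, end):
    """Microhomology (in bp) around a deletion of ref[start:end].

    Coordinates are 0-based and half-open. The count is the number of
    positions by which the deletion can slide left or right while
    producing the same post-deletion sequence.
    """
    # Rightward: leading deleted bases that match the bases just after the deletion.
    right = 0
    while end + right < len(ref) and ref[start + right] == ref[end + right]:
        right += 1
    # Leftward: bases just before the deletion that match the trailing deleted bases.
    left = 0
    while start - 1 - left >= 0 and ref[start - 1 - left] == ref[end - 1 - left]:
        left += 1
    return left + right

# Hypothetical sequences: left flank + deleted segment + right flank.
ref_mh = "TTGCA" + "CGTAGGATCAT" + "CGTCCA"   # deleted segment and right flank both start with CGT
ref_blunt = "TTGCA" + "GGGTTTACC" + "ATATAT"  # no shared bases at the junction

print(microhomology_length(ref_mh, 5, 16))    # 3
print(microhomology_length(ref_blunt, 5, 14)) # 0
```

A result of 0 corresponds to a blunt junction. Deletions inside tandem repeats can yield counts longer than the deleted segment itself, which a real analysis would probably want to cap or classify separately.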
Project description:Copy-number variants (CNVs) are an abundant form of genetic variation in humans. However, approaches for determining exact CNV breakpoint sequences (physical deletion or duplication boundaries) across individuals, crucial for associating genotype to phenotype, have been lacking so far, and the vast majority of CNVs have been reported with approximate genomic coordinates only. Here, we report an approach, called BreakPtr, for fine-mapping CNVs (available from http://breakptr.gersteinlab.org). We statistically integrate both sequence characteristics and data from high-resolution comparative genome hybridization experiments in a discrete-valued, bivariate hidden Markov model. Incorporation of nucleotide-sequence information allows us to take into account the fact that recently duplicated sequences (e.g., segmental duplications) often coincide with breakpoints. In anticipation of an upcoming increase in CNV data, we developed an iterative, "active" approach to initially scoring with a preliminary model, performing targeted validations, retraining the model, and then rescoring, and a flexible parameterization system that intuitively collapses from a full model of 2,503 parameters to a core one of only 10. Using our approach, we accurately mapped >400 breakpoints on chromosome 22 and a region of chromosome 11, refining the boundaries of many previously approximately mapped CNVs. Four predicted breakpoints flanked known disease-associated deletions. We validated an additional four predicted CNV breakpoints by sequencing. Overall, our results suggest a predictive resolution of approximately 300 bp. This level of resolution enables more precise correlations between CNVs and across individuals than previously possible, allowing the study of CNV population frequencies. Further, it enabled us to demonstrate a clear Mendelian pattern of inheritance for one of the CNVs.
Project description:Copy number variations (CNVs) are the major type of structural variation in the human genome, and are more common than DNA sequence variations in populations. CNVs are important factors for human genetic and phenotypic diversity. Many CNVs have been associated with either resistance to diseases or identified as the cause of diseases. Currently little is known about the role of CNVs in causing deafness. CNVs are currently not analyzed by conventional genetic analysis methods to study deafness. Here we detected both DNA sequence variations and CNVs affecting 80 genes known to be required for normal hearing. Coding regions of the deafness genes were captured by a hybridization-based method and processed through the standard next-generation sequencing (NGS) protocol using the Illumina platform. Samples hybridized together in the same reaction were analyzed to obtain CNVs. A read depth based method was used to measure CNVs at the resolution of a single exon. Results were validated by the quantitative PCR (qPCR) based method. Among 79 sporadic cases clinically diagnosed with sensorineural hearing loss, we identified previously-reported disease-causing sequence mutations in 16 cases. In addition, we identified a total of 97 CNVs (72 CNV gains and 25 CNV losses) in 27 deafness genes. The CNVs included homozygous deletions which may directly give rise to deleterious effects on protein functions known to be essential for hearing, as well as heterozygous deletions and CNV gains compounded with sequence mutations in deafness genes that could potentially harm gene functions. We studied how CNVs in known deafness genes may result in deafness. Data provided here served as a basis to explain how CNVs disrupt normal functions of deafness genes. These results may significantly expand our understanding about how various types of genetic mutations cause deafness in humans.
Project description:The detailed study of breakpoints associated with copy number variants (CNVs) can elucidate the mutational mechanisms that generate them and the comparison of breakpoints across species can highlight differences in genomic architecture that may lead to lineage-specific differences in patterns of CNVs. Here, we provide a detailed analysis of Drosophila CNV breakpoints and contrast it with similar analyses recently carried out for the human genome. By applying split-read methods to a total of 10x coverage of 454 shotgun sequence across nine lines of D. melanogaster and by re-examining a previously published dataset of CNVs detected using tiling arrays, we identified the precise breakpoints of more than 600 insertions, deletions, and duplications. Contrasting these CNVs with those found in humans showed that in both taxa CNV breakpoints fall into three classes: blunt breakpoints; simple breakpoints associated with microhomology; and breakpoints with additional nucleotides inserted/deleted and no microhomology. In both taxa CNV breakpoints are enriched with non-B DNA sequence structures, which may impair DNA replication and/or repair. However, in contrast to human genomes, non-allelic homologous-recombination (NAHR) plays a negligible role in CNV formation in Drosophila. In flies, non-homologous repair mechanisms are responsible for simple, recurrent, and complex CNVs, including insertions of de novo sequence as large as 60 bp. Humans and Drosophila differ considerably in the importance of homology-based mechanisms for the formation of CNVs, likely as a consequence of the differences in the abundance and distribution of both segmental duplications and transposable elements between the two genomes.
Project description:Antimalarial resistance is a major obstacle in the eradication of the human malaria parasite, Plasmodium falciparum. Genome amplifications, a type of DNA copy number variation (CNV), facilitate overexpression of drug targets and contribute to parasite survival. Long monomeric A/T tracks are found at the breakpoints of many Plasmodium resistance-conferring CNVs. We hypothesize that other proximal sequence features, such as DNA hairpins, act with A/T tracks to trigger CNV formation. By adapting a sequence analysis pipeline to investigate previously reported CNVs, we identified breakpoints in 35 parasite clones with near single base-pair resolution. Using parental genome sequence, we predicted the formation of stable hairpins within close proximity to all future breakpoint locations. Especially stable hairpins were predicted to form near five shared breakpoints, establishing that the initiating event could have occurred at these sites. Further in-depth analyses defined characteristics of these 'trigger sites' across the genome and detected signatures of error-prone repair pathways at the breakpoints. We propose that these two genomic signals form the initial lesion (hairpins) and facilitate microhomology-mediated repair (A/T tracks) that lead to CNV formation across this highly repetitive genome. Targeting these repair pathways in P. falciparum may be used to block adaptation to antimalarial drugs.
Project description:Copy number variations (CNVs) are gains and losses of DNA sequence in a genome. High-throughput platforms such as microarrays and next-generation sequencing (NGS) technologies have been applied to detect copy number losses genome-wide. Although progress has been made with both approaches, the accuracy and consistency of CNV calling from the two platforms remain in dispute. In this study, we perform a deep analysis of copy number losses in 254 human DNA samples that have both SNP microarray data and NGS data publicly available from the HapMap Project and the 1000 Genomes Project, respectively. We show that the copy number losses reported by the HapMap Project and the 1000 Genomes Project overlap by < 30%, even though state-of-the-art calling methods were employed and both projects required their reports to have cross-platform experimental support (e.g. PCR, microarray and high-throughput sequencing). On the other hand, almost all of the copy number losses found directly from HapMap microarray data by an accurate algorithm, CNVhac, have lower read mapping depth in NGS data; furthermore, 88% of them are supported by NGS sequences containing the breakpoint. Our results suggest that microarrays are capable of calling CNVs and that the unessential requirement for additional cross-platform support may introduce false negatives. The inconsistency between the CNV reports of the HapMap Project and the 1000 Genomes Project might result from inadequate information contained in microarray data, inconsistent detection criteria, or the filtering effect of cross-platform support. Statistical tests on the CNVs called by CNVhac show that microarray data can offer reliable CNV reports and that the majority of CNV candidates can be confirmed by raw sequences. Therefore, CNV candidates given by a good caller can be highly reliable without cross-platform support, so additional experimental information should be applied when needed rather than as a requirement.
Project description:Motivation: Single nucleotide polymorphism (SNP) array is the most widely used platform to assess somatic copy number variations (CNVs) in cancer studies. Many SNP data-based CNV callers are available; however, the false positive rates from automated calling are commonly high, and reported breakpoints can be inaccurate. Manual review of each reported CNV by visualizing the SNP data is important, but is challenging for users lacking computational experience. To address this, we present a Shiny/R application, ShinyCNV, an interactive graphical user interface to view and annotate CNVs. Results: With this application, normalized SNP data, which includes log R ratio (LRR) and B allele frequency, can be plotted against the reported CNVs, and users can visually check the reliability of CNVs per se or adjust incorrectly assigned breakpoints. Further, the interactive LRR spectrum panel within ShinyCNV can facilitate the identification of commonly affected CNV regions in a group of samples, and allows users to visually check whether important focal gains/losses are missing from reported CNVs. ShinyCNV is designed to be intuitive for cancer researchers and can be easily installed for personal use or deployed on servers to provide online service. Availability and implementation: ShinyCNV and the tutorial are freely available from https://github.com/gzhmat/ShinyCNV. Supplementary information: Supplementary data are available at Bioinformatics online.
Project description:Somatic copy number variations (CNVs) play a crucial role in development of many human cancers. The broad availability of next-generation sequencing data has enabled the development of algorithms to computationally infer CNV profiles from a variety of data types including exome and targeted sequence data, currently the most prevalent types of cancer genomics data. However, systematic evaluation and comparison of these tools remains challenging due to a lack of ground truth reference sets. To address this need, we have developed Bamgineer, a tool written in Python to introduce user-defined haplotype-phased allele-specific copy number events into an existing Binary Alignment Map (BAM) file, with a focus on targeted and exome sequencing experiments. As input, this tool requires a read alignment file (BAM format), lists of non-overlapping genome coordinates for introduction of gains and losses (bed file), and an optional file defining known haplotypes (vcf format). To improve runtime performance, Bamgineer introduces the desired CNVs in parallel using queuing and parallel processing on a local machine or on a high-performance computing cluster. As proof-of-principle, we applied Bamgineer to a single high-coverage (mean: 220X) exome sequence file from a blood sample to simulate copy number profiles of 3 exemplar tumors from each of 10 tumor types at 5 tumor cellularity levels (20-100%, 150 BAM files in total). To demonstrate feasibility beyond exome data, we introduced read alignments to a targeted 5-gene cell-free DNA sequencing library to simulate EGFR amplifications at frequencies consistent with circulating tumor DNA (10, 1, 0.1 and 0.01%) while retaining the multimodal insert size distribution of the original data. We expect Bamgineer to be of use for development and systematic benchmarking of CNV calling algorithms by users using locally-generated data for a variety of applications. The source code is freely available at http://github.com/pughlab/bamgineer.
Project description:Interpreting the genomic and phenotypic consequences of copy-number variation (CNV) is essential to understanding the etiology of genetic disorders. Whereas deletion CNVs lead obviously to haploinsufficiency, duplications might cause disease through triplosensitivity, gene disruption, or gene fusion at breakpoints. The mutational spectrum of duplications has been studied at certain loci, and in some cases these copy-number gains are complex chromosome rearrangements involving triplications and/or inversions. However, the organization of clinically relevant duplications throughout the genome has yet to be investigated on a large scale. Here we fine-mapped 184 germline duplications (14.7 kb-25.3 Mb; median 532 kb) ascertained from individuals referred for diagnostic cytogenetics testing. We performed next-generation sequencing (NGS) and whole-genome sequencing (WGS) to sequence 130 breakpoints from 112 subjects with 119 CNVs and found that most (83%) were tandem duplications in direct orientation. The remainder were triplications embedded within duplications (8.4%), adjacent duplications (4.2%), insertional translocations (2.5%), or other complex rearrangements (1.7%). Moreover, we predicted six in-frame fusion genes at sequenced duplication breakpoints; four gene fusions were formed by tandem duplications, one by two interconnected duplications, and one by duplication inserted at another locus. These unique fusion genes could be related to clinical phenotypes and warrant further study. Although most duplications are positioned head-to-tail adjacent to the original locus, those that are inverted, triplicated, or inserted can disrupt or fuse genes in a manner that might not be predicted by conventional copy-number assays. Therefore, interpreting the genetic consequences of duplication CNVs requires breakpoint-level analysis.
Project description:BACKGROUND: Targeted next-generation sequencing (NGS) is increasingly being adopted in clinical laboratories for genomic diagnostic tests. RESULTS: We developed a new computational method, DeviCNV, intended for the detection of exon-level copy number variants (CNVs) in targeted NGS data. DeviCNV builds linear regression models with bootstrapping for every probe to capture the relationship between read depth of an individual probe and the median of read depth values of all probes in the sample. From the regression models, it estimates the read depth ratio of the observed and predicted read depth with confidence interval for each probe which is applied to a circular binary segmentation (CBS) algorithm to obtain CNV candidates. Then, it assigns confidence scores to those candidates based on the reliability and strength of the CNV signals inferred from the read depth ratios of the probes within them. Finally, it also provides gene-centric plots with confidence levels of CNV candidates for visual inspection. We applied DeviCNV to targeted NGS data generated for newborn screening and demonstrated its ability to detect novel pathogenic CNVs from clinical samples. CONCLUSIONS: We propose a new pragmatic method for detecting CNVs in targeted NGS data with an intuitive visualization and a systematic method to assign confidence scores for candidate CNVs. Since DeviCNV was developed for use in clinical diagnosis, sensitivity is increased by the detection of exon-level CNVs.
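The per-probe regression step described above can be sketched as follows. This is an illustrative approximation and not DeviCNV's code: the function names, the ordinary least squares fit, the percentile bootstrap interval, the number of resamples, and the reference data are all assumptions. The segmentation (CBS) and confidence-scoring steps are not shown.

```python
import random
from statistics import mean, median

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample: fall back to a flat line
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def depth_ratio_with_ci(ref_medians, ref_depths, test_median, test_depth,
                        n_boot=500, seed=0):
    """Observed/predicted read depth ratio for one probe, with a 95%
    percentile bootstrap interval (assumed form of the interval)."""
    rng = random.Random(seed)
    n = len(ref_medians)
    ratios = []
    for _ in range(n_boot):
        # Resample reference samples with replacement and refit the model.
        idx = [rng.randrange(n) for _ in range(n)]
        a, b = fit_line([ref_medians[i] for i in idx],
                        [ref_depths[i] for i in idx])
        predicted = a + b * test_median
        if predicted > 0:
            ratios.append(test_depth / predicted)
    if not ratios:
        raise ValueError("no usable bootstrap fits for this probe")
    ratios.sort()
    lo = ratios[int(0.025 * len(ratios))]
    hi = ratios[int(0.975 * len(ratios)) - 1]
    return median(ratios), lo, hi

# Hypothetical reference samples: sample-wide median depth vs. depth at one probe.
ref_medians = [80, 100, 120, 90, 110, 105, 95, 115]
ref_depths = [161, 198, 243, 179, 222, 208, 192, 228]

# Hypothetical test sample with about half the predicted depth at this probe.
ratio, lo, hi = depth_ratio_with_ci(ref_medians, ref_depths,
                                    test_median=100, test_depth=100)
print(round(ratio, 2), round(lo, 2), round(hi, 2))
```

A ratio near 0.5 with a narrow interval would be consistent with a heterozygous deletion at this probe, whereas a wide interval signals a probe whose depth is poorly predicted by the sample median and whose ratio should carry less weight.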