Automated deconvolution of structured mixtures from heterogeneous tumor genomic data.
ABSTRACT: With increasing appreciation for the extent and importance of intratumor heterogeneity, much attention in cancer research has focused on profiling heterogeneity on a single patient level. Although true single-cell genomic technologies are rapidly improving, they remain too noisy and costly at present for population-level studies. Bulk sequencing remains the standard for population-scale tumor genomics, creating a need for computational tools to separate contributions of multiple tumor clones and assorted stromal and infiltrating cell populations to pooled genomic data. All such methods are limited to coarse approximations of only a few cell subpopulations, however. In prior work, we demonstrated the feasibility of improving cell type deconvolution by taking advantage of substructure in genomic mixtures via a strategy called simplicial complex unmixing. We improve on past work by introducing enhancements to automate learning of substructured genomic mixtures, with specific emphasis on genome-wide copy number variation (CNV) data, as well as the ability to process quantitative RNA expression data, and heterogeneous combinations of RNA and CNV data. We introduce methods for dimensionality estimation to better decompose mixture model substructure; fuzzy clustering to better identify substructure in sparse, noisy data; and automated model inference methods for other key model parameters. We further demonstrate their effectiveness in identifying mixture substructure in true breast cancer CNV data from the Cancer Genome Atlas (TCGA). Source code is available at https://github.com/tedroman/WSCUnmix.
Project description:Genomic variation in Indian populations is of great interest due to the diversity of ancestral components, social stratification, endogamy and complex admixture patterns. With an expanding population of 1.2 billion, India is also a treasure trove to catalogue innocuous as well as clinically relevant rare mutations. Recent studies have revealed four dominant ancestries in populations from mainland India: Ancestral North-Indian (ANI), Ancestral South-Indian (ASI), Ancestral Tibeto-Burman (ATB) and Ancestral Austro-Asiatic (AAA). The 1000 Genomes Project (KGP) Phase-3 data include about 500 genomes from five linguistically defined Indian-Subcontinent (IS) populations (Punjabi, Gujrati, Bengali, Telugu and Tamil) some of whom are recent migrants to USA or UK. Comparative analyses show that despite the distinct geographic origins of the KGP-IS populations, the ANI component is predominantly represented in this dataset. Previous studies demonstrated population substructure in the HapMap Gujrati population, and we found evidence for additional substructure in the Punjabi and Telugu populations. These substructured populations have characteristic/significant differences in heterozygosity and inbreeding coefficients. Moreover, we demonstrate that the substructure is better explained by factors like differences in proportion of ancestral components, and endogamy driven social structure rather than invoking a novel ancestral component to explain it. Therefore, using language and/or geography as a proxy for an ethnic unit is inadequate for many of the IS populations. This highlights the necessity for more nuanced sampling strategies or corrective statistical approaches, particularly for biomedical and population genetics research in India.
Project description:BACKGROUND: Cluster analysis is an important technique for the exploratory analysis of biological data. Such data is often high-dimensional, inherently noisy and contains outliers. This makes clustering challenging. Mixtures are versatile and powerful statistical models which perform robustly for clustering in the presence of noise and have been successfully applied in a wide range of applications. RESULTS: PyMix - the Python mixture package implements algorithms and data structures for clustering with basic and advanced mixture models. The advanced models include context-specific independence mixtures, mixtures of dependence trees and semi-supervised learning. PyMix is licenced under the GNU General Public licence (GPL). PyMix has been successfully used for the analysis of biological sequence, complex disease and gene expression data. CONCLUSIONS: PyMix is a useful tool for cluster analysis of biological data. Due to the general nature of the framework, PyMix can be applied to a wide range of applications and data sets.
Project description:Copy number variation (CNV) is pervasive in the human genome and can play a causal role in genetic diseases. The functional impact of CNV cannot be fully captured through linkage disequilibrium with SNPs. These observations motivate the development of statistical methods for performing direct CNV association studies. We show through simulation that current tests for CNV association are prone to false-positive associations in the presence of differential errors between cases and controls, especially if quantitative CNV measurements are noisy. We present a statistical framework for performing case-control CNV association studies that applies likelihood ratio testing of quantitative CNV measurements in cases and controls. We show that our methods are robust to differential errors and noisy data and can achieve maximal theoretical power. We illustrate the power of these methods for testing for association with binary and quantitative traits, and have made this software available as the R package CNVtools.
Project description:Along with single nucleotide polymorphisms (SNPs), copy number variation (CNV) is considered an important source of genetic variation associated with disease susceptibility. Despite the importance of CNV, the tools currently available for its analysis often produce false positive results due to limitations such as low resolution of array platforms, platform specificity, and the type of CNV. To resolve this problem, spurious signals must be separated from true signals by visual inspection. None of the previously reported CNV analysis tools support this function and the simultaneous visualization of comparative genomic hybridization arrays (aCGH) and sequence alignment. The purpose of the present study was to develop a useful program for the efficient detection and visualization of CNV regions that enables the manual exclusion of erroneous signals.A JAVA-based stand-alone program called Genovar was developed. To ascertain whether a detected CNV region is a novel variant, Genovar compares the detected CNV regions with previously reported CNV regions using the Database of Genomic Variants (DGV, http://projects.tcag.ca/variation) and the Single Nucleotide Polymorphism Database (dbSNP). The current version of Genovar is capable of visualizing genomic data from sources such as the aCGH data file and sequence alignment format files.Genovar is freely accessible and provides a user-friendly graphic user interface (GUI) to facilitate the detection of CNV regions. The program also provides comprehensive information to help in the elimination of spurious signals by visual inspection, making Genovar a valuable tool for reducing false positive CNV results.http://genovar.sourceforge.net/.
Project description:MOTIVATION: Copy number variation (CNV), a major contributor to human genetic variation, comprises >/= 1 kb genomic deletions and insertions. Yet, the identification of CNVs from microarray data is still hampered by high false negative and positive prediction rates due to the noisy nature of the raw data. Here, we present CNVineta, an R package for rapid data mining and visualization of CNVs in large case-control datasets genotyped with single nucleotide polymorphism oligonucleotide arrays. CNVineta is compatible with various established CNV prediction algorithms, can be used for genome-wide association analysis of rare and common CNVs and enables rapid and serial display of log(2) of raw data ratios as well as B-allele frequencies for visual quality inspection. In summary, CNVineta aides in the interpretation of large-scale CNV datasets and prioritization of target regions for follow-up experiments. AVAILABILITY AND IMPLEMENTATION: CNVineta is available as an R package and can be downloaded from http://www.ikmb.uni-kiel.de/CNVineta/; the package contains a tutorial outlining a typical workflow. The CNVineta compatible HapMap dataset can also be downloaded from the link above.
Project description:Currently a few tools are capable of detecting genome-wide Copy Number Variations (CNVs) based on sequencing of multiple samples. Although aberrations in mate pair insertion sizes provide additional hints for the CNV detection based on multiple samples, the majority of the current tools rely only on the depth of coverage. Here, we propose a new algorithm (MSeq-CNV) which allows detecting common CNVs across multiple samples. MSeq-CNV applies a mixture density for modeling aberrations in depth of coverage and abnormalities in the mate pair insertion sizes. Each component in this mixture density applies a Binomial distribution for modeling the number of mate pairs with aberration in the insertion size and also a Poisson distribution for emitting the read counts, in each genomic position. MSeq-CNV is applied on simulated data and also on real data of six HapMap individuals with high-coverage sequencing, in 1000 Genomes Project. These individuals include a CEU trio of European ancestry and a YRI trio of Nigerian ethnicity. Ancestry of these individuals is studied by clustering the identified CNVs. MSeq-CNV is also applied for detecting CNVs in two samples with low-coverage sequencing in 1000 Genomes Project and six samples form the Simons Genome Diversity Project.
Project description:Due to the advancement in high throughput technology, single nucleotide polymorphism (SNP) is routinely being incorporated along with phenotypic information into genetic evaluation. However, this approach often cannot achieve high accuracy for some complex traits. It is possible that SNP markers are not sufficient to predict these traits due to the missing heritability caused by other genetic variations such as microsatellite and copy number variation (CNV), which have been shown to affect disease and complex traits in humans and other species.In this study, CNVs were included in a SNP based genomic selection framework. A Nellore cattle dataset consisting of 2230 animals genotyped on BovineHD SNP array was used, and 9 weight and carcass traits were analyzed. A total of six models were implemented and compared based on their prediction accuracy. For comparison, three models including only SNPs were implemented: 1) BayesA model, 2) Bayesian mixture model (BayesB), and 3) a GBLUP model without polygenic effects. The other three models incorporating both SNP and CNV included 4) a Bayesian model similar to BayesA (BayesA+CNV), 5) a Bayesian mixture model (BayesB+CNV), and 6) GBLUP with CNVs modeled as a covariable (GBLUP+CNV). Prediction accuracies were assessed based on Pearson's correlation between de-regressed EBVs (dEBVs) and direct genomic values (DGVs) in the validation dataset. For BayesA, BayesB and GBLUP, accuracy ranged from 0.12 to 0.62 across the nine traits. A minimal increase in prediction accuracy for some traits was noticed when including CNVs in the model (BayesA+CNV, BayesB+CNV, GBLUP+CNV).This study presents the first genomic prediction study integrating CNVs and SNPs in livestock. Combining CNV and SNP marker information proved to be beneficial for genomic prediction of some traits in Nellore cattle.
Project description:Accurate information on haplotypes and diplotypes (haplotype pairs) is required for population-genetic analyses; however, microarrays do not provide data on a haplotype or diplotype at a copy number variation (CNV) locus; they only provide data on the total number of copies over a diplotype or an unphased sequence genotype (e.g., AAB, unlike AB of single nucleotide polymorphism). Moreover, such copy numbers or genotypes are often incorrectly determined when microarray signal intensities derived from different copy numbers or genotypes are not clearly separated due to noise. Here we report an algorithm to infer CNV haplotypes and individuals' diplotypes at multiple loci from noisy microarray data, utilizing the probability that a signal intensity may be derived from different underlying copy numbers or genotypes. Performing simulation studies based on known diplotypes and an error model obtained from real microarray data, we demonstrate that this probabilistic approach succeeds in accurate inference (error rate: 1-2%) from noisy data, whereas previous deterministic approaches failed (error rate: 12-18%). Applying this algorithm to real microarray data, we estimated haplotype frequencies and diplotypes in 1486 CNV regions for 100 individuals. Our algorithm will facilitate accurate population-genetic analyses and powerful disease association studies of CNVs.
Project description:Many computational methods have been widely used to identify transcription regulatory interactions based on gene expression profiles. The selection of dependency measure is very important for successful regulatory network inference. In this paper, we develop a new method-DBoMM (Difference in BIC of Mixture Models)-for estimating dependency of gene by fitting the gene expression profiles into mixture Gaussian models. We show that DBoMM out-performs 4 other existing methods, including Kendall's tau correlation (TAU), Pearson Correlation (COR), Euclidean distance (EUC) and Mutual information (MI) using Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster, Arabidopsis thaliana data and synthetic data. DBoMM can also identify condition-dependent regulatory interactions and is robust to noisy data. Of the 741 Escherichia coli regulatory interactions inferred by DBoMM at a 60% true positive rate, 65 are previously known interactions and 676 are novel predictions. To validate the new prediction, the promoter sequences of target genes regulated by the same transcription factors were analyzed and significant motifs were identified.
Project description:Copy number variants (CNVs) are a major cause of several genetic disorders, making their detection an essential component of genetic analysis pipelines. Current methods for detecting CNVs from exome-sequencing data are limited by high false-positive rates and low concordance because of inherent biases of individual algorithms. To overcome these issues, calls generated by two or more algorithms are often intersected using Venn diagram approaches to identify "high-confidence" CNVs. However, this approach is inadequate, because it misses potentially true calls that do not have consensus from multiple callers. Here, we present CN-Learn, a machine-learning framework that integrates calls from multiple CNV detection algorithms and learns to accurately identify true CNVs using caller-specific and genomic features from a small subset of validated CNVs. Using CNVs predicted by four exome-based CNV callers (CANOES, CODEX, XHMM, and CLAMMS) from 503 samples, we demonstrate that CN-Learn identifies true CNVs at higher precision (?90%) and recall (?85%) rates while maintaining robust performance even when trained with minimal data (?30 samples). CN-Learn recovers twice as many CNVs compared to individual callers or Venn diagram-based approaches, with features such as exome capture probe count, caller concordance, and GC content providing the most discriminatory power. In fact, ?58% of all true CNVs recovered by CN-Learn were either singletons or calls that lacked support from at least one caller. Our study underscores the limitations of current approaches for CNV identification and provides an effective method that yields high-quality CNVs for application in clinical diagnostics.