Comparison of alternative mixture model methods to analyze bacterial CGH experiments with multi-genome arrays.
ABSTRACT: BACKGROUND: Microarray-based comparative genomic hybridization (aCGH) is used for rapid comparison of genomes of different bacterial strains. The purpose is to evaluate the distribution of genes from sequenced bacterial strains (control) among unsequenced strains (test). We previously compared the use of single strain versus multiple strain control with arrays covering multiple genomes. The conclusion was that a multiple strain control promoted a better separation of signals between present and absent genes. FINDINGS: We now extend our previous study by applying the Expectation-Maximization (EM) algorithm to fit a mixture model to the signal distribution in order to classify each gene as present or absent and by comparing different methods for analyzing aCGH data, using combinations of different control strain choices, two different statistical mixture models, with or without normalization, with or without logarithm transformation and with test-over-control or inverse signal ratio calculation. We also assessed the impact of replication on classification accuracy. Higher values of accuracy have been achieved using the ratio of control-over-test intensities, without logarithmic transformation and with a strain mix control. Normalization and the type of mixture model fitted by the EM algorithm did not have a significant impact on classification accuracy. Similarly, using the average of replicate arrays to perform the classification does not significantly improve the results. CONCLUSIONS: Our work provides a guiding benchmark comparison of alternative methods to analyze aCGH results that can impact on the analysis of currently ongoing comparative genomic projects or in the re-analysis of published studies.
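The classification step described above can be illustrated with a minimal sketch, assuming a two-component Gaussian mixture fitted by EM to simulated control-over-test ratios. The function name `fit_two_gaussians`, the simulated means and spreads, and the 0.5 posterior cut-off are illustrative assumptions, not the models or settings used in the study.

```python
import numpy as np

def fit_two_gaussians(x, n_iter=200, tol=1e-8):
    """EM for a 1-D mixture of two Gaussians (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    # Initialise the two means from the 10th and 90th percentiles (assumption).
    mu = np.percentile(x, [10, 90]).astype(float)
    sd = np.full(2, x.std() + 1e-9)
    w = np.array([0.5, 0.5])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each gene.
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        total = dens.sum(axis=1, keepdims=True) + 1e-300
        resp = dens / total
        # M-step: update weights, means and standard deviations.
        nk = resp.sum(axis=0)
        w = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-9
        # Stop when the log-likelihood no longer changes.
        ll = float(np.log(total).sum())
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return w, mu, sd, resp

# Simulated control-over-test ratios: absent genes give a high ratio (assumed values).
rng = np.random.default_rng(0)
truth_absent = np.r_[np.zeros(800, dtype=bool), np.ones(200, dtype=bool)]
ratio = np.where(truth_absent, rng.normal(4.0, 0.8, 1000), rng.normal(1.0, 0.2, 1000))

w, mu, sd, resp = fit_two_gaussians(ratio)
absent_comp = int(np.argmax(mu))            # component with the higher mean ratio
called_absent = resp[:, absent_comp] > 0.5  # posterior cut-off (assumption)
accuracy = float((called_absent == truth_absent).mean())
```

With the control-over-test ratio, the component with the larger mean is read as "absent"; with the test-over-control ratio the roles would be swapped.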
Project description:BACKGROUND: Microarray comparative genomic hybridization (aCGH) evaluates the distribution of genes of sequenced bacterial strains among unsequenced strains of the same or related species. As genomic sequences from multiple strains of the same species become available, multistrain microarrays are designed, containing spots for every unique gene in all sequenced strains. To perform two-color aCGH experiments with multistrain microarrays, the choice of control sample can be the genomic DNA of one strain or a mixture of all the strains used in the array design. This important problem has no universally accepted solution. RESULTS: We performed a comparative study of the two control sample options with a Streptococcus pneumoniae microarray designed with three fully sequenced strains. We separately hybridized two of these strains (R6 and G54) as test samples using the third strain alone (TIGR4) or a mixture of the three strains as control. We show that for both types of control it is advantageous to analyze spots in separate sets according to their expected control channel signal (5-15% AUC increase). Following this analysis, the use of a mix control leads to higher accuracies (5% increase). This enhanced performance is due to gains in sensitivity (21% increase, p = 0.001) that compensate minor losses in specificity (5% decrease, p = 0.014). CONCLUSION: The use of a single strain control increases the error rate in genes that are part of the accessory genome, where more variation across unsequenced strains is expected, further justifying the use of the mix control.
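The performance measures quoted above (sensitivity, specificity, accuracy, AUC) can be computed as in the following sketch. The helper names and the toy scores are hypothetical, and "present" is treated as the positive class, which is an assumption about the study's convention.

```python
import numpy as np

def confusion_metrics(truth_present, called_present):
    """Sensitivity, specificity and accuracy with 'present' as the positive class."""
    t = np.asarray(truth_present, dtype=bool)
    c = np.asarray(called_present, dtype=bool)
    tp = np.sum(t & c)
    tn = np.sum(~t & ~c)
    fp = np.sum(~t & c)
    fn = np.sum(t & ~c)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / t.size,
    }

def auc(truth_present, score):
    """AUC as the probability that a present gene outscores an absent one (ties count 0.5)."""
    t = np.asarray(truth_present, dtype=bool)
    s = np.asarray(score, dtype=float)
    pos, neg = s[t], s[~t]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

# Toy example: five genes, the first three truly present.
truth = np.array([1, 1, 1, 0, 0], dtype=bool)
score = np.array([0.9, 0.8, 0.3, 0.4, 0.1])   # hypothetical presence scores
metrics = confusion_metrics(truth, score > 0.5)
area = auc(truth, score)
```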
Project description:BACKGROUND: Array-based comparative genome hybridization (aCGH) is commonly used to determine the genomic content of bacterial strains. Since prokaryotes in general have less conserved genome sequences than eukaryotes, sequence divergences between the genes in the genomes used for an aCGH experiment obstruct determination of genome variations (e.g. deletions). Current normalization methods do not take into consideration sequence divergence between target and microarray features and therefore cannot distinguish a difference in signal due to systematic errors in the data or due to sequence divergence. RESULTS: We present supervised Lowess, or S-Lowess, an application of the subset Lowess normalization method. By using a predicted subset of array features with minimal sequence divergence between the analyzed strains for the normalization procedure we remove systematic errors from dual-dye aCGH data in two steps: (1) determination of a subset of conserved genes (i.e. likely conserved genes, LCG); and (2) using the LCG for subset Lowess normalization. Subset Lowess determines the correction factors for systematic errors in the subset of array features and normalizes all array features using these correction factors. The performance of S-Lowess was assessed on aCGH experiments in which differentially labeled genomic DNA fragments of Lactococcus lactis IL1403 and L. lactis MG1363 strains were hybridized to IL1403 DNA microarrays. Since both genomes are sequenced and gene deletions identified, the success rate of different aCGH normalization methods in detecting these deletions in the MG1363 genome were determined. S-Lowess detects 97% of the deletions, whereas other aCGH normalization methods detect up to only 60% of the deletions. CONCLUSION: S-Lowess is implemented in a user-friendly web-tool accessible from http://bioinformatics.biol.rug.nl/websoftware/s-lowess. 
We demonstrate that it outperforms existing normalization methods and maximizes detection of genomic variation (e.g. deletions) from microbial aCGH data.
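The two-step idea (estimate the correction on conserved features only, then apply it to every feature) can be sketched as follows. This is a simplified stand-in, not the S-Lowess implementation: it uses a plain tricube-weighted local linear fit without robustness iterations, and the function names, the span of 0.7 and the simulated intensity-dependent bias are all assumptions.

```python
import numpy as np

def local_linear_fit(a_fit, m_fit, a_eval, frac=0.7):
    """Tricube-weighted local linear regression (lowess without robustness iterations)."""
    a_fit = np.asarray(a_fit, dtype=float)
    m_fit = np.asarray(m_fit, dtype=float)
    a_eval = np.asarray(a_eval, dtype=float)
    k = max(2, int(np.ceil(frac * a_fit.size)))
    out = np.empty(a_eval.size)
    for i, a0 in enumerate(a_eval):
        d = np.abs(a_fit - a0)
        h = np.sort(d)[k - 1] + 1e-12            # bandwidth: distance to the k-th neighbour
        wts = np.clip(1 - (d / h) ** 3, 0, None) ** 3
        # Weighted least squares for intercept and slope, centred at a0.
        X = np.column_stack([np.ones_like(a_fit), a_fit - a0])
        sw = np.sqrt(wts)
        beta, *_ = np.linalg.lstsq(X * sw[:, None], m_fit * sw, rcond=None)
        out[i] = beta[0]
    return out

def subset_lowess(log_test, log_ctrl, subset_mask, frac=0.7):
    """Fit the intensity-dependent trend on the subset, subtract it from all features."""
    m = log_test - log_ctrl              # log ratio
    a = 0.5 * (log_test + log_ctrl)      # mean log intensity
    trend = local_linear_fit(a[subset_mask], m[subset_mask], a, frac)
    return m - trend

# Simulated array: 240 conserved features and 60 deleted ones (assumed numbers).
rng = np.random.default_rng(1)
n = 300
a_true = rng.uniform(8, 14, n)
conserved = np.arange(n) < 240
bias = 0.3 * (a_true - 11)               # intensity-dependent systematic error
m_raw = bias + rng.normal(0, 0.1, n) + np.where(conserved, 0.0, -3.0)
log_test = a_true + m_raw / 2
log_ctrl = a_true - m_raw / 2

m_norm = subset_lowess(log_test, log_ctrl, conserved)
```

Because the trend is estimated only from features expected to hybridize equally well in both strains, the strongly negative ratios of deleted genes do not pull the correction curve downwards.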
Project description:Background: Array-based comparative genome hybridization (aCGH) is commonly used to determine the genomic content of bacterial strains. In aCGH data, systematic errors are comparable to those occurring in transcriptome data. However, especially for microbes, an additional source of variation exists: differences in hybridization due to gene sequence divergence between the strains hybridized. Current normalization methods do not take this source of variation into consideration. Results: We present Supervised Lowess, or S-Lowess, an application of the subset Lowess normalization method that does take differences in genomic content into account. The performance of S-Lowess was assessed on aCGH experiments in which differentially labeled genomic DNA fragments of Lactococcus lactis IL1403 and L. lactis MG1363 strains were hybridized to IL1403 microarrays. Since both genome sequences are known (they share only 85% sequence identity on average), the success rates of different aCGH normalization methods in detecting deletions in the MG1363 genome can be compared. S-Lowess detects 97% of the deletions, whereas other aCGH normalization methods detect at most 60% of the deletions. Conclusions: S-Lowess removes systematic errors from dual-dye aCGH data in two steps: (i) determination of likely homologous genes (LHG); and (ii) estimation of correction factors for systematic errors from spots of LHG and subset Lowess normalization of the remaining spots using these correction factors. It is implemented in a user-friendly web-tool accessible from http://bioinformatics.biol.rug.nl/websoftware/s-lowess. We demonstrate that it outperforms existing normalization methods and maximizes the number of detectable genomic deletions or duplications from microbial aCGH data. Keywords: comparative genome hybridisation. In this study, 4 aCGH comparisons (slides) between L. lactis MG1363 and L. lactis IL1403 were performed (including dye swap with biological replicates; see also supplementary materials). The resulting aCGH slide signals were normalized using the different 'likely homologous gene' (LHG) sets (see above), yielding ratios of signals of labelled DNA of MG1363 over those of IL1403. A maximum of 8 ratios per amplicon (gene) was obtained for the 4 hybridized slides (each with 2 replicate spots per amplicon). Only genes with at least 5 measurements were used in this study. The normalization methods evaluated in this study are: a) no normalization, b) total signal normalization, c) grid-based Lowess (implemented in PreP; f = 0.7) [García de la Nava J, et al. 2003. Bioinformatics 19:2328-2329], and d) S-Lowess using different subsets of conserved lactococcal genes (for details see above). Results with the MANOR R package (spatial normalization; standard parameters) [Neuvial P, et al. 2006. BMC Bioinformatics 7:264; Liva S, et al. 2006. Nucleic Acids Res 34:W477-W481] are shown in our supplementary materials.
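The measurement filter described above (up to 8 ratios per gene, at least 5 required) can be sketched as follows. The array layout (one row per gene, missing spots stored as NaN), the helper name and the use of the median as a per-gene summary are assumptions for illustration.

```python
import numpy as np

def filter_by_measurements(ratios, min_count=5):
    """Flag genes that have at least `min_count` usable (finite) ratios."""
    ratios = np.asarray(ratios, dtype=float)
    counts = np.isfinite(ratios).sum(axis=1)
    return counts >= min_count, counts

# Toy data: 3 genes x 8 possible ratios (4 slides x 2 replicate spots); NaN = missing spot.
nan = np.nan
ratios = np.array([
    [1.0, 1.1, 0.9, 1.0, 1.2, nan, nan, nan],   # 5 measurements: kept
    [0.2, 0.3, 0.2, 0.4, nan, nan, nan, nan],   # 4 measurements: dropped
    [0.9, 1.0, 1.1, 1.0, 0.8, 1.2, 1.0, 0.9],   # 8 measurements: kept
])
keep, counts = filter_by_measurements(ratios)
# Summarise each retained gene by the median of its available ratios (assumption).
gene_median = np.nanmedian(ratios[keep], axis=1)
```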
Project description:Modeling correlation structures is a challenge in bioinformatics, especially when dealing with high-throughput genomic data. A compound hierarchical correlated beta mixture (CBM) with an exchangeable correlation structure is proposed to cluster genetic vectors into mixture components. The correlation coefficient, [Formula: see text], is homogeneous within a mixture component and heterogeneous between mixture components. A random CBM with [Formula: see text] brings more flexibility in explaining correlation variations among genetic variables. The Expectation-Maximization (EM) algorithm and the Stochastic Expectation-Maximization (SEM) algorithm are used to estimate the parameters of the CBM. The number of mixture components can be determined using model selection criteria such as AIC, BIC and ICL-BIC. Extensive simulation studies were conducted to compare EM, SEM and the model selection criteria. Simulation results suggest that CBM outperforms the traditional beta mixture model, with lower estimation bias and higher classification accuracy. The proposed method is applied to cluster transcription factor-DNA binding probabilities in mouse genome data generated by Lahdesmaki and others (2008, Probabilistic inference of transcription factor binding from multiple data sources. PLoS One, 3, e1820). The results reveal distinct clusters of transcription factors when binding to promoter regions of genes in the JAK-STAT, MAPK and two other pathways.
Project description:Trichoderma reesei is the main industrial producer of cellulases and hemicellulases used to depolymerize biomass in many biotechnical applications. Many production strains in use have been generated by classical mutagenesis. In this study we characterized genomic alterations in hyperproducing mutants of T. reesei using a high-resolution comparative genomic hybridisation tiling array. We carried out aCGH analysis of four hyperproducing strains (QM9123, QM9414, NG14 and RutC-30) using the QM6a genome as a reference. The aCGH analysis identified dozens of mutations in each strain analyzed. A custom aCGH array with 2.1 million oligonucleotide probes (HD2 format, Roche NimbleGen) was designed according to T. reesei strain QM6a genome v2.0 (http://genome.jgi-psf.org/Trire2/Trire2.home.html). 14 samples are included in this set; 3 replicates of each strain (except QM9123, with two replicates) were analyzed (four mutant strains and the QM6a control strain for self-hybridization).
Project description:To determine how genomic structural variation changed the phenotypes of yeast, aCGH and RNA-Seq were performed to reveal the differences in the genomic structures and transcription of ZTW1 and ZGR3. This SuperSeries is composed of the following subset Series: GSE40905: Transcription profile analysis of S. cerevisiae ZTW1 wild-type and mutant strains; GSE41108: Comparsion of the genomic structures between S. cerevisiae strains ZGR3 and BYZ1. In the aCGH experiment, strain BYZ1 (S288c background) was used as the control. In the RNA-Seq experiment, total RNA was extracted from three independent cultures of each yeast strain. For each sample, the three cDNA libraries were mixed before sequencing.
Project description:BACKGROUND: Array comparative genomic hybridization (aCGH) is a popular technique for detection of genomic copy number imbalances. These play a critical role in the onset of various types of cancer. In the analysis of aCGH data, normalization is deemed a critical pre-processing step. In general, aCGH normalization approaches are similar to those used for gene expression data, although the two data types differ inherently. A particular problem with aCGH data is that imbalanced copy numbers lead to improper normalization using conventional methods. RESULTS: In this study we present a novel method, called CGHnormaliter, which addresses this issue by means of an iterative normalization procedure. First, provisory balanced copy numbers are identified and subsequently used for normalization. These two steps are then iterated to refine the normalization. We tested our method on three well-studied tumor-related aCGH datasets with experimentally confirmed copy numbers. Results were compared to a conventional normalization approach and two more recent state-of-the-art aCGH normalization strategies. Our findings show that, compared to these three methods, CGHnormaliter yields a higher specificity and precision in terms of identifying the 'true' copy numbers. CONCLUSION: We demonstrate that the normalization of aCGH data can be significantly enhanced using an iterative procedure that effectively eliminates the effect of imbalanced copy numbers. This also leads to a more reliable assessment of aberrations. An R-package containing the implementation of CGHnormaliter is available at http://www.ibi.vu.nl/programs/cghnormaliterwww.
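The iterate-and-refine idea can be illustrated with a deliberately simplified sketch. The abstract does not say how balanced probes are identified or which normalization is applied to them, so the fixed band around zero and the median shift used here are assumptions, and `iterative_center` is a hypothetical name rather than part of the CGHnormaliter package.

```python
import numpy as np

def iterative_center(log2_ratio, band=0.3, max_iter=20, tol=1e-6):
    """Repeatedly re-centre on probes that currently look balanced (simplified sketch)."""
    x = np.asarray(log2_ratio, dtype=float).copy()
    x -= np.median(x)                        # conventional starting point: global median
    balanced = np.abs(x) < band
    for _ in range(max_iter):
        balanced = np.abs(x) < band          # step 1: provisional balanced probes
        if not balanced.any():
            break
        shift = np.median(x[balanced])       # step 2: normalise using those probes only
        x -= shift
        if abs(shift) < tol:
            break
    return x, balanced

# Simulated sample: 60% balanced probes, 40% gained (+1 on the log2 scale), global bias +0.4.
rng = np.random.default_rng(2)
true_balanced = np.arange(1000) < 600
raw = 0.4 + np.where(true_balanced, 0.0, 1.0) + rng.normal(0, 0.1, 1000)

median_only = raw - np.median(raw)
refined, balanced_mask = iterative_center(raw)
offset_median_only = float(np.median(median_only[true_balanced]))
offset_refined = float(np.median(refined[true_balanced]))
```

In this toy setting the global median is pulled towards the gained probes, so the balanced probes end up below zero after conventional centring; re-centring on the provisional balanced set removes most of that offset.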
Project description:Genomic instability is one of the fundamental factors in tumorigenesis and tumor progression. Many studies have shown that copy-number abnormalities at the DNA level are important in the pathogenesis of cancer. Array comparative genomic hybridization (aCGH), developed based on expression microarray technology, can reveal chromosomal aberrations in segmental copies at high resolution. However, due to the nature of aCGH, many standard expression data processing tools, such as data normalization, often fail to yield satisfactory results. We demonstrated a novel aCGH normalization algorithm, which provides accurate aCGH data normalization by utilizing the dependency of neighboring probe measurements in aCGH experiments. To facilitate the study, we have developed a hidden Markov model (HMM) to simulate a series of aCGH experiments with random DNA copy number alterations that are used to validate the performance of our normalization. In addition, we applied the proposed normalization algorithm to an aCGH study of lung cancer cell lines. By using the proposed algorithm, data quality and the reliability of experimental results are significantly improved, and distinct patterns of DNA copy number alterations are observed among those lung cancer cell lines. Source codes and figures may be found at http://ntumaps.cgm.ntu.edu.tw/aCGH_supplementary.
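A simulation of this kind can be sketched with a small hidden Markov chain over copy-number states and Gaussian noise on the log2 ratios. The three states, the "sticky" transition probability, the noise level and the function name `simulate_acgh` are assumptions for illustration; the abstract does not give the authors' HMM parameters.

```python
import numpy as np

def simulate_acgh(n_probes=2000, stay=0.99, noise_sd=0.15, seed=0):
    """Simulate probe-level log2 ratios from a 3-state copy-number Markov chain."""
    rng = np.random.default_rng(seed)
    copy_numbers = np.array([1, 2, 3])              # loss, neutral, gain (assumed states)
    means = np.log2(copy_numbers / 2)               # expected log2 ratio per state
    n_states = copy_numbers.size
    # Neighbouring probes usually share a state, so the diagonal dominates.
    trans = np.full((n_states, n_states), (1 - stay) / (n_states - 1))
    np.fill_diagonal(trans, stay)
    states = np.empty(n_probes, dtype=int)
    states[0] = 1                                   # start in the neutral state
    for i in range(1, n_probes):
        states[i] = rng.choice(n_states, p=trans[states[i - 1]])
    log2_ratio = means[states] + rng.normal(0, noise_sd, n_probes)
    return copy_numbers[states], log2_ratio

true_cn, log2_ratio = simulate_acgh()
n_breakpoints = int((np.diff(true_cn) != 0).sum())
```

Because the true copy number of every probe is known, a normalization method can be scored by how well the normalized ratios recover `true_cn`.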
Project description:NimbleGen arrays were used to perform aCGH analysis on two mouse strains, one that lacks the Ly49 gene cluster and the parent strain that still has the cluster. A targeted knock-out of the cluster resulted in the rearrangement and deletion of the cluster. The aCGH experiment was performed to elucidate the extent of the deletion. High-resolution aCGH analysis of the Ly49 cluster in the deletion strain.
Project description:There are many species of environmental mycobacteria (EM) that infect animals that are important to the economy and research and that also have zoonotic potential. The genomes of very few of these bacterial species have been sequenced, and little is known about the molecular mechanisms by which most of these opportunistic pathogens cause disease. In this study, 18 isolates of EM isolated from fish and humans (including strains of Mycobacterium avium, Mycobacterium peregrinum, Mycobacterium chelonae, and Mycobacterium salmoniphilum) were examined for their abilities to grow in macrophage lines from humans, mice, and carp. Genomic DNA from 14 of these isolates was then hybridized against DNA from an M. avium reference strain, with a custom microarray containing virulence genes of mycobacteria and a selection of representative genes from metabolic pathways. The strains of EM had different abilities to grow within the three types of cell lines, which grouped largely according to the host from which they were isolated. Genes identified as being putatively absent in some of the strains included those with response regulatory functions, cell wall compositions, and fatty acid metabolisms as well as a recently identified pathogenicity island important to macrophage uptake. Further understanding of the role these genes play in host specificity and pathogenicity will be important to gain insight into the zoonotic potential of certain EM as well as their mechanisms of virulence.