Project description:Background: Genome-wide methylation profiling has led to more comprehensive insights into gene regulation mechanisms and potential therapeutic targets. The Illumina Human Methylation BeadChip is one of the most commonly used genome-wide methylation platforms. As in other microarray experiments, methylation data are susceptible to various technical artifacts, particularly batch effects. To date, little attention has been given to issues related to normalization and batch effect correction for this kind of data. Methods: We evaluated three common normalization approaches and investigated their performance in batch effect removal using three datasets with different degrees of batch effects generated on the HumanMethylation27 platform: quantile normalization at average β value (QNβ); two-step quantile normalization at probe signals implemented in the "lumi" package of R (lumi); and quantile normalization of A and B signals separately (ABnorm). Subsequent Empirical Bayes (EB) batch adjustment was also evaluated. Results: Each normalization method could remove a portion of the batch effects, and their effectiveness differed depending on the severity of batch effects in a dataset. For the dataset with minor batch effects (Dataset 1), normalization alone appeared adequate and "lumi" showed the best performance. However, all methods left substantial batch effects intact in the datasets with obvious batch effects, and further correction was necessary. Without any correction, 50 and 66 percent of CpGs were associated with batch effects in Datasets 2 and 3, respectively. After QNβ, lumi or ABnorm, the percentage of CpGs associated with batch effects was reduced to 24, 32, and 26 percent for Dataset 2, and 37, 46, and 35 percent for Dataset 3, respectively. Additional EB correction effectively removed such remaining non-biological effects.
More importantly, the two-step procedure (normalization followed by EB correction) almost tripled the number of CpGs associated with the outcome of interest for the two datasets. Conclusion: Genome-wide methylation data from the Infinium Methylation BeadChip can be susceptible to batch effects with profound impacts on downstream analyses and conclusions. Normalization can remove part but not all of the batch effects. EB correction along with normalization is recommended for effective batch effect removal.
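As an illustration of the simplest of the normalization approaches named above, the following is a minimal sketch of quantile normalization applied directly to a matrix of β values. It is not the "lumi" implementation and not the authors' code; the function name `quantile_normalize` and the toy matrix are hypothetical, and ties are broken by position, which is a simplification.

```python
import numpy as np

def quantile_normalize(values):
    """Give every column (sample) the same empirical distribution.

    values: 2-D array, rows = CpG probes, columns = samples.
    This is a generic sketch, not the procedure used in any specific package.
    """
    values = np.asarray(values, dtype=float)
    # Sorting permutation of each column (stable, so ties keep input order).
    order = np.argsort(values, axis=0, kind="stable")
    sorted_vals = np.take_along_axis(values, order, axis=0)
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = sorted_vals.mean(axis=1)
    normalized = np.empty_like(values)
    # Write the reference value for rank k back to where rank k sat in each column.
    np.put_along_axis(normalized, order, reference[:, None], axis=0)
    return normalized

# Toy β matrix: 3 probes x 2 samples (made-up numbers for illustration only).
beta = np.array([[0.1, 0.5],
                 [0.9, 0.3],
                 [0.4, 0.8]])
normalized_beta = quantile_normalize(beta)
```

After normalization, every sample has the same set of values and only the ranking of probes within each sample is preserved, which is why a residual batch effect that changes probe ranks can survive this step.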
Project description:Motivation: Controlling for tumor purity in molecular analyses is essential to allow for reliable genomic aberration calls, for inter-sample comparison and to monitor heterogeneity of cancer cell populations. In genome-wide screening studies, the assessment of tumor purity is typically performed by means of computational methods that exploit somatic copy number aberrations. Results: We present a strategy, called Purity Assessment from clonal MEthylation Sites (PAMES), which uses the methylation level of a few dozen highly clonal, tumor-type-specific CpG sites to estimate the purity of tumor samples, without the need for a matched benign control. We trained and validated our method on more than 6000 samples from different datasets. Purity estimates by PAMES were highly concordant with those of other state-of-the-art tools, and its evaluation in a cancer cell line dataset highlighted its reliability in accurately estimating tumor admixtures. We extended the capability of PAMES to the analysis of CpG islands instead of the more platform-specific CpG sites and demonstrated its accuracy in a set of advanced tumors profiled by high-throughput DNA methylation sequencing. These analyses show that PAMES is a valuable tool to assess the purity of tumor samples in the settings of clinical research and diagnostics. Availability and implementation: https://github.com/cgplab/PAMES. Contact: matteo.benelli@uslcentro.toscana.it or f.demichelis@unitn.it. Supplementary information: Supplementary data are available at Bioinformatics online.
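The intuition behind estimating purity from clonal methylation sites can be sketched as follows. This is an illustrative assumption, not the PAMES code or its API: if a site is fully methylated in every tumor cell and unmethylated in normal cells, the observed β value is roughly the tumor fraction, and for the opposite pattern 1 − β plays that role. The function name `estimate_purity`, the use of a median, and the site indices are all hypothetical.

```python
import numpy as np

def estimate_purity(beta, hyper_idx, hypo_idx):
    """Rough purity estimate for one tumor sample (illustrative sketch only).

    beta:      1-D array of β values for the sample.
    hyper_idx: indices of sites assumed methylated in tumor, unmethylated in normal.
    hypo_idx:  indices of sites assumed unmethylated in tumor, methylated in normal.
    """
    beta = np.asarray(beta, dtype=float)
    # Under the stated assumption, β at hyper sites and 1 - β at hypo sites
    # both approximate the fraction of tumor cells in the sample.
    signal = np.concatenate([beta[hyper_idx], 1.0 - beta[hypo_idx]])
    # The median limits the influence of a few badly behaved sites; NaNs are ignored.
    return float(np.nanmedian(signal))

# Made-up β values: three "hyper" sites followed by two "hypo" sites.
sample_beta = [0.70, 0.80, 0.75, 0.20, 0.30]
purity = estimate_purity(sample_beta, hyper_idx=[0, 1, 2], hypo_idx=[3, 4])
```

The quality of such an estimate depends entirely on how well the chosen sites satisfy the clonality assumption, which is why the abstract stresses tumor-type-specific site selection.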
Project description:Motivation: The recently released Infinium HumanMethylation450 array (the '450k' array) provides a high-throughput assay to quantify DNA methylation (DNAm) at ∼450 000 loci across a range of genomic features. Although less comprehensive than high-throughput sequencing-based techniques, this product is more cost-effective and promises to be the most widely used DNAm high-throughput measurement technology over the next several years. Results: Here we describe a suite of computational tools that incorporate state-of-the-art statistical techniques for the analysis of DNAm data. The software is structured to easily adapt to future versions of the technology. We include methods for preprocessing, quality assessment and detection of differentially methylated regions from the kilobase to the megabase scale. We show how our software provides a powerful and flexible development platform for future methods. We also illustrate how our methods empower the technology to make discoveries previously thought to be possible only with sequencing-based methods. Availability and implementation: http://bioconductor.org/packages/release/bioc/html/minfi.html. Contact: khansen@jhsph.edu; rafa@jimmy.harvard.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Project description:The purity of tissue samples can affect the accuracy and utility of DNA methylation array analyses. This is particularly important for the placenta which is globally hypomethylated compared to other tissues. Placental villous tissue from early pregnancy terminations can be difficult to separate from non-villous tissue, resulting in potentially inaccurate results. We used several methods to identify mixed placenta samples using DNA methylation array datasets from our laboratory and those contained in the NCBI GEO database, highlighting the importance of determining sample purity during quality control processes.
Project description:Solid tissues collected in patient-driven clinical settings are composed of both normal and cancer cells, which often complicates data analysis and the interpretation of epigenetic findings. Estimating sample purity is therefore crucial for reliable identification of genomic aberrations and for uniform inter-sample and inter-patient comparisons. Here, an effective and flexible method has been developed to estimate the level of methylation, from which tumor purity is inferred without prior knowledge from other datasets. A comprehensive analysis of our approach on Illumina Infinium 450k methylation microarray data from TCGA Breast Cancer shows improved performance for purity assessment. The resulting purity estimates correlate strongly with those of other advanced methods.
Project description:The proportion of cancer cells in a tumor sample, known as tumor purity, is an intrinsic property of tumor samples and can strongly influence a variety of analyses, including differential methylation, subclonal deconvolution and subtype clustering. InfiniumPurify is an integrated R package for estimating and accounting for tumor purity based on DNA methylation Infinium 450k array data. InfiniumPurify has three main functions, getPurity, InfiniumDMC and InfiniumClust, which respectively infer tumor purity, perform differential methylation analysis and cluster tumor samples, the latter two accounting for estimated or user-provided tumor purities. The InfiniumPurify package provides a comprehensive analysis of tumor purity in cancer methylation research.
Project description:Motivation: DNA methylation signatures in rheumatoid arthritis (RA) have been identified in fibroblast-like synoviocytes (FLS) with the Illumina HumanMethylation450 array. Since <2% of CpG sites are covered by the Illumina 450K array and whole genome bisulfite sequencing is still too expensive for many samples, computationally predicting DNA methylation levels based on 450K data would be valuable to discover more RA-related genes. Results: We developed a computational model that is trained on 14 tissues with both whole genome bisulfite sequencing and 450K array data. This model integrates information derived from the similarity of local methylation patterns between tissues, the methylation information of flanking CpG sites and the methylation tendency of flanking DNA sequences. The predicted and measured methylation values were highly correlated, with a Pearson correlation coefficient of 0.9 in leave-one-tissue-out cross-validations. Importantly, the majority (76%) of the top 10% differentially methylated loci among the 14 tissues were correctly detected using the predicted methylation values. Applying this model to 450K data of RA, osteoarthritis and normal FLS, we successfully expanded the coverage of CpG sites 18.5-fold, so that it accounts for about 30% of all the CpGs in the human genome. By integrative omics study, we identified genes and pathways tightly related to RA pathogenesis, among which 12 genes were supported by three lines of evidence, including 6 genes already known to perform specific roles in RA and 6 genes as new potential therapeutic targets. Availability and implementation: The source code, required data for prediction, and demo data for testing are freely available at: http://wanglab.ucsd.edu/star/LR450K/ Contact: wei-wang@ucsd.edu or gfirestein@ucsd.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Project description:Formalin-fixed, paraffin-embedded (FFPE) samples are a highly desirable resource for epigenetic studies, but there is no suitable platform to assay genome-wide methylation in these widely available resources. Recently, Thirlwell et al. (2010) have reported a modified ligation-based DNA repair protocol to prepare FFPE DNA for the Infinium methylation assay. In this study, we have tested the accuracy of methylation data obtained with this modification by comparing paired fresh-frozen (FF) and FFPE colon tissue (normal and tumor) from colorectal cancer patients. We report locus-specific correlation and concordance of tumor-specific differentially methylated loci (DML), neither of which was previously assessed. We used Illumina's Infinium Methylation 27K chip for 12 pairs of FF and 12 pairs of FFPE tissue from tumor and surrounding healthy tissue from the resected colon of the same individual, after repairing the FFPE DNA using Thirlwell's modified protocol. For both tumor and normal tissue, overall correlation of β values across all loci in paired FF and FFPE samples was comparable to previous studies. Tissue storage type (FF or FFPE), rather than tissue type (normal or tumor), was found to be the most significant source of variation. We found a large number of DML between FF and FFPE DNA. Using ANOVA, we also identified DML in tumor compared to normal tissue in both FF and FFPE samples, and out of the top 50 loci in both groups only 7 were common, indicating poor concordance. Likewise, when looking at the correlation of individual loci between FFPE and FF across the patients, less than 10% of loci showed strong correlation (r ≥ 0.6).
Finally, we checked the effect of the ligation-based modification on the Infinium chemistry for SNP genotyping on an independent set of samples, which also showed poor performance. Ligation of FFPE DNA prior to the Infinium genome-wide methylation assay may detect a reasonable number of loci, but the number of detected loci is much lower than in FF samples. More importantly, the concordance of DML detected between FF and FFPE DNA is suboptimal, and DML from FFPE tissues should be interpreted with great caution.
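The locus-specific comparison described above can be sketched as a per-locus Pearson correlation across paired samples, followed by counting loci above a cutoff. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names are hypothetical, the 0.6 cutoff is treated as inclusive, and the toy matrices are made up.

```python
import numpy as np

def per_locus_pearson(ff, ffpe):
    """Pearson r for each locus (row) across paired samples (columns)."""
    ff = np.asarray(ff, dtype=float)
    ffpe = np.asarray(ffpe, dtype=float)
    ff_c = ff - ff.mean(axis=1, keepdims=True)        # center each locus
    ffpe_c = ffpe - ffpe.mean(axis=1, keepdims=True)
    num = (ff_c * ffpe_c).sum(axis=1)
    den = np.sqrt((ff_c ** 2).sum(axis=1) * (ffpe_c ** 2).sum(axis=1))
    # Loci with zero variance in either group have an undefined r (NaN).
    with np.errstate(invalid="ignore", divide="ignore"):
        return num / den

def fraction_strong(r, threshold=0.6):
    """Share of loci with a defined r at or above the threshold."""
    r = np.asarray(r, dtype=float)
    valid = ~np.isnan(r)
    return float((r[valid] >= threshold).mean())

# Toy β values: 2 loci x 3 patients, FF and FFPE measured on the same patients.
ff = [[0.1, 0.2, 0.3],
      [0.5, 0.4, 0.9]]
ffpe = [[0.2, 0.4, 0.6],
        [0.9, 0.5, 0.1]]
r = per_locus_pearson(ff, ffpe)
share = fraction_strong(r)
```

A high overall correlation across all loci within a sample can coexist with a low share of strongly correlated individual loci, because the former is dominated by the large between-locus spread of β values.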
Project description:The aims were to profile DNA methylation in colorectal cancer (CRC) and to explore cancer-specific methylation biomarkers. Fifty-four pairs of CRCs and the adjacent normal tissues were subjected to the Infinium Human Methylation 450K assay and analysed using the ChAMP R package. A total of 26,093 differentially methylated probes were identified, which represent 6156 genes; 650 probes were hypermethylated, and 25,443 were hypomethylated. Hypermethylated sites were common in CpG islands, while hypomethylated sites were in open sea regions. Most of the hypermethylated genes were associated with pathways in cancer, while the hypomethylated genes were involved in the PI3K-AKT signalling pathway. Among the identified differentially methylated probes, we found four candidate probes in CRCs versus adjacent normal tissue (HOXA2 cg06786372, OPLAH cg17301223, cg15638338, and TRIM31 cg02583465) that could serve as new biomarkers in CRC, since these probes were aberrantly methylated in CRC and are involved in the progression of CRC. Furthermore, the integrated analysis revealed the potential of promoter methylation at ADHFE1 cg18065361 for differentiating CRC from normal colonic tissue. In conclusion, aberrant DNA methylation is significantly involved in CRC pathogenesis and is associated with gene silencing. This study reports several potentially important methylated genes in CRC, which therefore merit further validation as novel candidate biomarker genes in CRC.
Project description:Background: The capacity of technologies measuring DNA methylation (DNAm) is rapidly evolving, as are the options for applicable bioinformatics methods. The most commonly used DNAm microarray, the Illumina Infinium HumanMethylation450 (450K array), has recently been replaced by the Illumina Infinium HumanMethylationEPIC (EPIC array), nearly doubling the number of targeted CpG sites. Given that a subset of 450K CpG sites is absent on the EPIC array and that several tools for both data normalization and analyses were developed on the 450K array, it is important to assess their utility when applied to EPIC array data. One of the most commonly used 450K tools is the pan-tissue epigenetic clock, a multivariate predictor of biological age based on DNAm at 353 CpG sites. Of these CpGs, 19 are missing from the EPIC array, thus raising the question of whether EPIC data can be used to accurately estimate DNAm age. We also investigated a 71-CpG epigenetic age predictor, referred to as the Hannum method, which lacks 6 probes on the EPIC array. To evaluate these epigenetic clocks in EPIC data properly, a prior assessment of the effects of data preprocessing methods on DNAm age is also required. Methods: DNAm was quantified, on both the 450K and EPIC platforms, from human primary monocytes derived from 172 individuals. We calculated DNAm age from raw data and from three different preprocessed forms of the data to assess the effects of different processing methods on the DNAm age estimate. Using an additional cohort, we also investigated DNAm age of peripheral blood mononuclear cells, bronchoalveolar lavage, and bronchial brushing samples using the EPIC array. Results: Using monocyte-derived data from subjects profiled on both the 450K and EPIC arrays, we found that DNAm age was highly correlated for both raw and preprocessed data (r > 0.91). Thus, the correlation between chronological age and the DNAm age estimate is largely unaffected by platform differences and normalization methods.
However, we found that the choice of normalization method and measurement platform can lead to a systematic offset in the age estimate, which in turn leads to an increase in the median error. Comparing the 450K and EPIC DNAm age estimates, we observed that the median absolute difference was 1.44-3.10 years across preprocessing methods. Conclusions: Here, we have provided evidence that the epigenetic clock is robust to the absence of the 19 CpG sites missing from the EPIC array, and we have highlighted the importance of considering the technical variance of the epigenetic clock when interpreting group differences below the reported error. Furthermore, our study highlights the utility of the epigenetic age acceleration measure, the residuals from a linear regression of DNAm age on chronological age, as the resulting values are robust with respect to normalization methods and measurement platforms.
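The age acceleration measure defined above, the residuals from a linear regression of DNAm age on chronological age, can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `age_acceleration` and the toy ages are hypothetical, and an ordinary least-squares fit with `numpy.polyfit` is assumed.

```python
import numpy as np

def age_acceleration(dnam_age, chronological_age):
    """Residuals of an ordinary least-squares fit of DNAm age on chronological age."""
    dnam_age = np.asarray(dnam_age, dtype=float)
    chronological_age = np.asarray(chronological_age, dtype=float)
    # Fit dnam_age ≈ slope * chronological_age + intercept.
    slope, intercept = np.polyfit(chronological_age, dnam_age, deg=1)
    # Positive residual: predicted DNAm age is higher than expected for that age.
    return dnam_age - (slope * chronological_age + intercept)

# Made-up ages for four individuals.
chron = np.array([30.0, 40.0, 50.0, 60.0])
dnam = np.array([33.0, 41.0, 55.0, 61.0])

acc = age_acceleration(dnam, chron)
# A constant offset of the kind a platform or normalization change might add
# is absorbed by the intercept, so the residuals do not change.
acc_shifted = age_acceleration(dnam + 2.5, chron)
```

Because the intercept absorbs any constant shift and the slope absorbs any linear rescaling, the residuals are unaffected by a systematic offset in the age estimate, which is consistent with the robustness reported in the abstract.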