Dataset Information

MethyLImp2: faster missing value estimation for DNA methylation data.

ABSTRACT:

Motivation

methyLImp, a method we recently introduced for the missing value estimation of DNA methylation data, has demonstrated competitive performance in data imputation compared to the existing, general-purpose, approaches. However, imputation running time was considerably long and unfeasible in case of large datasets with numerous missing values.

Results

methyLImp2 made possible computations that were previously unfeasible. We achieved this by introducing two important modifications that have significantly reduced the original running time without sacrificing prediction performance. First, we implemented a chromosome-wise parallel version of methyLImp. This parallelization reduced the runtime by several 10-fold in our experiments. Then, to handle large datasets, we also introduced a mini-batch approach that uses only a subset of the samples for the imputation. Thus, it further reduces the running time from days to hours or even minutes in large datasets.

Availability and implementation

The R package methyLImp2 is under review for Bioconductor. It is currently freely available on Github https://github.com/annaplaksienko/methyLImp2.

SUBMITTER: Plaksienko A

PROVIDER: S-EPMC10826905 | biostudies-literature | 2024 Jan

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

methyLImp2: faster missing value estimation for DNA methylation data.

Plaksienko Anna A Di Lena Pietro P Nardini Christine C Angelini Claudia C

Bioinformatics (Oxford, England) 20240101 1

<h4>Motivation</h4>methyLImp, a method we recently introduced for the missing value estimation of DNA methylation data, has demonstrated competitive performance in data imputation compared to the existing, general-purpose, approaches. However, imputation running time was considerably long and unfeasible in case of large datasets with numerous missing values.<h4>Results</h4>methyLImp2 made possible computations that were previously unfeasible. We achieved this by introducing two important modific ...[more]

PMID: 38213002

Similar Datasets

Project description:BackgroundThe COVID-19 pandemic has revealed significant challenges in disease forecasting and in developing a public health response, emphasizing the need to manage missing data from various sources in making accurate forecasts.ObjectiveWe aimed to show how handling missing data can affect estimates of the COVID-19 incidence rate (CIR) in different pandemic situations.MethodsThis study used data from the COVID-19/SARS-CoV-2 surveillance system at the National Institute of Hygiene and Epidemiology, Vietnam. We separated the available data set into 3 distinct periods: zero COVID-19, transition, and new normal. We randomly removed 5% to 30% of data that were missing completely at random, with a break of 5% at each time point in the variable daily caseload of COVID-19. We selected 7 analytical methods to assess the effects of handling missing data and calculated statistical and epidemiological indices to measure the effectiveness of each method.ResultsOur study examined missing data imputation performance across 3 study time periods: zero COVID-19 (n=3149), transition (n=1290), and new normal (n=9288). Imputation analyses showed that K-nearest neighbor (KNN) had the lowest mean absolute percentage change (APC) in CIR across the range (5% to 30%) of missing data. For instance, with 15% missing data, KNN resulted in 10.6%, 10.6%, and 9.7% average bias across the zero COVID-19, transition, and new normal periods, compared to 39.9%, 51.9%, and 289.7% with the maximum likelihood method. The autoregressive integrated moving average model showed the greatest mean APC in the mean number of confirmed cases of COVID-19 during each COVID-19 containment cycle (CCC) when we imputed the missing data in the zero COVID-19 period, rising from 226.3% at the 5% missing level to 6955.7% at the 30% missing level. Imputing missing data with median imputation methods had the lowest bias in the average number of confirmed cases in each CCC at all levels of missing data. In detail, in the 20% missing scenario, while median imputation had an average bias of 16.3% for confirmed cases in each CCC, which was lower than the KNN figure, maximum likelihood imputation showed a bias on average of 92.4% for confirmed cases in each CCC, which was the highest figure. During the new normal period in the 25% and 30% missing data scenarios, KNN imputation had average biases for CIR and confirmed cases in each CCC ranging from 21% to 32% for both, while maximum likelihood and moving average imputation showed biases on average above 250% for both CIR and confirmed cases in each CCC.ConclusionsOur study emphasizes the importance of understanding that the specific imputation method used by investigators should be tailored to the specific epidemiological context and data collection environment to ensure reliable estimates of the CIR.