Dataset Information

Transcriptome prediction performance across machine learning models and diverse ancestries.

ABSTRACT: Transcriptome prediction methods such as PrediXcan and FUSION have become popular in complex trait mapping. Most transcriptome prediction models have been trained in European populations using methods that make parametric linear assumptions like the elastic net (EN). To potentially further optimize imputation performance of gene expression across global populations, we built transcriptome prediction models using both linear and non-linear machine learning (ML) algorithms and evaluated their performance in comparison to EN. We trained models using genotype and blood monocyte transcriptome data from the Multi-Ethnic Study of Atherosclerosis (MESA) comprising individuals of African, Hispanic, and European ancestries and tested them using genotype and whole-blood transcriptome data from the Modeling the Epidemiology Transition Study (METS) comprising individuals of African ancestries. We show that the prediction performance is highest when the training and the testing population share similar ancestries regardless of the prediction algorithm used. While EN generally outperformed random forest (RF), support vector regression (SVR), and K nearest neighbor (KNN), we found that RF outperformed EN for some genes, particularly between disparate ancestries, suggesting potential robustness and reduced variability of RF imputation performance across global populations. When applied to a high-density lipoprotein (HDL) phenotype, we show that including RF prediction models in PrediXcan revealed potential gene associations missed by EN models. Therefore, by integrating other ML modeling into PrediXcan and diversifying our training populations to include more global ancestries, we may uncover new genes associated with complex traits.

SUBMITTER: Okoro PC

PROVIDER: S-EPMC8087249 | biostudies-literature | 2021 Apr

REPOSITORIES: biostudies-literature

Publications

Transcriptome prediction performance across machine learning models and diverse ancestries.

Okoro PC, Schubert R, Guo X, Johnson WC, Rotter JI, Hoeschele I, Liu Y, Im HK, Luke A, Dugas LR, Wheeler HE

HGG Advances 2021 Jan 5; 2

PMID: 33937878

Similar Datasets

Project description: Background: The Society of Thoracic Surgeons and European System for Cardiac Operative Risk Evaluation (EuroSCORE) II risk scores are the most commonly used risk prediction models for in-hospital mortality after adult cardiac surgery. However, they are prone to miscalibration over time and poor generalization across data sets; thus, their use remains controversial. Despite increased interest, a gap in understanding the effect of data set drift on the performance of machine learning (ML) over time remains a barrier to its wider use in clinical practice. Data set drift occurs when an ML system underperforms because of a mismatch between the data it was developed from and the data on which it is deployed. Objective: In this study, we analyzed the extent of performance drift using models built on a large UK cardiac surgery database. The objectives were to (1) rank and assess the extent of performance drift in cardiac surgery risk ML models over time and (2) investigate any potential influence of data set drift and variable importance drift on performance drift. Methods: We conducted a retrospective analysis of prospectively, routinely gathered data on adult patients undergoing cardiac surgery in the United Kingdom between 2012 and 2019. We temporally split the data 70:30 into a training and validation set and a holdout set. Five novel ML mortality prediction models were developed and assessed, along with EuroSCORE II, for relationships between and within variable importance drift, performance drift, and actual data set drift. Performance was assessed using a consensus metric. Results: A total of 227,087 adults underwent cardiac surgery during the study period, with a mortality rate of 2.76% (n=6258). There was strong evidence of a decrease in overall performance across all models (P<.0001). Extreme gradient boosting (clinical effectiveness metric [CEM] 0.728, 95% CI 0.728-0.729) and random forest (CEM 0.727, 95% CI 0.727-0.728) were the overall best-performing models, both temporally and nontemporally. EuroSCORE II performed the worst across all comparisons. Sharp changes in variable importance and data set drift from October to December 2017, from June to July 2018, and from December 2018 to February 2019 mirrored the effects of performance decrease across models. Conclusions: All models show a decrease in at least 3 of the 5 individual metrics. CEM and variable importance drift detection demonstrate the limitation of logistic regression methods used for cardiac surgery risk prediction and the effects of data set drift. Future work will be required to determine the interplay between ML models and whether ensemble models could improve on their respective performance advantages.
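The temporal 70:30 split mentioned in the description above can be illustrated with a minimal Python sketch. The record layout, the `temporal_split` helper, and the toy data are assumptions made for illustration only; the study's actual splitting code is not given in this record.

```python
from datetime import date


def temporal_split(records, train_percent=70):
    """Sort records by date, then cut so the holdout set is strictly later.

    Integer arithmetic is used for the cut point to avoid float rounding.
    """
    ordered = sorted(records, key=lambda r: r["date"])
    cut = len(ordered) * train_percent // 100
    return ordered[:cut], ordered[cut:]


# Hypothetical toy records: one operation per month, Jan-Oct 2012.
records = [{"date": date(2012, month, 1), "id": month} for month in range(10, 0, -1)]

train, holdout = temporal_split(records)
print(len(train), len(holdout))
```

Unlike a random split, every holdout record here is later than every training record, which is the property that lets performance drift over time be observed.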

Project description: Nitrogen is the most limiting nutrient for turfgrass growth. Instead of pursuing the maximum yield, most turfgrass managers use nitrogen (N) to maintain a sub-maximal growth rate. Few tools or soil tests exist to help managers guide N fertilizer decisions. Turf growth prediction models have the potential to be useful, but the currently existing turf growth prediction model only takes temperature into account, limiting its accuracy. This study developed machine-learning-based turf growth models using the random forest (RF) algorithm to estimate short-term turfgrass clipping yield. To build the RF model, a large set of variables were extracted as predictors including the 7-day weather, traffic intensity, soil moisture content, N fertilization rate, and the normalized difference red edge (NDRE) vegetation index. In this study, the data were collected from two putting greens where the turfgrass received traffic of 0 to 1,800 rounds/week, various irrigation rates to maintain the soil moisture content between 9 and 29%, and N fertilization rates of 0 to 17.5 kg ha⁻¹ applied biweekly. The RF model agreed with the actual clipping yield collected from the experimental results. The temperature and relative humidity were the most important weather factors. Including NDRE improved the prediction accuracy of the model. The highest coefficient of determination (R²) of the RF model was 0.64 for the training data set and 0.47 for the testing data set. This represented a large improvement over the existing growth prediction model (R² = 0.01). However, the machine-learning models created were not able to accurately predict the clipping production at other locations. Individual golf courses can create customized growth prediction models using clipping volume to eliminate the deviation caused by temporal and spatial variability. Overall, this study demonstrated the feasibility of creating machine-learning-based yield prediction models that may be able to guide N fertilization decisions on golf course putting greens and presumably other turfgrass areas.
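For readers unfamiliar with the coefficient of determination reported above, the following Python sketch computes it as one minus the ratio of residual to total sum of squares. This definition is an assumption: the description does not say which R² variant the study used, and the function name and toy numbers are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical clipping yields (arbitrary units), not data from the study.
observed = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(observed, observed)            # model reproduces the data
baseline = r_squared(observed, [2.5, 2.5, 2.5, 2.5])  # model predicts the mean
print(perfect, baseline)
```

Under this definition a model that always predicts the mean scores 0, so a value near 0.01 indicates almost no explanatory power.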

Project description: Background: Accurate and reliable predictions of infectious disease can be valuable to public health organizations that plan interventions to decrease or prevent disease transmission. A great variety of models have been developed for this task. However, for different data series, the performance of these models varies. Hepatitis E, as an acute liver disease, has been a major public health problem. Which model is more appropriate for predicting the incidence of hepatitis E? In this paper, three different methods are used and their performance is compared. Methods: Autoregressive integrated moving average (ARIMA), support vector machine (SVM), and long short-term memory (LSTM) recurrent neural network were adopted and compared. ARIMA was implemented in Python with the help of statsmodels. SVM was implemented in MATLAB with the libSVM library. LSTM was designed by ourselves with Keras, a deep learning library. To tackle the problem of overfitting caused by limited training samples, we adopted dropout and regularization strategies in our LSTM model. Experimental data were obtained from the monthly incidence and case number of hepatitis E from January 2005 to December 2017 in Shandong province, China. We selected data from July 2015 to December 2017 to validate the models, and the rest was taken as the training set. Three metrics were applied to compare the performance of models: root mean square error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE). Results: By analyzing the data, we took ARIMA(1, 1, 1) and ARIMA(3, 1, 2) as the monthly incidence prediction model and the case-number prediction model, respectively. Cross-validation and grid search were used to optimize the parameters of SVM. Penalty coefficient C and kernel function parameter g were set to 8 and 0.125 for incidence prediction, and to 22 and 0.01 for case-number prediction. LSTM has 4 nodes. Dropout and L2 regularization parameters were set to 0.15 and 0.001, respectively. By RMSE, ARIMA, SVM, and LSTM obtained 0.022, 0.0204, and 0.01, respectively, for incidence prediction, and 22.25, 20.0368, and 11.75 for case-number prediction. For MAPE, the results were 23.5%, 21.7%, and 15.08% for incidence prediction, and 23.6%, 21.44%, and 13.6% for case-number prediction. For MAE, the results were 0.018, 0.0167, and 0.011 for incidence prediction, and 18.003, 16.5815, and 9.984 for case-number prediction. Conclusions: Comparing ARIMA, SVM, and LSTM, we found that nonlinear models (SVM, LSTM) outperform linear models (ARIMA). LSTM obtained the best performance in all three metrics of RMSE, MAPE, and MAE. Hence, LSTM is the most suitable for predicting hepatitis E monthly incidence and case number.
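The three error metrics named above follow standard definitions, sketched here in plain Python. The function names and the toy series are hypothetical and are not taken from the hepatitis E study; MAPE is expressed as a percentage and assumes no observed value is zero.

```python
import math


def rmse(observed, predicted):
    """Root mean square error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def mape(observed, predicted):
    """Mean absolute percentage error, in percent (observed values must be non-zero)."""
    n = len(observed)
    return 100.0 * sum(abs((o - p) / o) for o, p in zip(observed, predicted)) / n


# Hypothetical monthly case counts and model predictions.
observed = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 40.0]

print(rmse(observed, predicted), mae(observed, predicted), mape(observed, predicted))
```

RMSE penalizes large errors more heavily than MAE, while MAPE is scale-free, which is why the incidence and case-number results above differ by orders of magnitude in RMSE and MAE but are comparable in MAPE.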
