Project description: Risk prediction plays an important role in clinical cardiology research. Traditionally, most risk models have been based on regression models. While useful and robust, these statistical methods are limited to using a small number of predictors which operate in the same way on everyone, and uniformly throughout their range. The purpose of this review is to illustrate the use of machine-learning methods for the development of risk prediction models. Typically presented as black-box approaches, most machine-learning methods are aimed at solving particular challenges that arise in data analysis and that are not well addressed by typical regression approaches. To illustrate these challenges, as well as how different methods can address them, we consider trying to predict mortality after diagnosis of acute myocardial infarction. We use data derived from our institution's electronic health record and abstract data on 13 regularly measured laboratory markers. We walk through different challenges that arise in modelling these data and then introduce different machine-learning approaches. Finally, we discuss general issues in the application of machine-learning methods, including tuning parameters, loss functions, variable importance, and missing data. Overall, this review serves as an introduction for those working on risk modelling to the diffuse field of machine learning.
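As a minimal illustration of the kind of model such a review might walk through, the Python sketch below fits a gradient-boosted classifier to synthetic stand-ins for the 13 laboratory markers and reports a test AUC and variable importances; the data, tuning values, and outcome are hypothetical and not taken from the study.

```python
# Minimal sketch (not the review's actual pipeline): a tree-based risk model
# on synthetic stand-ins for 13 laboratory markers, with variable importance.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, p = 2000, 13                                  # patients x laboratory markers
X = rng.normal(size=(n, p))                      # synthetic marker values
logit = 0.8 * X[:, 0] - 0.5 * X[:, 1] ** 2 + 0.3 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))    # synthetic mortality label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                   learning_rate=0.05)  # example tuning parameters
model.fit(X_tr, y_tr)

print("test AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
print("variable importance:", model.feature_importances_.round(3))
```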
Project description: Citizen science is mainstream: millions of people contribute data to a growing array of citizen science projects annually, forming massive datasets that will drive research for years to come. Many citizen science projects implement a "leaderboard" framework, ranking contributions by the number of records or species to encourage further participation. But is every data point equally "valuable"? Citizen scientists collect data with distinct spatial and temporal biases, leading to unfortunate gaps and redundancies, which create statistical and informational problems for downstream analyses. Up to this point, the haphazard structure of the data has been seen as an unfortunate but unchangeable aspect of citizen science data. However, we argue here that this issue can actually be addressed: we provide a very simple, tractable framework that could be adapted by broad-scale citizen science projects to allow citizen scientists to optimize the marginal value of their efforts, increasing the overall collective knowledge.
Project description: The use of imaging systems in protein crystallisation means that experimental setups no longer require manual inspection to determine the outcome of the trials. However, it leads to the problem of how best to find images which contain useful information about the crystallisation experiments. The adoption of a deep-learning approach in 2018 enabled a four-class machine classification system for the images to exceed human accuracy for the first time. Underpinning this was the creation of a labelled training set which came from a consortium of several different laboratories. The MARCO classification model does not have the same accuracy on local data as it does on images from the original test set; this can be somewhat mitigated by retraining the ML model with local images included. We have characterised the image data used in the original MARCO model and performed extensive experiments to identify the training settings most likely to enhance the local performance of a MARCO-dataset-based ML classification model.
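For readers unfamiliar with what retraining on local images involves in practice, the Python/PyTorch sketch below fine-tunes a generic pretrained image classifier with a four-class head on a hypothetical local image folder; it is not the MARCO architecture or training recipe, and the directory name, backbone, and hyperparameters are assumptions for illustration.

```python
# Minimal sketch: fine-tuning a pretrained CNN on local four-class drop images.
# The "local_drops/" folder (one subfolder per class) is hypothetical.
import torch
from torch import nn
from torchvision import datasets, models, transforms

tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
data = datasets.ImageFolder("local_drops", transform=tfm)   # e.g. crystals/precipitate/clear/other
loader = torch.utils.data.DataLoader(data, batch_size=32, shuffle=True)

net = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # generic pretrained backbone
net.fc = nn.Linear(net.fc.in_features, 4)                       # four-class head
opt = torch.optim.Adam(net.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

net.train()
for epoch in range(3):                                          # short fine-tuning run
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(net(x), y)
        loss.backward()
        opt.step()
```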
Project description: Molecular "fingerprints" encoding structural information are the workhorse of cheminformatics and machine learning in drug discovery applications. However, fingerprint representations necessarily emphasize particular aspects of the molecular structure while ignoring others, rather than allowing the model to make data-driven decisions. We describe molecular graph convolutions, a machine learning architecture for learning from undirected graphs, specifically small molecules. Graph convolutions use a simple encoding of the molecular graph (atoms, bonds, distances, etc.), which allows the model to take greater advantage of information in the graph structure. Although graph convolutions do not outperform all fingerprint-based methods, they (along with other graph-based methods) represent a new paradigm in ligand-based virtual screening with exciting opportunities for future improvement.
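The core idea of a graph convolution, aggregating each atom's features with those of its bonded neighbours through a learned transformation, can be sketched in a few lines of Python; the toy adjacency matrix, atom features, and random weights below are illustrative only and do not reproduce the architecture described in the paper.

```python
# Minimal sketch of one graph-convolution step on a toy molecular graph
# (generic neighbour aggregation, not the paper's exact architecture).
import numpy as np

# Heavy atoms of propane, C-C-C: bond adjacency and made-up per-atom features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[6, 4, 1],                          # e.g. [atomic number, valence, ...]
              [6, 4, 2],
              [6, 4, 1]], dtype=float)

A_hat = A + np.eye(3)                             # add self-loops
D_inv = np.diag(1.0 / A_hat.sum(axis=1))          # normalise by degree

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 8))            # learnable weights (random here)

H = np.maximum(D_inv @ A_hat @ X @ W, 0.0)        # aggregate neighbours, then ReLU
print(H.shape)                                    # (3 atoms, 8 learned features)
```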
Project description: Background: Two-sample summary-data Mendelian randomization (MR) incorporating multiple genetic variants within a meta-analysis framework is a popular technique for assessing causality in epidemiology. If all genetic variants satisfy the instrumental variable (IV) and necessary modelling assumptions, then their individual ratio estimates of causal effect should be homogeneous. Observed heterogeneity signals that one or more of these assumptions could have been violated. Methods: Causal estimation and heterogeneity assessment in MR require an approximation for the variance, or equivalently the inverse-variance weight, of each ratio estimate. We show that the most popular 'first-order' weights can lead to an inflation in the chances of detecting heterogeneity when in fact it is not present. Conversely, ostensibly more accurate 'second-order' weights can dramatically increase the chances of failing to detect heterogeneity when it is truly present. We derive modified weights to mitigate both of these adverse effects. Results: Using Monte Carlo simulations, we show that the modified weights outperform first- and second-order weights in terms of heterogeneity quantification. Modified weights are also shown to remove the phenomenon of regression dilution bias in MR estimates obtained from weak instruments, unlike those obtained using first- and second-order weights. However, with small numbers of weak instruments, this comes at the cost of a reduction in estimate precision and power to detect a causal effect compared with first-order weighting. Moreover, first-order weights always furnish unbiased estimates and preserve the type I error rate under the causal null. We illustrate the utility of the new method using data from a recent two-sample summary-data MR analysis to assess the causal role of systolic blood pressure on coronary heart disease risk. Conclusions: We propose the use of modified weights within two-sample summary-data MR studies for accurately quantifying heterogeneity and detecting outliers in the presence of weak instruments. Modified weights also have an important role to play in terms of causal estimation (in tandem with first-order weights), but further research is required to understand their strengths and weaknesses in specific settings.
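To make the weighting schemes concrete, the Python sketch below computes per-variant ratio estimates, standard 'first-order' and 'second-order' inverse-variance weights, the corresponding IVW estimates, and Cochran's Q for heterogeneity on simulated summary data; the modified weights proposed in the study are not reproduced here, and all numbers are synthetic.

```python
# Minimal sketch (hypothetical summary data): ratio estimates, first- and
# second-order weights, IVW estimates, and Cochran's Q for heterogeneity.
import numpy as np

rng = np.random.default_rng(1)
J = 20
gamma = rng.normal(0.08, 0.02, J)                 # SNP-exposure estimates
se_g = np.full(J, 0.01)
Gamma = 0.4 * gamma + rng.normal(0, 0.02, J)      # SNP-outcome estimates (true beta = 0.4)
se_G = np.full(J, 0.02)

beta_j = Gamma / gamma                            # per-variant ratio estimates

var1 = se_G**2 / gamma**2                                  # 'first-order' variance
var2 = var1 + Gamma**2 * se_g**2 / gamma**4                # 'second-order' variance

for name, var in [("first-order", var1), ("second-order", var2)]:
    w = 1.0 / var
    beta_ivw = np.sum(w * beta_j) / np.sum(w)              # IVW causal estimate
    Q = np.sum(w * (beta_j - beta_ivw) ** 2)               # Cochran's Q (~chi2, J-1 df)
    print(f"{name}: beta_IVW = {beta_ivw:.3f}, Q = {Q:.1f}")
```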
Project description: Objectives: This study explored the prognostic factors and developed a prediction model for Chinese-American (CA) cervical cancer (CC) patients. We compared two alternative models (the restricted mean survival time (RMST) model and the proportional baselines landmark supermodel (PBLS model, producing dynamic predictions)) with the Cox proportional hazards model in the context of time-varying effects. Setting and data sources: A total of 713 CA women with CC and available covariates (age at diagnosis, International Federation of Gynecology and Obstetrics (FIGO) stage, lymph node metastasis and radiation) from the Surveillance, Epidemiology and End Results database were included. Design: We applied the Cox proportional hazards model to analyse all-cause mortality under the proportional hazards assumption. Additionally, we applied the two alternative models to analyse covariates with time-varying effects. The performances of the models were compared using the C-index for discrimination and the shrinkage slope for calibration. Results: Older patients had a worse survival rate than younger patients. Advanced FIGO stage patients showed a relatively poor survival rate and low life expectancy. Lymph node metastasis was an unfavourable prognostic factor in our models. Age at diagnosis, FIGO stage and lymph node metastasis showed time-varying effects in the PBLS model. Additionally, radiation showed no impact on survival in any model. Dynamic prediction performed better for 5-year dynamic death rates than did the Cox proportional hazards model. Conclusions: Given the time-varying effects, the RMST model is suggested for exploring prognostic factors, and the PBLS model is recommended for predicting a patient's w-year dynamic death rate.
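As a rough illustration of two of the quantities compared above, the Python sketch below fits a Cox proportional hazards model and computes a Kaplan-Meier-based restricted mean survival time with the lifelines package; it uses a bundled example dataset as a stand-in for the SEER cohort and does not implement the study's RMST regression or landmark supermodel.

```python
# Minimal sketch with lifelines: a Cox PH fit plus a Kaplan-Meier-based RMST
# up to a chosen horizon. The bundled Rossi data are a stand-in only.
from lifelines import CoxPHFitter, KaplanMeierFitter
from lifelines.datasets import load_rossi
from lifelines.utils import restricted_mean_survival_time

df = load_rossi()                                   # duration = 'week', event = 'arrest'

cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")
cph.print_summary()                                 # hazard ratios, concordance (C-index)

kmf = KaplanMeierFitter().fit(df["week"], df["arrest"])
print("RMST up to week 40:", restricted_mean_survival_time(kmf, t=40))
```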
Project description: Background: Missing data in electronic health records (EHRs) present significant challenges in medical studies. Many methods have been proposed, but uncertainty remains about the current state of missing-data methods applied to EHRs and which strategies perform better in specific contexts. Methods: All studies referencing EHRs and missing data methods published from inception until March 30, 2024 were searched via the MEDLINE, EMBASE, and Digital Bibliography and Library Project databases. The characteristics of the included studies were extracted. We also compared the performance of various methods under different missingness scenarios. Results: After screening, 46 studies published between 2010 and 2024 were included. Three missingness mechanisms were simulated when evaluating the missing data methods: missing completely at random (29/46), missing at random (20/46), and missing not at random (21/46). Multiple imputation by chained equations (MICE) was the most popular statistical method, whereas generative adversarial network-based methods and k-nearest neighbor (KNN) classification were the most common deep-learning-based and traditional machine-learning-based methods, respectively. Among the 26 articles comparing the performance of statistical and machine learning approaches, traditional machine learning or deep learning methods generally outperformed statistical methods. Med.KNN and context-aware time-series imputation performed better for longitudinal datasets, whereas probabilistic principal component analysis and MICE-based methods were optimal for cross-sectional datasets. Conclusions: Machine learning methods show significant promise for addressing missing data in EHRs. However, no single approach provides a universally generalizable solution. Standardized benchmarking analyses are essential to evaluate these methods across different missingness scenarios.
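For orientation, the Python sketch below applies two of the imputation families mentioned above, MICE-style iterative imputation and KNN imputation, to synthetic data with values masked completely at random using scikit-learn; the data and missingness rate are assumptions chosen purely for illustration.

```python
# Minimal sketch: MICE-style iterative imputation vs KNN imputation on
# synthetic data with ~20% of entries missing completely at random.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X[:, 1] += 0.7 * X[:, 0]                          # correlated columns help imputation
mask = rng.random(X.shape) < 0.2                  # missing-completely-at-random mask
X_missing = np.where(mask, np.nan, X)

X_mice = IterativeImputer(max_iter=10, random_state=0).fit_transform(X_missing)
X_knn = KNNImputer(n_neighbors=5).fit_transform(X_missing)

for name, Xi in [("MICE", X_mice), ("KNN", X_knn)]:
    rmse = np.sqrt(np.mean((Xi[mask] - X[mask]) ** 2))
    print(f"{name} imputation RMSE on masked entries: {rmse:.3f}")
```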
Project description: Background: Large-scale patterns or trends in species diversity have long interested ecologists. The classic pattern is for diversity (e.g., species richness) to decrease with increasing latitude. Taxonomic distinctness is a diversity measure based on the relatedness of the species within a sample. Here we examined patterns of taxonomic distinctness in relation to latitude (ca. 32-48 degrees N) and depth (ca. 50-1220 m) for demersal fishes on the continental shelf and slope of the US Pacific coast. Methodology/principal findings: Both average taxonomic distinctness (AvTD) and variation in taxonomic distinctness (VarTD) changed with latitude and depth. AvTD was highest at approximately 500 m and lowest at around 200 m bottom depth. Latitudinal trends in AvTD were somewhat weaker and were depth-specific. AvTD increased with latitude on the shelf (50-150 m) but tended to decrease with latitude at deeper depths. VarTD was highest around 300 m. As with AvTD, latitudinal trends in VarTD were depth-specific. On the shelf (50-150 m), VarTD increased with latitude, while in deeper areas the patterns were more complex. Closer inspection of the data showed that the number and distribution of species within the class Chondrichthyes were the primary drivers of the overall patterns seen in AvTD and VarTD, while the relatedness and distribution of species in the order Scorpaeniformes appeared to cause the relatively low observed values of AvTD at around 200 m. Conclusions/significance: These trends contrast to some extent with the patterns seen in earlier studies of species richness and evenness in demersal fishes along this coast and add to our understanding of the diversity of the demersal fishes of the California Current.
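The two indices can be stated compactly: AvTD is the mean taxonomic path length over all pairs of species in a sample, and VarTD is the variance of those path lengths. The Python sketch below computes both from a small, made-up matrix of pairwise taxonomic distances; it is illustrative only and unrelated to the study's species data.

```python
# Minimal sketch: average taxonomic distinctness (AvTD) and variation in
# taxonomic distinctness (VarTD) from a toy pairwise taxonomic distance matrix.
import numpy as np

# omega[i, j]: taxonomic path length between species i and j (made-up values)
omega = np.array([
    [0, 1, 3, 4, 4],
    [1, 0, 3, 4, 4],
    [3, 3, 0, 4, 4],
    [4, 4, 4, 0, 2],
    [4, 4, 4, 2, 0],
], dtype=float)

iu = np.triu_indices_from(omega, k=1)     # each unordered species pair once
pairs = omega[iu]

avtd = pairs.mean()                       # AvTD: mean pairwise path length
vartd = ((pairs - avtd) ** 2).mean()      # VarTD: variance of path lengths
print(f"AvTD = {avtd:.2f}, VarTD = {vartd:.2f}")
```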
Project description: Myelofibrosis (MF) is a myeloproliferative neoplasm characterized by ineffective clonal hematopoiesis, splenomegaly, bone marrow fibrosis, and a propensity for transformation to acute myeloid leukemia. The discovery of mutations in JAK2, CALR, and MPL has uncovered activated JAK-STAT signaling as a primary driver of MF, supporting a rationale for JAK inhibition. However, JAK inhibition alone is insufficient for long-term remission and offers modest, if any, disease-modifying effects. Given this, there is great interest in identifying mechanisms that cooperate with JAK-STAT signaling to predict disease progression and rationally guide the development of novel therapies. This review outlines the latest discoveries in the biology of MF, discusses the current clinical management of patients with MF, and summarizes the ongoing clinical trials that hope to change the landscape of MF treatment.