Project description: Quantifying sequence-specific protein-ligand interactions is critical for understanding and exploiting numerous cellular processes, including gene expression regulation and signal transduction. Given their importance, next-generation sequencing (NGS) based assays that characterize such recognition with high throughput are increasingly being used to profile a range of protein classes and interactions. However, these methods do not measure the biophysical parameters that have long been used to uncover the quantitative rules underlying sequence recognition. We developed a highly flexible machine learning framework, called ProBound, to quantify sequence recognition in terms of biophysical parameters based on NGS data. ProBound quantifies transcription factor (TF) behavior with models that accurately predict binding affinity over a range exceeding that of previous resources, captures the impact of DNA modifications and conformational flexibility of multi-TF complexes, and infers specificity directly from in vivo data such as ChIP-seq without peak calling. When coupled with a new assay called Kd-seq, it quantifies the absolute affinity of protein-ligand interactions. Its applicability extends beyond thermodynamic equilibrium binding to the kinetics of kinase-substrate interactions. Altogether, ProBound provides a versatile algorithmic framework for understanding sequence recognition in a wide variety of biological contexts.
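As a rough illustration of the biophysical framing above, the sketch below fits a position-specific binding free-energy matrix to NGS probe counts by maximum likelihood. It is a hypothetical toy example, not ProBound's actual model or code; the Poisson count model and the sequence data are assumptions for illustration only.

```python
# Hypothetical sketch: fitting a position-specific binding free-energy matrix
# to NGS probe counts, in the spirit of (but not identical to) ProBound.
import numpy as np
from scipy.optimize import minimize

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA sequence as a (length x 4) one-hot matrix."""
    x = np.zeros((len(seq), 4))
    for i, b in enumerate(seq):
        x[i, BASES.index(b)] = 1.0
    return x

def neg_log_likelihood(theta, X, counts):
    """Poisson NLL with expected count proportional to exp(-binding energy)."""
    L = X[0].shape[0]
    energy_matrix = theta.reshape(L, 4)
    # ddG of each probe is the sum of per-position energy contributions
    ddg = np.array([np.sum(x * energy_matrix) for x in X])
    mu = np.exp(-ddg)  # Boltzmann-weighted expected count
    return np.sum(mu - counts * np.log(mu + 1e-12))

# toy data: probe sequences and their observed post-selection counts
seqs = ["ACGT", "AAGT", "TCGT", "ACGA"]
counts = np.array([120.0, 40.0, 15.0, 90.0])
X = [one_hot(s) for s in seqs]

res = minimize(neg_log_likelihood, np.zeros(4 * 4), args=(X, counts))
print(res.x.reshape(4, 4))  # fitted energy matrix (arbitrary units)
```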
Project description: High-Reynolds-number homogeneous isotropic turbulence (HIT) is fully described by the Navier-Stokes (NS) equations, which are notoriously difficult to solve numerically. Engineers, interested primarily in describing turbulence at a reduced range of resolved scales, have designed heuristics known as large eddy simulation (LES). LES is described in terms of a temporally evolving Eulerian velocity field defined over a spatial grid whose mean spacing corresponds to the resolved scale. This classic Eulerian LES depends on assumptions about the effects of subgrid scales on the resolved scales. Here, we take an alternative approach and design LES heuristics stated in terms of Lagrangian particles moving with the flow. Our Lagrangian LES, thus L-LES, is described by equations generalizing the weakly compressible smoothed particle hydrodynamics formulation with extended parametric and functional freedom, which is then calibrated via machine learning trained on Lagrangian data from direct numerical simulations of the NS equations. The L-LES model includes physics-informed parameterization and functional form, combining physics-based parameters and physics-inspired neural networks to describe the evolution of turbulence within the resolved range of scales. The subgrid-scale contributions are modeled separately with physical constraints to account for the effects of unresolved scales. We build the resulting model under the differentiable programming framework to facilitate efficient training. We experiment with loss functions of different types, including physics-informed ones accounting for statistics of Lagrangian particles. We show that our L-LES model is capable of reproducing Eulerian and unique Lagrangian turbulence structures and statistics over a range of turbulent Mach numbers.
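To make the weakly compressible SPH backbone concrete, here is a minimal Python sketch of one particle update with a placeholder hook for a learned correction. The Gaussian kernel, the kernel-summation density, and the `closure` hook are assumptions for illustration; this is not the authors' L-LES implementation, which is built in a differentiable-programming framework.

```python
# Minimal weakly compressible SPH step (2D, Gaussian kernel) with a slot for
# a learned subgrid correction, loosely echoing the L-LES formulation above.
import numpy as np

def kernel_grad(r, h):
    """Gradient of a Gaussian smoothing kernel W(r, h) in 2D."""
    q2 = np.dot(r, r) / h**2
    w = np.exp(-q2) / (np.pi * h**2)
    return (-2.0 / h**2) * w * r

def density(pos, mass, h):
    """Kernel-summation density estimate for each particle."""
    n = len(pos)
    rho = np.zeros(n)
    for i in range(n):
        for j in range(n):
            q2 = np.dot(pos[i] - pos[j], pos[i] - pos[j]) / h**2
            rho[i] += mass[j] * np.exp(-q2) / (np.pi * h**2)
    return rho

def sph_step(pos, vel, mass, h, dt, c0=10.0, rho0=1.0, closure=None):
    """One step of weakly compressible SPH; `closure` stands in for the
    trained neural-network term acting on the resolved scales."""
    rho = density(pos, mass, h)
    p = c0**2 * (rho - rho0)  # weakly compressible equation of state
    acc = np.zeros_like(pos)
    for i in range(len(pos)):
        for j in range(len(pos)):
            if i != j:
                gw = kernel_grad(pos[i] - pos[j], h)
                # symmetric pressure-gradient term
                acc[i] -= mass[j] * (p[i] / rho[i]**2 + p[j] / rho[j]**2) * gw
        if closure is not None:
            acc[i] += closure(pos[i], vel[i])  # learned subgrid contribution
    return pos + dt * vel, vel + dt * acc

# toy usage: 16 particles on a grid, no learned closure
pts = np.stack(np.meshgrid(np.arange(4.0), np.arange(4.0)), -1).reshape(-1, 2)
pos, vel = sph_step(pts, np.zeros_like(pts), np.ones(16), h=1.5, dt=0.01)
```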
Project description: The rapid advances in science and technology in the field of artificial neural networks have led to noticeable interest in applying this technology in medicine. Given the need for medical sensors that monitor vital signs both in everyday life and in clinical research, computer-based techniques merit consideration. This paper describes the latest progress in heart rate sensors empowered by machine learning methods. The paper is based on a review of the literature and patents from recent years, and is reported according to the PRISMA 2020 statement. The most important challenges and prospects in this field are presented. Key applications of machine learning in medical sensors used for medical diagnostics are discussed, covering data collection, processing, and interpretation of results. Although current solutions are not yet able to operate independently, especially in the diagnostic context, it is likely that medical sensors will be further developed using advanced artificial intelligence methods.
Project description: Techniques of data mining and machine learning were applied to a large database of medical and facility claims from commercially insured patients to determine the prevalence, gender demographics, and costs for individuals with provider-assigned diagnosis codes for myalgic encephalomyelitis (ME) or chronic fatigue syndrome (CFS). The frequency of diagnosis was 519-1,038/100,000, with the relative risk of females being diagnosed with ME or CFS, compared to males, being 1.238 and 1.178, respectively. While the percentage of women diagnosed with ME/CFS is higher than the percentage of men, ME/CFS is not a "women's disease": thirty-five to forty percent of diagnosed patients are men. Extrapolating from this frequency of diagnosis and the estimated 2017 population of the United States, a rough estimate for the number of patients who may be diagnosed with ME or CFS in the U.S. is 1.7 million to 3.38 million. Patients diagnosed with CFS appear to represent a more heterogeneous group than those diagnosed with ME. A machine learning model based on characteristics of individuals diagnosed with ME was developed and applied, resulting in a predicted prevalence of 857/100,000 (p > 0.01), or roughly 2.8 million in the U.S. Average annual costs for individuals with a diagnosis of ME or CFS were compared with those for lupus (all categories) and multiple sclerosis (MS), and found to be 50% higher for ME and CFS than for lupus or MS, and three to four times higher than for the general insured population. A separate aspect of the study attempted to determine whether a diagnosis of ME or CFS could be predicted from symptom codes in the insurance claims records. Due to the absence of specific codes for some core symptoms, we were unable to validate that the information in insurance claims records is sufficient to identify diagnosed patients, or to suggest that a diagnosis of ME or CFS should be considered based solely on the presence of those symptoms. These results show that a prevalence rate of 857/100,000 for ME/CFS is not unreasonable; it is therefore not a rare disease, but in fact a relatively common one.
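The population extrapolation quoted above is simple arithmetic; the snippet below reproduces it, assuming the U.S. Census Bureau's 2017 population estimate of roughly 325.7 million (the abstract does not state the exact figure used).

```python
# Sketch of the prevalence extrapolation described above. The 2017 U.S.
# population figure is an assumption; ~325.7 million is the Census Bureau's
# 2017 estimate.
US_POP_2017 = 325.7e6

for rate_per_100k in (519, 1038, 857):
    n = rate_per_100k / 100_000 * US_POP_2017
    print(f"{rate_per_100k}/100,000 -> ~{n / 1e6:.2f} million diagnosed")
# 519/100,000  -> ~1.69 million
# 1038/100,000 -> ~3.38 million
# 857/100,000  -> ~2.79 million
```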
Project description: Aims: Models predicting mortality in heart failure (HF) patients are often limited with regard to performance and applicability. The aim of this study was to develop a reliable algorithm to compute expected in-hospital mortality rates in HF cohorts at the population level from administrative data, comparing regression analysis with different machine learning (ML) models. Methods and results: Inpatient cases with a primary International Statistical Classification of Diseases and Related Health Problems (ICD-10) encoded discharge diagnosis of HF, non-electively admitted to 86 German Helios hospitals between 1 January 2016 and 31 December 2018, were identified. The dataset was randomly split 75%/25% for model development and testing. Highly unbalanced variables were removed. Four ML algorithms were applied, and all algorithms were tuned using a grid search with multiple repetitions. Model performance was evaluated by computing receiver operating characteristic areas under the curve (AUCs). In total, 59,125 cases (69.8% aged 75 years or older, 51.9% female) were investigated, and in-hospital mortality was 6.20%. In the testing dataset, all ML algorithms outperformed logistic regression, with AUCs of 0.829 [95% confidence interval (CI) 0.814-0.843] for logistic regression, 0.875 (95% CI 0.863-0.886) for random forest, 0.882 (95% CI 0.871-0.893) for gradient boosting machine, 0.866 (95% CI 0.854-0.878) for single-layer neural networks, and 0.882 (95% CI 0.872-0.893) for extreme gradient boosting. Brier scores demonstrated good calibration, especially of the latter three models. Conclusions: We introduced reliable models to calculate expected in-hospital mortality based only on administrative routine data using ML algorithms. Broad application could supplement quality measurement programs and thereby improve future HF patient care.
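As an illustration of the tuning-and-evaluation workflow described above (repeated grid search, 75%/25% split, AUC and Brier score), here is a minimal scikit-learn sketch on synthetic data; the models, parameter grid, and dataset are stand-ins, not the study's actual pipeline.

```python
# Illustrative sketch: tuning a gradient boosting classifier with a repeated
# grid search and evaluating ROC-AUC and calibration on a held-out 25% split.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     train_test_split)

# synthetic stand-in for the administrative HF dataset (~6% mortality)
X, y = make_classification(n_samples=5000, n_features=30, weights=[0.94],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y,
                                          random_state=0)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [2, 3]},
    scoring="roc_auc", cv=cv,
)
grid.fit(X_tr, y_tr)

for name, model in [("logistic regression",
                     LogisticRegression(max_iter=1000).fit(X_tr, y_tr)),
                    ("gradient boosting", grid.best_estimator_)]:
    p = model.predict_proba(X_te)[:, 1]
    print(f"{name}: AUC={roc_auc_score(y_te, p):.3f}, "
          f"Brier={brier_score_loss(y_te, p):.4f}")
```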
Project description: Accurate identification of patient populations is an essential component of clinical research, especially for medical conditions such as chronic cough that are inconsistently defined and diagnosed. We aimed to develop and compare machine learning models to identify chronic cough from medical and pharmacy claims data. In this retrospective observational study, we compared three machine learning algorithms based on XGBoost, logistic regression, and neural network approaches using a large claims and electronic health record database. Of the 327,423 patients who met the study criteria, 4,818 had chronic cough based on linked claims-electronic health record data. The XGBoost model showed the best performance, achieving a receiver operating characteristic area under the curve (ROC-AUC) of 0.916. We selected a cutoff that favors a high positive predictive value (PPV) to minimize false positives, resulting in a sensitivity, specificity, PPV, and negative predictive value of 18.0%, 99.6%, 38.7%, and 98.8%, respectively, on the held-out testing set (n = 82,262). The logistic regression and neural network models achieved slightly lower ROC-AUCs of 0.907 and 0.838, respectively. The XGBoost and logistic regression models maintained their robust performance in subgroups of individuals with higher rates of chronic cough. Machine learning algorithms are one way of identifying conditions that are not explicitly coded in medical records, and can help identify individuals with chronic cough from claims data with useful classification performance.
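The cutoff-selection step above can be sketched as follows: train an XGBoost classifier, scan probability thresholds for the lowest one whose PPV clears a target, then report sensitivity, specificity, PPV, and NPV on held-out data. Everything here (data, threshold grid, target PPV) is a hypothetical stand-in for the study's procedure.

```python
# Hedged sketch on synthetic data: choosing a probability cutoff that favors
# a high positive predictive value, then reporting operating characteristics.
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# synthetic stand-in for claims features with ~1.5% chronic cough prevalence
X, y = make_classification(n_samples=20000, n_features=40, weights=[0.985],
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y,
                                          random_state=1)

model = xgb.XGBClassifier(n_estimators=200, eval_metric="logloss")
model.fit(X_tr, y_tr)
prob = model.predict_proba(X_te)[:, 1]

# scan thresholds; keep the lowest one whose PPV clears the target
target_ppv, cutoff = 0.40, 0.5
for t in np.linspace(0.05, 0.95, 181):
    pred = prob >= t
    if pred.sum() > 0 and y_te[pred].mean() >= target_ppv:
        cutoff = t
        break

pred = prob >= cutoff
tp = int(np.sum(pred & (y_te == 1)));  fp = int(np.sum(pred & (y_te == 0)))
fn = int(np.sum(~pred & (y_te == 1))); tn = int(np.sum(~pred & (y_te == 0)))
print(f"cutoff={cutoff:.2f} sensitivity={tp/(tp+fn):.3f} "
      f"specificity={tn/(tn+fp):.3f} PPV={tp/(tp+fp):.3f} NPV={tn/(tn+fn):.3f}")
```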
Project description: Having accurate maps depicting the locations of residential buildings across a region benefits a range of sectors. This is particularly true for public health programs focused on delivering services at the household level, such as indoor residual spraying with insecticide to help prevent malaria. While open-source data from OpenStreetMap (OSM) depicting the locations and shapes of buildings is rapidly improving in quality and completeness globally, even in settings where all buildings have been mapped, information on whether these buildings are residential, commercial, or another type is often only available for a small subset. Using OSM building data from Botswana and Swaziland, we identified buildings for which 'type' was indicated, generated via on-the-ground observations, and classified these into two classes, "sprayable" and "not-sprayable". Ensemble machine learning, using building characteristics such as size, shape, and proximity to neighbouring features, was then used to build a model predicting which of these two classes every building in these two countries falls into. Results show that an ensemble machine learning approach performed marginally, but statistically significantly, better than the best individual model, and that using this ensemble model we were able to correctly classify >86% of structures (using independent test data) as sprayable or not-sprayable across both countries.
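A minimal sketch of the ensemble classification idea, assuming hypothetical geometric features and a soft-voting combination of standard scikit-learn models (the paper's actual feature set and ensembling scheme may differ):

```python
# Illustrative sketch: a soft-voting ensemble over building characteristics
# such as size, shape, and distance to the nearest neighbouring feature.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# stand-in for OSM-derived features: area, perimeter/area ratio,
# neighbour distance, and so on
X, y = make_classification(n_samples=5000, n_features=8, n_informative=5,
                           random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=2)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="soft",  # average predicted probabilities across models
)
ensemble.fit(X_tr, y_tr)
print(f"sprayable/not-sprayable accuracy: {ensemble.score(X_te, y_te):.3f}")
```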
Project description: To phenotype mechanistic differences between heart failure with reduced (HFrEF) and preserved (HFpEF) ejection fraction, a closed-loop model of the cardiovascular system coupled with patient-specific transthoracic echocardiography (TTE) and right heart catheterization (RHC) data was used to identify key parameters representing haemodynamics. Thirty-one patient records (10 HFrEF, 21 HFpEF) were obtained from the Cardiovascular Health Improvement Project database at the University of Michigan. Model simulations were tuned to match RHC and TTE pressure, volume, and cardiac output measurements in each patient. The underlying physiological model parameters were plotted against model-based norms and compared between HFrEF and HFpEF. Our results confirm that the main mechanistic parameter driving HFrEF is reduced left ventricular (LV) contractility, whereas HFpEF exhibits a heterogeneous phenotype. Conducting principal component analysis, k-means clustering, and hierarchical clustering on the optimized parameters reveals (i) a group of HFrEF-like HFpEF patients (HFpEF1), (ii) a classic HFpEF group (HFpEF2), and (iii) a group of HFpEF patients that do not consistently cluster (NCC). These subgroups cannot be distinguished from the clinical data alone. Increased LV active contractility (p < 0.001) and LV passive stiffness (p < 0.001) at rest are observed when comparing HFpEF2 to HFpEF1. Analysing the clinical data of each subgroup reveals that the elevated systolic and diastolic LV volumes seen in both HFrEF and HFpEF1 may be used as a biomarker to identify HFrEF-like HFpEF patients. These results suggest that modelling of the cardiovascular system and optimizing to standard clinical data can designate subgroups of HFpEF as separate phenotypes, possibly elucidating patient-specific treatment strategies. KEY POINTS: Analysis of data from right heart catheterization (RHC) and transthoracic echocardiography (TTE) of heart failure (HF) patients using a closed-loop model of the cardiovascular system identifies key parameters representing haemodynamic cardiovascular function in patients with heart failure with reduced and preserved ejection fraction (HFrEF and HFpEF). Analysing optimized parameters representing cardiovascular function using machine learning shows mechanistic differences between HFpEF groups that are not seen when analysing clinical data alone. The HFpEF patients presented here can be subdivided into three subgroups: HFpEF1, described as 'HFrEF-like HFpEF'; HFpEF2, as 'classic HFpEF'; and a third group of HFpEF patients that do not consistently cluster. Focusing purely on cardiac function consistently captures the underlying dysfunction in HFrEF, whereas HFpEF is better characterized by dysfunction of the entire cardiovascular system. Our methodology reveals that elevated left ventricular systolic and diastolic volumes are potential biomarkers for identifying HFrEF-like HFpEF patients.
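The clustering analysis described above (PCA followed by k-means and hierarchical clustering on the optimized parameters) might look like the following sketch; the synthetic parameter matrix is a stand-in for the 31 patients' optimized model parameters.

```python
# Minimal sketch: PCA followed by k-means and hierarchical clustering on a
# synthetic stand-in for the optimized cardiovascular model parameters.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
params = rng.normal(size=(31, 12))  # 31 patients x 12 optimized parameters

Z = StandardScaler().fit_transform(params)
scores = PCA(n_components=2).fit_transform(Z)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=3).fit_predict(scores)
hc_labels = AgglomerativeClustering(n_clusters=3).fit_predict(scores)

# conceptually, patients whose assignments differ between methods (after
# matching cluster labels) are candidates for the "not consistently
# clustering" (NCC) group described above
print("k-means:     ", km_labels)
print("hierarchical:", hc_labels)
```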
Project description: The quality of treatment and prognosis after pediatric congenital heart surgery remains unsatisfactory. A reliable prediction model for postoperative complications in congenital heart surgery patients is essential to enable prompt initiation of therapy and improve the quality of prognosis. Here, we develop an interpretable machine-learning-based model that integrates patient demographics, surgery-specific features, and intraoperative blood pressure data to accurately predict complications after pediatric congenital heart surgery. We used blood pressure variability and the k-means algorithm combined with a smoothed formulation of dynamic time warping to extract features from time-series data. In addition, the SHAP framework was used to provide explanations of the predictions. Our model achieved the best performance in both binary and multi-label classification compared with other consensus-based risk models. In addition, this explainable model explains why a prediction was made, helping to improve the clinical understanding of complication risk and generate actionable knowledge in practice. The combination of model performance and interpretability is easy for clinicians to trust and provides insight into how they should respond before the condition worsens after pediatric congenital heart surgery.
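A hedged sketch of the time-series feature extraction step, using the tslearn library's k-means with a soft-DTW metric as a stand-in for the authors' smoothed dynamic time warping formulation, on simulated blood pressure traces:

```python
# Hedged sketch (synthetic blood-pressure traces, not the authors' code):
# clustering intraoperative time series with k-means under a smoothed (soft)
# dynamic time warping metric, via tslearn.
import numpy as np
from tslearn.clustering import TimeSeriesKMeans
from tslearn.preprocessing import TimeSeriesScalerMeanVariance

rng = np.random.default_rng(4)
# 60 patients x 200 time points of simulated arterial pressure
bp = 80 + 10 * np.sin(np.linspace(0, 6, 200)) + rng.normal(0, 3, size=(60, 200))

X = TimeSeriesScalerMeanVariance().fit_transform(bp)
km = TimeSeriesKMeans(n_clusters=3, metric="softdtw",
                      metric_params={"gamma": 0.5}, random_state=4)
labels = km.fit_predict(X)
# cluster membership becomes a categorical feature for the downstream
# complication classifier; variability statistics can be added alongside
print(np.bincount(labels))
```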
Project description: Objective: Anaphylaxis is a severe life-threatening allergic reaction, and its accurate identification in healthcare databases can harness the potential of "Big Data" for healthcare or public health purposes. Methods: This study used claims data obtained between October 1, 2015 and February 28, 2019 from the CMS database to examine the utility of machine learning in identifying incident anaphylaxis cases. We created a feature selection pipeline to identify critical features between different datasets. A variety of unsupervised and supervised methods were then used (e.g., Sammon mapping and eXtreme Gradient Boosting) to train models on datasets of differing data quality, reflecting the varying availability and potential rarity of ground truth data in medical databases. Results: Resulting machine learning model accuracies ranged between 47.7% and 94.4% when tested on ground truth data. Finally, we found new features to help experts enhance existing case-finding algorithms. Discussion: Developing precise algorithms to detect medical outcomes in claims can be a laborious and expensive process, particularly for conditions presented and coded diversely. We found it beneficial to filter out highly potent codes used for data curation in order to identify underlying patterns and features. To improve rule-based algorithms where necessary, researchers could use model explainers to determine noteworthy features, which could then be shared with experts and included in the algorithm. Conclusion: Our work suggests machine learning models can perform at levels similar to a previously published expert case-finding algorithm, while also having the potential to improve performance or streamline algorithm construction by identifying new relevant features.
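As a loose illustration of the workflow above (a feature-selection pipeline feeding a gradient-boosted case-finding model, with influential features surfaced for expert review), here is a self-contained sketch on synthetic claims-like data; all names and parameters are hypothetical.

```python
# Illustrative sketch: feature selection feeding an XGBoost case-finding
# model, with the model's most influential features printed for review.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
import xgboost as xgb

# synthetic stand-in for sparse claims features with rare positive cases
X, y = make_classification(n_samples=10000, n_features=60, n_informative=10,
                           weights=[0.97], random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y,
                                          random_state=5)

pipe = Pipeline([
    ("select", SelectKBest(mutual_info_classif, k=20)),
    ("clf", xgb.XGBClassifier(n_estimators=300, eval_metric="logloss")),
])
pipe.fit(X_tr, y_tr)
print(f"held-out accuracy: {pipe.score(X_te, y_te):.3f}")

# features the model leans on most: candidates to share with clinical experts
kept = pipe.named_steps["select"].get_support(indices=True)
gains = pipe.named_steps["clf"].feature_importances_
top = sorted(zip(kept, gains), key=lambda t: -t[1])[:5]
print("top features (index, importance):", top)
```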