Project description:Introduction: Psychrophilic enzymes are a class of macromolecules with high catalytic activity at low temperatures. Cold-active enzymes, which are eco-friendly and cost-effective, have great potential for application in the detergent, textile, environmental remediation, pharmaceutical, and food industries. Compared with time-consuming and labor-intensive experiments, computational modeling, especially with machine learning (ML) algorithms, offers a high-throughput screening tool for identifying psychrophilic enzymes efficiently. Methods: In this study, the influence of four ML methods (support vector machine, K-nearest neighbor, random forest, and naïve Bayes) and three descriptors, i.e., amino acid composition (AAC), dipeptide combinations (DPC), and AAC + DPC, on model performance was systematically analyzed. Results and discussion: Among the four ML methods, the support vector machine model based on the AAC descriptor achieved the best prediction accuracy, 80.6%, under 5-fold cross-validation. The AAC descriptor outperformed the DPC and AAC + DPC descriptors regardless of the ML method used. In addition, a comparison of amino acid frequencies between psychrophilic and non-psychrophilic proteins revealed that higher frequencies of Ala, Gly, Ser, and Thr and lower frequencies of Glu, Lys, Arg, Ile, Val, and Leu could be related to protein psychrophilicity. Further, ternary models were developed that could classify psychrophilic, mesophilic, and thermophilic proteins effectively. The predictive accuracy of the ternary classification model using the AAC descriptor with the support vector machine algorithm was 75.8%. These findings enhance our insight into the cold-adaptation mechanisms of psychrophilic proteins and could aid in the design of engineered cold-active enzymes. Moreover, the proposed model could be used as a screening tool to identify novel cold-adapted proteins.
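To make the two descriptors named above concrete, the following is a minimal sketch of how AAC and DPC feature vectors are commonly computed from a protein sequence. The function names `aac` and `dpc`, the residue ordering, and the handling of non-standard residues are assumptions made for illustration; the abstract does not specify the exact definitions used in the study.

```python
from itertools import product

# The 20 standard amino acids in a fixed (assumed) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Amino acid composition: fraction of each standard residue (20 values)."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    n = len(residues)
    return [residues.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def dpc(sequence):
    """Dipeptide composition: fraction of each ordered residue pair (400 values).

    Only adjacent pairs in which both residues are standard are counted
    (an assumption; other conventions exist).
    """
    s = sequence.upper()
    pairs = [
        s[i:i + 2]
        for i in range(len(s) - 1)
        if s[i] in AMINO_ACIDS and s[i + 1] in AMINO_ACIDS
    ]
    n = len(pairs)
    return [
        pairs.count(a + b) / n if n else 0.0
        for a, b in product(AMINO_ACIDS, repeat=2)
    ]


def aac_plus_dpc(sequence):
    """Concatenated AAC + DPC descriptor (420 values)."""
    return aac(sequence) + dpc(sequence)
```

Under these assumptions, a sequence maps to a 20-, 400-, or 420-dimensional vector, which can then be passed to any of the classifiers listed in the abstract.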
Project description:Background: Early unplanned hospital readmissions are associated with increased harm to patients, increased medical costs, and damage to hospital reputation. Identifying at-risk patients is a crucial step toward improving care, because appropriate interventions can then be adopted to prevent readmission. This study aimed to build machine learning models to predict 14-day unplanned readmissions. Methods: We conducted a retrospective cohort study of 37,091 consecutive hospitalized adult patients with 55,933 discharges between September 1, 2018, and August 31, 2019, in a 1193-bed university hospital. Patients who were aged < 20 years, were admitted for cancer-related treatment, participated in a clinical trial, were discharged against medical advice, died during admission, or lived abroad were excluded. Predictors for analysis included 7 categories of variables extracted from the hospital's medical record dataset. Four machine learning algorithms, namely logistic regression, random forest, extreme gradient boosting, and categorical boosting (CatBoost), were used to build classifiers for prediction. The performance of the prediction models for 14-day unplanned readmission risk was evaluated using precision, recall, F1-score, area under the receiver operating characteristic curve (AUROC), and area under the precision-recall curve (AUPRC). Results: In total, 24,722 patients were included in the analysis. The mean age of the cohort was 57.34 ± 18.13 years. The 14-day unplanned readmission rate was 1.22%. Among the four machine learning algorithms, CatBoost had the best average performance in fivefold cross-validation (precision: 0.9377, recall: 0.5333, F1-score: 0.6780, AUROC: 0.9903, and AUPRC: 0.7515).
After incorporating the 21 most influential features in the CatBoost model, its performance improved (precision: 0.9470, recall: 0.5600, F1-score: 0.7010, AUROC: 0.9909, and AUPRC: 0.7711). Conclusions: Our models reliably predicted 14-day unplanned readmissions and were explainable. They can be used to identify patients with a high risk of unplanned readmission based on influential features, particularly features related to diagnoses. The behavior of the models with respect to physiological indicators was also consistent with clinical experience and the literature. Identifying high-risk patients with these models can enable early discharge planning and transitional care to prevent readmissions. Further studies should include additional features that may improve sensitivity in identifying patients at risk of early unplanned readmission.
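As a reminder of how the threshold-based metrics reported above relate to one another, here is a minimal sketch that derives precision, recall, and F1-score from confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the study. Note that an F1-score averaged over cross-validation folds generally differs slightly from the F1 computed from averaged precision and recall.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive,
    and false-negative counts. Undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only: 8 readmissions correctly
# flagged, 2 false alarms, 2 readmissions missed.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

With a positive class as rare as the 1.22% readmission rate reported here, precision, recall, and AUPRC are usually more informative than accuracy or AUROC alone, because true negatives dominate the latter.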
Project description:Rising global population and climate change realities dictate that agricultural productivity must be accelerated. Results from current traditional research approaches are difficult to extrapolate to all possible fields because they depend on specific soil types, weather conditions, and background management combinations that are neither applicable nor translatable to all farms. A method that accurately evaluates the effectiveness of infinite cropping system interactions (involving multiple management practices) to increase maize and soybean yield across the US does not exist. Here, we utilize extensive databases and artificial intelligence algorithms and show that complex interactions, which cannot be evaluated in replicated trials, are associated with large crop yield variability and thus with potential for substantial yield increases. Our approach can accelerate agricultural research, identify sustainable practices, and help meet future food demands.
Project description:Classification and quantitative characterization of neuronal morphologies from histological neuronal reconstruction is challenging, since it is still unclear how to delineate a neuronal cell class and which features best define it. Morphological neuron characterization is a primary source for anatomical comparisons, morphometric analysis of cells, and brain modeling. The objectives of this paper are (i) to develop and integrate a pipeline that goes from morphological feature extraction to classification and (ii) to assess and compare the accuracy of machine learning algorithms in classifying neuron morphologies. The algorithms were trained on 430 digitally reconstructed neurons from the rat somatosensory cortex, drawn from young and/or adult populations and subjectively classified into layers and/or m-types. Among supervised algorithms, linear discriminant analysis provided better classification results than the others. Among unsupervised algorithms, the affinity propagation and Ward algorithms provided slightly better results.
Project description:Recently, many new cultivars have been taken abroad illegally, which is now considered an international issue. Botanical evidence found at a crime scene provides valuable information about the origin of the sample. However, botanical resources for forensic evidence remain underutilized because molecular markers, such as microsatellites, are not available outside a limited set of species. Multiplexed intersimple sequence repeat (ISSR) genotyping by sequencing (MIG-seq) and its analysis method, identification of not applicable (iD-NA), have been used to determine several genome-wide genetic markers, making them applicable to all plant species, including those with limited available genetic information. Camellia cultivars are popular worldwide; they are planted in many gardens and bred to make new cultivars. In this study, we aimed to analyze Camellia cultivars/species through MIG-seq. MIG-seq could discriminate similar samples, such as bud mutants and closely related samples that could not be distinguished based on morphological features. This discrimination was consistent with that of a previous study that classified cultivars based on short tandem repeat (STR) markers, indicating that the discrimination ability of MIG-seq is equal to or higher than that of STR markers. Furthermore, we observed previously unknown phylogenetic relationships. Because MIG-seq can be applied to unlimited species and low-quality DNA, it may be useful in various scientific fields.
Project description:Model-based sensitivity analysis is crucial in quantifying which input variability parameters are important for nondestructive testing (NDT) systems. In this work, neural networks (NN) and convolutional NN (CNN) are shown to be computationally efficient at making model predictions for NDT systems when compared to models such as polynomial chaos expansions, Kriging, and polynomial chaos Kriging (PC-Kriging). Three different ultrasonic benchmark cases are considered. NN outperformed these three models in all the cases, while CNN outperformed them in two of the three cases. In the third case, CNN performed as well as PC-Kriging. NN required 48, 56 and 35 high-fidelity model evaluations, respectively, for the three cases to reach within
Project description:Background: Epilepsy is a disorder that can manifest as abnormalities in neurological or physical function. Stress cardiomyopathy is closely associated with neurological stimulation. However, the mechanisms underlying the interrelationship between epilepsy and stress cardiomyopathy are unclear. This paper aims to explore the genetic features and potential molecular mechanisms shared by epilepsy and stress cardiomyopathy. Methods: The epilepsy and stress cardiomyopathy datasets were analyzed separately, and the intersection of the differentially co-expressed genes of the two diseases was obtained. These genes were used to reveal the biological functions involved; a network was constructed and core modules were identified to reveal the interaction mechanism; co-expressed genes with diagnostic validity were screened with machine learning algorithms; and the co-expressed genes were validated in parallel on epilepsy single-cell data and a stress cardiomyopathy rat model. Results: Epilepsy causes stress cardiomyopathy. Its key pathways are Complement and coagulation cascades and the HIF-1 signaling pathway, and its key co-expressed genes include SPOCK2, CTSZ, HLA-DMB, ALDOA, SFRP1, and ERBB3. The key immune cell subpopulations localized by single-cell data are the T_cells subgroup, Microglia subgroup, Macrophage subgroup, Astrocyte subgroup, and Oligodendrocytes subgroup. Conclusion: We believe that epilepsy-induced stress cardiomyopathy results from a multi-gene, multi-pathway combination. We identified the core co-expressed genes (SPOCK2, CTSZ, HLA-DMB, ALDOA, SFRP1, ERBB3) and the pathways in which they function (Complement and coagulation cascades, HIF-1 signaling pathway, JAK-STAT signaling pathway), and finally localized their key cellular subgroups (T_cells subgroup, Microglia subgroup, Macrophage subgroup, Astrocyte subgroup, and Oligodendrocytes subgroup).
In addition, combining the cell subpopulations with hypercoagulability and sympathetic excitation further narrowed down the functionally relevant cell subpopulations.
Project description:Allergic and irritant contact dermatitis induce different immunological cascades, involving a plethora of immune cells as well as keratinocytes. Ninety-six patients were investigated using mRNA microarray experiments, of which 88 passed our QC criteria. Patients were topically exposed to allergens (nickel (Ni), epoxy resin (EP) and methylchloroisothiazolinone (CM)) or irritants (sodium lauryl sulfate (SL) and nonanoic acid (NO)) for 48 hours, or were left untreated.
Project description:Background: No-shows at medical appointments have significant adverse effects on healthcare systems and their clients. Using machine learning to predict no-shows allows managers to implement strategies such as overbooking and reminders targeting the patients most likely to miss appointments, optimizing the use of resources. Methods: In this study, we propose a detailed analytical framework for predicting no-shows while addressing imbalanced datasets. The framework includes a novel use of z-fold cross-validation performed twice during the modeling process to improve model robustness and generalization. We also introduce Symbolic Regression (SR) as a classification algorithm and Instance Hardness Threshold (IHT) as a resampling technique, and compare their performance with that of other classification algorithms, such as K-Nearest Neighbors (KNN) and Support Vector Machine (SVM), and other resampling techniques, such as Random Undersampling (RUS), Synthetic Minority Oversampling Technique (SMOTE), and NearMiss-1. We validated the framework using two attendance datasets from Brazilian hospitals with no-show rates of 6.65% and 19.03%. Results: From the academic perspective, our study is the first to propose using SR and IHT to predict patient no-shows. Our findings indicate that SR and IHT performed better than the other techniques, particularly IHT, which excelled when combined with all classification algorithms and led to low variability in the performance metrics. Our results also exceeded the sensitivity values reported in the literature, with values above 0.94 for both datasets. Conclusion: This is the first study to use SR and IHT methods to predict patient no-shows and the first to propose performing z-fold cross-validation twice.
Our study highlights the importance of not relying on only a few validation runs for imbalanced datasets, as doing so may lead to biased results and an inadequate analysis of the generalization and stability of the models obtained during the training stage.
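For readers unfamiliar with the resampling baselines mentioned above, the following is a minimal sketch of random undersampling (RUS), the simplest of the techniques the abstract compares against. It is an illustration only and is not the authors' framework or their IHT method; the function name `random_undersample` and its `seed` parameter are hypothetical.

```python
import random


def random_undersample(X, y, seed=0):
    """Randomly drop samples from larger classes until every class has as
    many samples as the smallest class. Returns (X_resampled, y_resampled)
    with the original sample order preserved."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)

    # Keep n_min randomly chosen samples from each class.
    n_min = min(len(indices) for indices in by_class.values())
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, n_min))
    keep.sort()

    return [X[i] for i in keep], [y[i] for i in keep]
```

In a cross-validated workflow, resampling of this kind is normally applied to the training folds only, so that each validation fold keeps the original class imbalance; varying `seed` across repeated runs is one way to examine the variability the abstract warns about.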
Project description:We propose a novel method that predicts binding of G-protein coupled receptors (GPCRs) and ligands. The proposed method uses hub and cycle structures of ligands and amino acid motif sequences of GPCRs, rather than the 3D structure of a receptor or similarity of receptors or ligands. The experimental results show that these new features can be effective in predicting GPCR-ligand binding (average area under the curve [AUC] of 0.944), because they are thought to include hidden properties of good ligand-receptor binding. Using the proposed method, we were able to identify novel ligand-GPCR bindings, some of which are supported by several studies.
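The AUC figure quoted above can be read as the probability that a randomly chosen binding pair receives a higher score than a randomly chosen non-binding pair. A minimal sketch of that pairwise definition follows; the function name `roc_auc` and the toy labels and scores are assumptions for illustration and are unrelated to the study's data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) definition:
    the fraction of positive/negative pairs in which the positive sample has
    the higher score, counting ties as half. Labels must be 0 or 1."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ranked correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

This quadratic-time form is meant for clarity; library implementations compute the same quantity from sorted scores.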