Project description:(1) Background: Future missions to potentially habitable places in the Solar System require biochemistry-independent methods for detecting potential alien life forms. The technology was not advanced enough for onboard machine analysis of microscopic observations to be performed in past missions, but recent increases in computational power make the use of automated in-situ analyses feasible. (2) Methods: Here, we present a semi-automated experimental setup, capable of distinguishing the movement of abiotic particles due to Brownian motion from the motility behavior of the bacteria Pseudoalteromonas haloplanktis, Planococcus halocryophilus, Bacillus subtilis, and Escherichia coli. Supervised machine learning algorithms were also used to specifically identify these species based on their characteristic motility behavior. (3) Results: While we were able to distinguish microbial motility from the abiotic movements due to Brownian motion with an accuracy exceeding 99%, the accuracy of the automated identification rates for the selected species does not exceed 82%. (4) Conclusions: Motility is an excellent biosignature, which can be used as a tool for upcoming life-detection missions. This study serves as the basis for the further development of a microscopic life recognition system for upcoming missions to Mars or the ocean worlds of the outer Solar System.
Project description:BACKGROUND: Systemic sclerosis (SSc) is a rare disease with studies limited by small sample sizes. Electronic health records (EHRs) represent a powerful tool to study patients with rare diseases such as SSc, but validated methods are needed. We developed and validated EHR-based algorithms that incorporate billing codes and clinical data to identify SSc patients in the EHR. METHODS: We used a de-identified EHR with over 3 million subjects and identified 1899 potential SSc subjects with at least 1 count of the SSc ICD-9 (710.1) or ICD-10-CM (M34*) codes. We randomly selected 200 as a training set for chart review. A subject was a case if diagnosed with SSc by a rheumatologist, dermatologist, or pulmonologist. We selected the following algorithm components based on clinical knowledge and available data: SSc ICD-9 and ICD-10-CM codes, positive antinuclear antibody (ANA) (titer ≥ 1:80), and a keyword of Raynaud's phenomenon (RP). We applied both rule-based and machine learning techniques for algorithm development. Positive predictive values (PPVs), sensitivities, and F-scores (which account for PPVs and sensitivities) were calculated for the algorithms. RESULTS: PPVs were low for algorithms using only 1 count of the SSc ICD-9 code. As code counts increased, the PPVs increased. PPVs were higher for algorithms using ICD-10-CM codes versus the ICD-9 code. Adding a positive ANA and the RP keyword increased the PPVs of algorithms that used only ICD billing codes. Algorithms using ≥ 3 or ≥ 4 counts of the SSc ICD-9 or ICD-10-CM codes and ANA positivity had the highest PPV at 100% but a low sensitivity at 50%. The algorithm with the highest F-score of 91% was ≥ 4 counts of the ICD-9 or ICD-10-CM codes with an internally validated PPV of 90%. A machine learning method using random forests yielded an algorithm with a PPV of 84%, sensitivity of 92%, and F-score of 88%. The most important feature was the RP keyword.
CONCLUSIONS: Algorithms using only ICD-9 codes did not perform well in identifying SSc patients. The highest performing algorithms incorporated clinical data with billing codes. EHR-based algorithms can identify SSc patients across a healthcare system, enabling researchers to examine important outcomes.
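As a hedged illustration of how an F-score combines PPV and sensitivity, the sketch below assumes the balanced F1 form (the harmonic mean of the two); the abstract does not state which F-score variant was used. The function name `f_score` is hypothetical. Under that assumption, the reported random forest values (PPV 84%, sensitivity 92%) reproduce the reported F-score of 88%.

```python
def f_score(ppv: float, sensitivity: float) -> float:
    """Harmonic mean of PPV (precision) and sensitivity (recall).

    Assumption: the balanced F1 form; the abstract does not name the variant.
    """
    if ppv + sensitivity == 0:
        return 0.0
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# Values reported in the abstract for the random forest algorithm.
random_forest_f = f_score(ppv=0.84, sensitivity=0.92)
print(round(random_forest_f, 2))  # 0.88

# A PPV of 100% with a sensitivity of 50% is penalized by the harmonic mean.
print(round(f_score(ppv=1.0, sensitivity=0.5), 2))  # 0.67
```

The harmonic mean explains why an algorithm with perfect PPV but low sensitivity does not score highest.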
Project description:Microsporidia are unicellular fungi that are obligate endoparasites. Although nematodes are one of the most abundant and diverse animal groups, the only confirmed report of microsporidian infection was that of the "nematode killer" (Nematocida parisii). N. parisii was isolated from a wild Caenorhabditis sp. and causes an acute and lethal intestinal infection in a lab strain of Caenorhabditis elegans. We set out to characterize a microsporidian infection in a wild nematode to determine whether the infection pattern of N. parisii in the lab is typical of microsporidian infections in nematodes. We describe a novel microsporidian species named Sporanauta perivermis (marine spore of roundworms) and characterize its infection in its natural host, the free-living marine nematode Odontophora rectangula. S. perivermis is not closely related to N. parisii and differs strikingly in all aspects of infection. Examination by transmission electron microscopy (TEM) revealed that the infection was localized in the hypodermal and muscle tissues only and did not involve the intestines. Fluorescent in situ hybridization (FISH) confirmed infection in the muscle and hypodermis, and surprisingly, it also revealed that the parasite infects O. rectangula eggs, suggesting a vertical mode of transmission. Our observations highlight the importance of studying parasites in their natural hosts and indicate that not all nematode-infecting microsporidia are "nematode killers"; instead, microsporidiosis can be more versatile and chronic in the wild.
Project description:Microbial communities play key roles in ocean ecosystems through regulation of biogeochemical processes such as carbon and nutrient cycling, food web dynamics, and gut microbiomes of invertebrates, fish, reptiles, and mammals. Assessments of marine microbial diversity are therefore critical to understanding spatiotemporal variations in microbial community structure and function in ocean ecosystems. With recent advances in DNA shotgun sequencing for metagenome samples and computational analysis, it is now possible to access the taxonomic and genomic content of ocean microbial communities to study their structural patterns, diversity, and functional potential. However, existing taxonomic classification tools depend upon manually curated phylogenetic trees, which can create inaccuracies in metagenomes from less well-characterized communities, such as from ocean water. Herein, we explore the utility of deep learning tools (DeepMicrobes and a novel Residual Network architecture) that leverage natural language processing and convolutional neural network architectures to map input sequence data (k-mers) to output labels (taxonomic groups) without reliance on a curated taxonomic tree. We trained both models using metagenomic reads simulated from marine microbial genomes in the MarRef database. The performance of both models (accuracy, precision, and percent microbe predicted) was compared with the standard taxonomic classification tool Kraken2 using 10 complex metagenomic data sets simulated from MarRef. Our results demonstrate that time, compute power, and microbial genomic diversity still pose challenges for machine learning (ML). Moreover, our results suggest that high genome coverage and rectification of class imbalance are prerequisites for a well-trained model, and therefore should be a major consideration in future ML work.
IMPORTANCE Taxonomic profiling of microbial communities is essential to model microbial interactions and inform habitat conservation. This work develops approaches in constructing training/testing data sets from publicly available marine metagenomes and evaluates the performance of machine learning (ML) approaches in read-based taxonomic classification of marine metagenomes. Predictions from two models are used to test accuracy in metagenomic classification and to guide improvements in ML approaches. Our study provides insights on the methods, results, and challenges of deep learning on marine microbial metagenomic data sets. Future machine learning approaches can be improved by rectifying genome coverage and class imbalance in the training data sets, developing alternative models, and increasing the accessibility of computational resources for model training and refinement.
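To make the idea of k-mer input concrete, here is a minimal sketch of splitting a sequencing read into overlapping k-mers, the kind of token a sequence model could consume. The function names, the choice of k, and the example read are illustrative assumptions; the abstract does not describe the exact encoding used by DeepMicrobes or the Residual Network.

```python
from collections import Counter


def kmers(read: str, k: int) -> list[str]:
    """Return all overlapping substrings of length k, in read order."""
    read = read.upper()
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def kmer_counts(read: str, k: int) -> Counter:
    """Count how often each k-mer occurs in the read."""
    return Counter(kmers(read, k))


# Hypothetical six-base read, tokenized with k = 3.
print(kmers("ACGTAC", 3))  # ['ACG', 'CGT', 'GTA', 'TAC']
```

A read shorter than k yields an empty list, so real pipelines usually filter or pad such reads before training.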
Project description:Background: Cluster randomized trials (CRTs) are becoming an increasingly important design. However, authors of CRTs do not always adhere to requirements to explicitly identify the design as cluster randomized in titles and abstracts, making retrieval from bibliographic databases difficult. Machine learning algorithms may improve their identification and retrieval. Therefore, we aimed to develop machine learning algorithms that accurately determine whether a bibliographic citation is a CRT report. Methods: We trained, internally validated, and externally validated two convolutional neural networks and one support vector machine (SVM) algorithm to predict whether a citation is a CRT report or not. We exclusively used the information in an article citation, including the title, abstract, keywords, and subject headings. The algorithms' output was a probability from 0 to 1. We assessed algorithm performance using the area under the receiver operating characteristic curve (AUC). Each algorithm's performance was evaluated individually and together as an ensemble. We randomly selected 5000 from 87,633 citations to train and internally validate our algorithms. Of the 5000 selected citations, 589 (12%) were confirmed CRT reports. We then externally validated our algorithms on an independent set of 1916 randomized trial citations, with 665 (35%) confirmed CRT reports. Results: In internal validation, the ensemble algorithm discriminated best for identifying CRT reports with an AUC of 98.6% (95% confidence interval: 97.8%, 99.4%), sensitivity of 97.7% (94.3%, 100%), and specificity of 85.0% (81.8%, 88.1%). In external validation, the ensemble algorithm had an AUC of 97.8% (97.0%, 98.5%), sensitivity of 97.6% (96.4%, 98.6%), and specificity of 78.2% (75.9%, 80.4%).
All three individual algorithms performed well, but less so than the ensemble. Conclusions: We successfully developed high-performance algorithms that identified whether a citation was a CRT report with high sensitivity and moderately high specificity. We provide open-source software to facilitate the use of our algorithms in practice.
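The sketch below illustrates, under stated assumptions, how per-citation probabilities from several classifiers might be combined and scored by AUC. The abstract does not say how the ensemble was formed; an unweighted mean is assumed here purely for illustration. The function names and all numbers are made up and are not results from the study.

```python
def auc(labels: list[int], scores: list[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ensemble_mean(*model_scores: list[float]) -> list[float]:
    """Assumed combination rule: unweighted mean of each model's probability."""
    return [sum(column) / len(column) for column in zip(*model_scores)]


# Toy data: 1 = CRT report, 0 = not a CRT report (hypothetical values).
labels = [1, 0, 1, 0]
cnn_a = [0.9, 0.4, 0.6, 0.7]
cnn_b = [0.8, 0.3, 0.7, 0.2]
svm = [0.7, 0.6, 0.5, 0.1]

print(auc(labels, cnn_a))                             # 0.75
print(auc(labels, ensemble_mean(cnn_a, cnn_b, svm)))  # 1.0
```

In this toy case averaging corrects a ranking error made by one model, which is one common motivation for ensembling.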
Project description:Diabetic nephropathy (DN), a multifaceted disease with various contributing factors, presents challenges in understanding its underlying causes. Uncovering biomarkers linked to this condition can shed light on its pathogenesis and support the creation of new diagnostic and treatment methods. Gene expression data were sourced from accessible public databases, and Weighted Gene Co-expression Network Analysis (WGCNA) was employed to pinpoint gene co-expression modules relevant to DN. Subsequently, various machine learning techniques, such as random forest, least absolute shrinkage and selection operator (LASSO) regression, and support vector machine-recursive feature elimination (SVM-RFE), were utilized for distinguishing DN cases from controls using the identified gene modules. Additionally, functional enrichment analyses were conducted to explore the biological roles of these genes. Our analysis revealed 131 genes showing distinct expression patterns between controlled and uncontrolled groups. During the integrated WGCNA, we identified 61 co-expressed genes encompassing both categories. The enrichment analysis highlighted involvement in various immune responses and complex activities. Random forest, LASSO, and SVM-RFE were applied to pinpoint key hub genes, leading to the identification of VWF and DNASE1L3. In the context of DN, they demonstrated significant consistency in both expression and function. Our research uncovered potential biomarkers for DN through the application of WGCNA and various machine learning methods. The results indicate that these 2 central genes could serve as innovative diagnostic indicators and therapeutic targets for this disease. This discovery offers fresh perspectives on the development of DN and could contribute to the advancement of new diagnostic and treatment approaches.
Project description:Background and objective: Essential tremor (ET) is a common movement syndrome, and its pathogenic mechanisms, especially the brain network topological changes in ET, are still unclear. The combination of graph theory (GT) analysis with machine learning (ML) algorithms provides a promising way to distinguish ET from healthy controls (HCs) at the individual level, and further helps to reveal the topological pathogenesis of ET. Methods: Resting-state functional magnetic resonance imaging (fMRI) data were obtained from 101 ET patients and 105 HCs. The topological properties were analyzed using GT analysis, and the topological metrics under every single threshold and the area under the curve (AUC) of all thresholds were used as features. Then a Mann-Whitney U-test and the least absolute shrinkage and selection operator (LASSO) were used for feature dimensionality reduction. Four ML algorithms were adopted to distinguish ET from HCs. The mean accuracy, mean balanced accuracy, mean sensitivity, mean specificity, and mean AUC were used to evaluate the classification performance. In addition, correlation analysis was carried out between selected topological features and clinical tremor characteristics. Results: All classifiers achieved good classification performance. The mean accuracy of support vector machine (SVM), logistic regression (LR), random forest (RF), and naïve Bayes (NB) was 84.65, 85.03, 84.85, and 76.31%, respectively. The LR classifier achieved the best classification performance with 85.03% mean accuracy, 83.97% sensitivity, and an AUC of 0.924. Correlation analysis showed that 2 topological features correlated negatively and 1 positively with tremor severity. Conclusion: These results demonstrate that combining topological metrics with ML algorithms can not only achieve high classification accuracy for discriminating ET from HCs but also help to reveal the potential topological pathogenesis of ET.
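The evaluation metrics named above can be illustrated with a short sketch using their standard definitions from confusion-matrix counts. The function names and the counts below are hypothetical and chosen only to show how accuracy and balanced accuracy diverge when classes are unequal in size; they are not data from the study.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true cases (e.g., ET) correctly identified."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of controls (e.g., HCs) correctly identified."""
    return tn / (tn + fp)


def accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Fraction of all subjects classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity and specificity, insensitive to class sizes."""
    return (sensitivity(tp, fn) + specificity(tn, fp)) / 2


# Hypothetical counts: 10 cases and 30 controls.
counts = dict(tp=8, fn=2, tn=27, fp=3)
print(round(accuracy(**counts), 3))           # 0.875
print(round(balanced_accuracy(**counts), 3))  # 0.85
```

With roughly equal group sizes, as in the 101 versus 105 subjects reported here, the two measures would be expected to lie close together.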
Project description:Bacteria within the gut microbiota possess the ability to metabolize a wide array of human drugs, foods, and toxins, but the responsible enzymes for these chemical events remain largely uncharacterized due to the time-consuming nature of current experimental approaches. Attempts have been made in the past to computationally predict which bacterial species and enzymes are responsible for chemical transformations in the gut environment, but with low accuracy due to minimal chemical representation and sequence similarity search schemes. Here, we present an in silico approach that employs chemical and protein Similarity algorithms that Identify MicrobioMe Enzymatic Reactions (SIMMER). We show that SIMMER accurately predicts the responsible species and enzymes for a queried reaction, unlike previous methods. We demonstrate SIMMER use cases in the context of drug metabolism by predicting previously uncharacterized enzymes for 88 drug transformations known to occur in the human gut. We validate these predictions on external datasets and provide an in vitro validation of SIMMER's predictions for metabolism of methotrexate, an anti-arthritic drug. After demonstrating its utility and accuracy, we made SIMMER available as both a command-line and web tool, with flexible input and output options for determining chemical transformations within the human gut. We present SIMMER as a computational addition to the microbiome researcher's toolbox, enabling them to make informed hypotheses before embarking on the lengthy laboratory experiments required to characterize novel bacterial enzymes that can alter human ingested compounds.
Project description:The diversity of free-living nematodes on the beaches of two Antarctic islands, King George Island and Deception Island, was investigated. We used morphological and molecular (LSU and two fragments of SSU sequences) approaches to evaluate 236 nematodes. Specimens were assigned to at least genus level using morphology and were assessed for the presence of cryptic speciation. The following genera were identified: Halomonhystera, Litoditis, Enoploides, Chromadorita, Theristus, Oncholaimus, Viscosia, Gammanema, Bathylaimus, Choanolaimus, and Paracanthonchus, along with specimens from the families Anticomidae and Linhomoeidae. Cryptic speciation was identified within the genera Halomonhystera and Litoditis. All of the cryptic species identified live sympatrically. The two cryptic species of Halomonhystera exhibited no significant morphological differences. However, Litoditis species 2 was significantly larger than Litoditis species 1. The utility of molecular data in confirming the identifications of some of the morphologically more challenging families of nematodes was demonstrated. Regarding which molecular sequences to use for the identification of free-living nematodes, the SSU sequences were more variable than the LSU sequences and thus provided more resolution in the identification of cryptic speciation. Finally, despite the considerable time and effort required to assemble genetic and morphological data, the resulting advance in our understanding of the diversity and ecology of free-living marine nematodes makes that effort worthwhile.
Project description:The genus Hebeloma is notoriously difficult when it comes to species determination. Historically, many dichotomous keys have been published and used with varying success. Over the last 20 years the authors have built a database of Hebeloma collections containing not only metadata but also parametrized morphological descriptions, where for about a third of the cases micromorphological characters have been analysed and are included, as well as DNA sequences for almost every collection. The database now has about 9000 collections, including nearly every type collection worldwide, and represents over 120 different taxa. Almost every collection has been analysed and identified to species using a combination of the available molecular and morphological data in addition to locality and habitat information. Based on these data, an Artificial Intelligence (AI) machine-learning species identifier has been developed that takes as input locality data and a small number of the morphological parameters. Using a random test set of more than 600 collections from the database, not utilized within the set of collections used to train the identifier, the species identifier was able to identify 77% correctly with its highest probabilistic match, 96% within its three most likely determinations, and over 99% of collections within its five most likely determinations.
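The figures for the best match, the three most likely, and the five most likely determinations correspond to what is commonly called top-k accuracy. A minimal sketch of that calculation follows; the function name, the placeholder labels, and the rankings are hypothetical and are not taken from the Hebeloma database.

```python
def top_k_accuracy(true_labels: list[str],
                   ranked_predictions: list[list[str]],
                   k: int) -> float:
    """Fraction of items whose true label is among the k highest-ranked predictions."""
    hits = sum(
        1
        for truth, ranked in zip(true_labels, ranked_predictions)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Placeholder species labels; each inner list is ordered from most to least likely.
truth = ["A", "B", "C", "D"]
ranked = [
    ["A", "B", "C"],
    ["C", "B", "A"],
    ["A", "C", "D"],
    ["A", "B", "C"],
]

print(top_k_accuracy(truth, ranked, k=1))  # 0.25
print(top_k_accuracy(truth, ranked, k=2))  # 0.75
```

Top-k accuracy can only stay equal or rise as k grows, which matches the progression from 77% to 96% to over 99% reported above.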