Project description:Pancreatic ductal adenocarcinoma (PDAC) is one of the most lethal cancers owing to its rapid progression, marked potential for metastasis, and the difficulty of diagnosis [1-3]. However, no effective liquid tests other than CA19-9 are currently available for PDAC detection. Here we introduce a noninvasive detection approach that combines machine learning with untargeted and targeted serum lipidomics to establish an accurate method for detecting PDAC.
Project description:In the last decade, non-invasive prenatal diagnosis (NIPD) has emerged as an effective procedure for early detection of inherited diseases during pregnancy. This technique is based on using cell-free DNA (cfDNA) and fetal cfDNA (cffDNA) in maternal blood, and hence, has minimal risk for the mother and fetus compared with invasive techniques. NIPD is used today for identifying chromosomal abnormalities (in some instances) and for single-gene disorders (SGDs) of paternal origin. However, for SGDs of maternal origin, sensitivity poses a challenge that limits the testing to one genetic disorder at a time. Here we present a Bayesian method for the NIPD of monogenic diseases that is independent of the mode of inheritance and parental origin. Furthermore, we show that accounting for differences in the fragment length distribution of fetal- and maternal-derived cfDNA results in increased accuracy. Our model is the first to predict inherited insertions-deletions (indels). The method described can serve as a general framework for the NIPD of SGDs; this will facilitate easy integration of further improvements. One such improvement that is presented in the current study is a machine learning model that corrects errors based on patterns found in previously processed data. Overall, we show that next generation sequencing (NGS) can be used for the NIPD of a wide range of monogenic diseases, simultaneously. We believe that our study will lead to the achievement of a comprehensive NIPD for monogenic diseases. (Reprinted from Bayesian-based noninvasive prenatal diagnosis of single-gene disorders, with permission from Genome Research)
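To make the Bayesian idea in the description above concrete, here is a minimal sketch of a posterior calculation for one maternally inherited variant. It is not the model from the study: it assumes a heterozygous mother, a father homozygous for the reference allele, a known fetal fraction, and a simple binomial read-count likelihood, and it ignores fragment length and sequencing error. The function name and all parameters are hypothetical.

```python
import math

def fetal_genotype_posterior(alt_reads, total_reads, fetal_fraction,
                             prior_inherited=0.5):
    """Posterior that the fetus inherited the maternal variant allele.

    Assumed (simplified) setting: mother heterozygous, father homozygous
    reference, binomial sampling of reads from maternal plasma.
    """
    if not 0.0 < fetal_fraction < 1.0:
        raise ValueError("fetal_fraction must be strictly between 0 and 1")
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must be between 0 and total_reads")

    # Expected variant-allele fraction in plasma under each hypothesis.
    expected = {
        "inherited": 0.5,                             # fetus heterozygous
        "not_inherited": 0.5 - fetal_fraction / 2.0,  # fetus homozygous reference
    }
    priors = {
        "inherited": prior_inherited,
        "not_inherited": 1.0 - prior_inherited,
    }

    # The binomial coefficient is shared by both hypotheses, so it cancels.
    def log_likelihood(p):
        return (alt_reads * math.log(p)
                + (total_reads - alt_reads) * math.log(1.0 - p))

    log_post = {h: math.log(priors[h]) + log_likelihood(expected[h])
                for h in expected}
    shift = max(log_post.values())  # subtract the max for numerical stability
    weights = {h: math.exp(v - shift) for h, v in log_post.items()}
    norm = sum(weights.values())
    return {h: w / norm for h, w in weights.items()}

# Hypothetical counts: 1000 reads at the site, fetal fraction 10%.
balanced = fetal_genotype_posterior(500, 1000, 0.10)
depleted = fetal_genotype_posterior(450, 1000, 0.10)
```

With balanced allele counts the posterior favours inheritance of the variant, while a depletion of about half the fetal fraction favours the reference allele. The small shift between the two expected fractions illustrates why sensitivity is hard for variants of maternal origin.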
Project description:Background and Aims: RNA biomarkers derived from sloughed enterocytes would provide an ideal, non-invasive method for early detection of colorectal cancer (CRC) and precancerous adenomas. To realize this goal, a highly reliable method to isolate preserved human RNA from stool samples is needed. Here we develop a protocol to identify RNA biomarkers associated with CRC to assess the use of these biomarkers for noninvasive screening of disease. Methods: Stool samples were collected from 454 patients prior to a colonoscopy. A nucleic acid extraction protocol was developed to isolate human RNA from 330 stool samples and transcript abundances were estimated by microarray analysis. This 330-patient cohort was split into a training set of 265 individuals to develop a machine learning model and a testing set of 65 individuals to determine the model’s ability to detect colorectal neoplasms. Results: Analysis of the transcriptome from 265 individuals identified 200 transcript clusters as differentially expressed (p<0.03). These transcripts were used to build a Support Vector Machine (SVM) based model to classify 65 individuals within the testing set. This SVM algorithm attained a 95% sensitivity for precancerous adenomas and a 65% sensitivity for CRC (stage I-IV). The machine learning algorithm attained a specificity of 59% for healthy individuals and an overall accuracy of 72.3%. Conclusions: We developed an RNA-based neoplasm detection model that is sensitive for CRC and precancerous adenomas. The model allows for non-invasive assessment of tumors and could potentially be used to provide clinical guidance for individuals within the screening population for colorectal cancer.
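The cohort split and the reported metrics in the description above can be sketched as follows. This is an illustration only: the study does not state how the 265/65 split was drawn or exactly how accuracy was defined, so the random split, the label names ("adenoma", "crc", "healthy", "neoplasm"), and the toy predictions are assumptions.

```python
import random
from collections import Counter

def split_cohort(sample_ids, n_train, seed=0):
    """Shuffle sample IDs reproducibly and cut them into training and testing sets."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    return ids[:n_train], ids[n_train:]

def screening_metrics(y_true, y_pred, healthy="healthy"):
    """Per-class sensitivity, specificity for healthy samples, and overall accuracy.

    A diseased sample counts as detected when the model does not call it healthy.
    """
    totals = Counter(y_true)
    detected = Counter(t for t, p in zip(y_true, y_pred)
                       if t != healthy and p != healthy)
    true_negatives = sum(1 for t, p in zip(y_true, y_pred)
                         if t == healthy and p == healthy)
    correct = sum(1 for t, p in zip(y_true, y_pred)
                  if (t == healthy) == (p == healthy))
    sensitivity = {c: detected[c] / n for c, n in totals.items() if c != healthy}
    specificity = true_negatives / totals[healthy] if totals[healthy] else float("nan")
    return sensitivity, specificity, correct / len(y_true)

# 330 samples split into 265 for training and 65 for testing, as in the description.
train_ids, test_ids = split_cohort(range(330), n_train=265)

# Toy labels (not study data) to show how the three metrics are derived.
y_true = ["adenoma"] * 4 + ["crc"] * 4 + ["healthy"] * 4
y_pred = ["neoplasm", "neoplasm", "neoplasm", "healthy",
          "neoplasm", "neoplasm", "healthy", "healthy",
          "healthy", "healthy", "healthy", "neoplasm"]
sensitivity, specificity, accuracy = screening_metrics(y_true, y_pred)
```

Reporting sensitivity per disease class, rather than one pooled figure, matches how the description separates adenomas from CRC.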
Project description:To identify genes with cell-lineage-specific expression not accessible by experimental micro-dissection, we developed a genome-scale iterative method, in-silico nano-dissection, which leverages high-throughput functional-genomics data from tissue homogenates using a machine-learning framework. This study applied nano-dissection to chronic kidney disease (CKD) and identified transcripts specific to podocytes, key cells in the glomerular filter responsible for hereditary proteinuric syndromes and acquired CKD. The accuracy of the in-silico predictions exceeded that of predictions derived from fluorescence-tagged murine podocytes; the method also identified genes recently implicated in hereditary glomerular disease and predicted genes significantly correlated with kidney function. The nano-dissection method is broadly applicable to defining lineage specificity in many functional and disease contexts. We applied a machine-learning framework to high-throughput gene expression data from human kidney biopsy tissue homogenates and predicted novel podocyte-specific genes. The predictions were validated at the protein level using the Human Protein Atlas. Prediction accuracy was compared with that of predictions derived from an experimental approach using fluorescence-tagged murine podocytes.
Project description:The RNA polymerase II (Pol II) core promoter is the strategic site of convergence of the signals that lead to the initiation of DNA transcription, but the downstream core promoter in humans has been difficult to understand. Here we analyse the human Pol II core promoter and use machine learning to generate predictive models for the downstream core promoter region (DPR) and the TATA box. We developed a method termed HARPE (high-throughput analysis of randomized promoter elements) to create hundreds of thousands of DPR (or TATA box) variants, each with known transcriptional strength. We then analysed the HARPE data by support vector regression (SVR) to provide comprehensive models for the sequence motifs, and found that the SVR-based approach is more effective than a consensus-based method for predicting transcriptional activity. These results show that the DPR is a functionally important core promoter element that is widely used in human promoters. Notably, there appears to be a duality between the DPR and the TATA box, as many promoters contain one or the other element. More broadly, these findings show that functional DNA motifs can be identified by machine learning analysis of a comprehensive set of sequence variants.
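The contrast drawn above between a consensus-based score and a regression model over sequence variants can be illustrated with a small sketch. It shows only the two inputs involved: a one-hot encoding of each variant (the kind of feature vector a support vector regression could be trained on) and a consensus match score as the baseline. The sequences and the consensus string are placeholders, not the DPR or TATA box motifs from the study, and no SVR is fitted here.

```python
ALPHABET = "ACGT"

def one_hot(seq):
    """Flatten a DNA sequence into a 0/1 vector with four entries per position."""
    vec = []
    for base in seq.upper():
        if base not in ALPHABET:
            raise ValueError(f"unexpected base {base!r}")
        vec.extend(1 if base == a else 0 for a in ALPHABET)
    return vec

def consensus_score(seq, consensus):
    """Fraction of positions matching the consensus; 'N' matches any base."""
    if len(seq) != len(consensus):
        raise ValueError("sequence and consensus must have the same length")
    matches = sum(1 for s, c in zip(seq.upper(), consensus.upper())
                  if c == "N" or s == c)
    return matches / len(consensus)

# Placeholder variants and consensus, chosen only to exercise the functions.
variants = ["ACGTAC", "ACGTTT", "TTTTTT"]
PLACEHOLDER_CONSENSUS = "ACGNAC"

features = [one_hot(v) for v in variants]
scores = [consensus_score(v, PLACEHOLDER_CONSENSUS) for v in variants]
```

A consensus score treats every mismatch equally, whereas a regression model trained on the encoded variants and their measured transcriptional strengths can weight positions and bases differently. That is one plausible reason for the better predictions reported above.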
Project description:Detection of SARS-CoV-2 using RT–PCR and other advanced methods can achieve high accuracy. However, their application is limited in countries that lack sufficient resources to handle large-scale testing during the COVID-19 pandemic. Here, we describe a method to detect SARS-CoV-2 in nasal swabs using matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) and machine learning analysis. This approach uses equipment and expertise commonly found in clinical laboratories in developing countries. We obtained mass spectra from a total of 362 samples (211 SARS-CoV-2-positive and 151 negative by RT–PCR) without prior sample preparation from three different laboratories. We tested two feature selection methods and six machine learning approaches to identify the top performing analysis approaches and determine the accuracy of SARS-CoV-2 detection. The support vector machine model provided the highest accuracy (93.9%), with 7% false positives and 5% false negatives. Our results suggest that MALDI-MS and machine learning analysis can be used to reliably detect SARS-CoV-2 in nasal swab samples.
Project description:Development of a novel machine-learning-guided circulating tumor DNA (ctDNA) detection platform for liquid biopsy detection and therapeutic monitoring of solid tumors in several clinical contexts. WGS alignments from our study are included.