Project description:Gene expression profiles were generated from 199 primary breast cancer patients. Samples 1-176 were used in another study, GEO Series GSE22820, and form the training data set in this study. Sample numbers 200-222 form a validation set. This data is used to model a machine learning classifier for Estrogen Receptor Status. RNA was isolated from 199 primary breast cancer patients. A machine learning classifier was built to predict ER status using only three gene features.
Project description:Protein-protein interactions (PPIs) may represent one of the next major classes of therapeutic targets. So far, only a minute fraction of the estimated 650,000 PPIs that comprise the human interactome are known, and only a tiny number of complexes have been drugged. Such intricate biological systems cannot be cost-efficiently tackled using conventional high-throughput screening methods. Rather, the time has come to design new strategies that maximize the chance of hit identification through a rationalization of the PPI inhibitor chemical space and the design of PPI-focused compound libraries (global or target-specific). Here, we train machine-learning-based models, mainly decision trees, using a dataset of known PPI inhibitors and of regular drugs in order to determine a global physico-chemical profile for putative PPI inhibitors. This statistical analysis reveals two important molecular descriptors for PPI inhibitors, characterizing specific molecular shapes and the presence of a privileged number of aromatic bonds. The best model has been implemented in a computer program, PPI-HitProfiler, that can extract from any drug-like compound collection a focused chemical library enriched in putative PPI inhibitors. Our PPI inhibitor profiler is challenged on the experimental screening results of 11 different PPIs, including the p53/MDM2 interaction screened within our own CDithem platform, which, in addition to validating our concept, led to the identification of 4 novel p53/MDM2 inhibitors. Collectively, our tool shows robust behavior on the 11 experimental datasets, correctly profiling 70% of the experimentally identified hits while removing 52% of the inactive compounds from the initial compound collections. We strongly believe that this new tool can be used as a global PPI inhibitor profiler prior to screening assays to reduce the size of the compound collections to be experimentally screened while keeping most of the true PPI inhibitors.
PPI-HitProfiler is freely available on request from our CDithem platform website, www.CDithem.com.
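To make the reported figures concrete, the following minimal sketch works out what keeping 70% of hits while removing 52% of inactives would mean for the hit rate of a screening collection. The library size and number of true hits are hypothetical values chosen for illustration, and the function name `enrichment` is not part of PPI-HitProfiler; only the two percentages come from the description above.

```python
def enrichment(n_compounds, n_hits, hit_recall=0.70, inactive_removal=0.52):
    """Hit rate before and after filtering, given the two reported percentages.

    n_compounds and n_hits describe a hypothetical screening collection.
    """
    n_inactive = n_compounds - n_hits
    kept_hits = hit_recall * n_hits                      # hits that pass the filter
    kept_inactive = (1.0 - inactive_removal) * n_inactive  # inactives that pass
    kept_total = kept_hits + kept_inactive
    rate_before = n_hits / n_compounds
    rate_after = kept_hits / kept_total
    return {
        "kept_total": kept_total,
        "rate_before": rate_before,
        "rate_after": rate_after,
        "enrichment": rate_after / rate_before,
    }


# Hypothetical collection: 10,000 compounds containing 100 true hits.
result = enrichment(10_000, 100)
print(result)
```

Under these assumptions the focused library is roughly half the size of the original collection, and the hit rate rises by a factor close to 0.70 / 0.48 whenever true hits are rare.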
Project description:Sports sciences are increasingly data-intensive, since computational tools can extract information from large amounts of data and derive insights from athlete performances during competition. This paper addresses a performance prediction problem in soccer, a popular team sport played by two teams competing against each other on the same field. In a soccer game, teams score points by placing the ball into the opponent's goal, and the winner is the team with the highest count of goals. Retaining possession of the ball is one key to success, but it is not enough, since a team needs to score to achieve victory, which requires an offensive toward the opponent's goal. The focus of this work is to determine whether analyzing the first five seconds after one of the teams takes control of the ball provides enough information to determine whether the ball will reach the final quarter of the soccer field, therefore creating a goal-scoring chance. By doing so, we can further investigate which conditions increase strategic leverage. Our approach comprises modeling players' interactions as graph structures and extracting metrics from these structures. These metrics, when combined, form time series that we encode in two-dimensional representations of visual rhythms, allowing feature extraction through deep convolutional networks, coupled with a classifier to predict the outcome (whether the final quarter of the field is reached). The results indicate that offensive play near the adversary penalty area can be predicted by looking at the first five seconds. Finally, the explainability of our models reveals the main metrics along with their contributions to the final inference result, which corroborates other studies found in the literature for soccer match analysis.
Project description:Antibodies are capable of potently and specifically binding individual antigens and, in some cases, disrupting their functions. The key challenge in generating antibody-based inhibitors is the lack of fundamental information relating sequences of antibodies to their unique properties as inhibitors. We develop a pipeline, Antibody Sequence Analysis Pipeline using Statistical testing and Machine Learning (ASAP-SML), to identify features that distinguish one set of antibody sequences from antibody sequences in a reference set. The pipeline extracts feature fingerprints from sequences. The fingerprints represent germline, CDR canonical structure, isoelectric point and frequent positional motifs. Machine learning and statistical significance testing techniques are applied to antibody sequences and extracted feature fingerprints to identify distinguishing feature values and combinations thereof. To demonstrate how it works, we applied the pipeline to sets of antibody sequences known to bind or inhibit the activities of matrix metalloproteinases (MMPs), a family of zinc-dependent enzymes that promote cancer progression and undesired inflammation under pathological conditions, and compared them against reference datasets that do not bind or inhibit MMPs. ASAP-SML identifies features and combinations of feature values found in the MMP-targeting sets that are distinct from those in the reference sets.
Project description:Introduction: The advent of RNA sequencing (RNA-Seq) has significantly advanced our understanding of the transcriptomic landscape, revealing intricate gene expression patterns across biological states and conditions. However, the complexity and volume of RNA-Seq data pose challenges in identifying differentially expressed genes (DEGs), which are critical for understanding the molecular basis of diseases like cancer. Methods: We introduce a novel Machine Learning-Enhanced Genomic Data Analysis Pipeline (ML-GAP) that incorporates autoencoders and innovative data augmentation strategies, notably the MixUp method, to overcome these challenges. By creating synthetic training examples through a linear combination of input pairs and their labels, MixUp significantly enhances the model's ability to generalize from the training data to unseen examples. Results: Our results demonstrate the ML-GAP's superiority in accuracy, efficiency, and insights, particularly crediting the MixUp method for its substantial contribution to the pipeline's effectiveness, greatly advancing genomic data analysis and setting a new standard in the field. Discussion: This, in turn, suggests that ML-GAP not only has the potential to detect DEGs more accurately but also offers new avenues for therapeutic intervention and research. By integrating explainable artificial intelligence (XAI) techniques, ML-GAP ensures a transparent and interpretable analysis, highlighting the significance of identified genetic markers.
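The MixUp step described above, a linear combination of input pairs and their labels, can be sketched as follows. This is a generic illustration of MixUp using only the Python standard library, not the ML-GAP implementation: the function names, the `alpha` value, the random pairing scheme, and the toy expression vectors are all assumptions made for the example.

```python
import random


def mixup_pair(x1, y1, x2, y2, lam):
    """Blend two (features, one-hot label) examples with mixing weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


def mixup_batch(xs, ys, alpha=0.2, seed=0):
    """Mix each example with a randomly chosen partner from the same batch.

    The weight is drawn from Beta(alpha, alpha), as is common for MixUp;
    alpha=0.2 is an illustrative choice, not a value taken from the source.
    """
    rng = random.Random(seed)
    partners = list(range(len(xs)))
    rng.shuffle(partners)
    mixed_x, mixed_y = [], []
    for i, j in enumerate(partners):
        lam = rng.betavariate(alpha, alpha)
        x, y = mixup_pair(xs[i], ys[i], xs[j], ys[j], lam)
        mixed_x.append(x)
        mixed_y.append(y)
    return mixed_x, mixed_y


# Toy data: three samples, two expression features, two classes (one-hot).
xs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
ys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
mixed_x, mixed_y = mixup_batch(xs, ys)
```

Because the same weight is applied to the features and the labels, each synthetic label remains a valid probability vector that reflects how much of each parent sample went into the synthetic input.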
Project description:Nodules form on plant roots through the symbiotic relationship between soybean (Glycine max L. Merr.) roots and bacteria (Bradyrhizobium japonicum) and are an important structure where atmospheric nitrogen (N2) is fixed into bioavailable ammonia (NH3) for plant growth and development. Nodule quantification on soybean roots is a laborious and tedious task; therefore, assessment is frequently done on a numerical scale that allows for rapid phenotyping, but is less informative and suffers from subjectivity. We report the Soybean Nodule Acquisition Pipeline (SNAP) for nodule quantification that combines RetinaNet and UNet deep learning architectures for object (i.e., nodule) detection and segmentation. SNAP was built using data from 691 unique roots from diverse soybean genotypes, vegetative growth stages, and field locations and has a good model fit (R² = 0.99). SNAP reduces the human labor and inconsistencies of counting nodules, while acquiring quantifiable traits related to nodule growth, location, and distribution on roots. The ability of SNAP to phenotype nodules on soybean roots at a higher throughput enables researchers to assess genetic and environmental factors, and their interactions, on nodulation from an early development stage. The application of SNAP in research and breeding pipelines may lead to greater nitrogen use efficiency in cultivars of soybean and other legume species, as well as enhanced insight into the plant-Bradyrhizobium relationship.
Project description:We focused on building models that incorporated transcription factor (TF)-DNA interaction data for 12 members of the Auxin Response Factor (ARF) family from soybean as assessed by DNA Affinity Purification and sequencing (DAP-seq).
Project description:Unusual blood clots can cause serious health problems, such as lung embolism, stroke, and heart attack. Inhibiting thrombin activity has been adopted as an effective strategy for preventing blood clots. In this study, we explored a computational method for designing peptide inhibitors of human thrombin as therapeutic peptides to prevent platelet aggregation. Random peptides and their 3-dimensional structures were generated to build a virtual peptide library. The generated peptides were docked into the binding pocket of human thrombin. The designed strong-binding peptides were aligned with the native binder in a comparative study, and the top 5 peptide binders displayed strong binding affinity for human thrombin. The 5 peptides were synthesized and their inhibitory activity was validated. Our results showed that the 5-mer peptides AEGYA, EVVNQ, and FASRW had inhibitory activity against thrombin, ranging from 0.53 to 4.35 μM. An in vitro anti-platelet aggregation assay suggested that these 3 peptides can inhibit platelet aggregation induced by thrombin. This study shows that computer-aided peptide inhibitor design can be a robust method for finding potential binders for thrombin, providing potential solutions for anticoagulation.
Project description:The discovery of peptide substrates for enzymes with exclusive, selective activities is a central goal in chemical biology. In this paper, we develop a hybrid computational and biochemical method to rapidly optimize peptides for specific, orthogonal biochemical functions. The method is an iterative machine learning process in which experimental data are fed into a mathematical algorithm that selects potential peptide substrates to be tested experimentally. Once they are tested, the algorithm uses the new experimental data to refine future selections. This process is repeated until a suitable set of de novo peptide substrates is discovered. We employed this technology to discover orthogonal peptide substrates for 4'-phosphopantetheinyl transferase, an enzyme class that covalently modifies proteins. In this manner, we have demonstrated that machine learning can be leveraged to guide peptide optimization for specific biochemical functions not immediately accessible by biological screening techniques, such as phage display and random mutagenesis.
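The select, test, and refine cycle described above can be sketched as a simple loop. This is only an illustration of the general idea under stated assumptions, not the authors' algorithm: the per-position scoring model, the function names, the three-letter alphabet, and `toy_assay` (a stand-in for the wet-lab measurement) are all hypothetical.

```python
import itertools
import random
from collections import defaultdict


def fit_position_scores(tested):
    """Average measured activity for every (position, residue) seen so far."""
    totals, counts = defaultdict(float), defaultdict(int)
    for peptide, activity in tested.items():
        for pos, res in enumerate(peptide):
            totals[(pos, res)] += activity
            counts[(pos, res)] += 1
    return {key: totals[key] / counts[key] for key in totals}


def predict(peptide, scores):
    """Score a peptide by summing its per-position averages (unseen pairs = 0)."""
    return sum(scores.get((pos, res), 0.0) for pos, res in enumerate(peptide))


def optimize(candidates, assay, rounds=3, batch_size=4, seed=0):
    """Alternate between testing a batch and refitting the selection model."""
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)
    tested = {}
    batch = pool[:batch_size]          # first round: random selection
    for _ in range(rounds):
        for peptide in batch:          # "experimental" step
            tested[peptide] = assay(peptide)
        scores = fit_position_scores(tested)   # refine the model with new data
        untested = [p for p in pool if p not in tested]
        untested.sort(key=lambda p: predict(p, scores), reverse=True)
        batch = untested[:batch_size]  # next round: top predicted peptides
        if not batch:
            break
    return tested


def toy_assay(peptide):
    """Hypothetical stand-in for a lab assay: rewards S at position 1, D at 2."""
    return (peptide[1] == "S") * 1.0 + (peptide[2] == "D") * 0.5


candidates = ["".join(p) for p in itertools.product("ADS", repeat=3)]
results = optimize(candidates, toy_assay)
best = max(results, key=results.get)
```

In a real campaign the assay call would be replaced by measured activities entered after each round, and the scoring model could be any learner that ranks untested peptides.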