Comparing a Query Compound with Drug Target Classes Using 3D-Chemical Similarity.
ABSTRACT: 3D similarity is useful in predicting the profiles of unprecedented molecular frameworks that are 2D dissimilar to known compounds. When comparing pairs of compounds, the 3D similarity of a pair depends on conformational sampling, the alignment method, the chosen descriptors, and the similarity coefficients. In addition to these four factors, 3D chemocentric target prediction of an unknown compound requires compound-target associations, which replace compound-to-compound comparisons with compound-to-target comparisons. In this study, quantitative comparison of query compounds to target classes (one-to-group) was achieved via two types of 3D similarity distributions for the respective target class, with parameter optimization for the fitting models: (1) maximum likelihood (ML) estimation of queries, and (2) the Gaussian mixture model (GMM) of target classes. While the Jaccard-Tanimoto similarity of query-to-ligand pairs with 3D structures (sampled multi-conformers) can be transformed into a query distribution using ML estimation, the ligand pair similarity within each target class can be transformed into a representative distribution of that target class through a GMM, whose parameters are estimated via the expectation-maximization (EM) algorithm. To quantify how well a query ligand discriminates between target classes, the Kullback-Leibler (K-L) divergence of each query was calculated and compared between targets. The 3D similarity-based K-L divergence, together with the probability and the feasibility index (Fm), showed discriminative power with regard to some query-class associations. The K-L divergence of 3D similarity distributions can complement (1) the rank of the 3D similarity score or (2) the p-value of one 3D similarity distribution when predicting the target of unprecedented drug scaffolds.
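The query-to-class comparison described in this abstract can be illustrated with a minimal sketch. It assumes a single Gaussian fitted to the query-to-ligand similarity scores by maximum likelihood, a target-class mixture whose weights, means, and standard deviations are simply given (in the study they would come from EM fitting), and a numerically integrated K-L divergence taken in the direction KL(query || class). All scores, mixture parameters, function names, and the choice of direction are hypothetical and are not taken from the study.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_gaussian_ml(scores):
    """Maximum likelihood estimates (mean, std) for a single Gaussian."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n  # ML variance divides by n
    return mean, math.sqrt(var)


def mixture_pdf(x, components):
    """Density of a Gaussian mixture given as (weight, mean, std) tuples."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)


def kl_divergence(p, q, lo=-1.0, hi=2.0, steps=6000):
    """Midpoint-rule estimate of KL(p || q) over [lo, hi].

    Grid points where either density underflows to zero are skipped,
    which is an approximation made for this sketch.
    """
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        px, qx = p(x), q(x)
        if px > 0.0 and qx > 0.0:
            total += px * math.log(px / qx) * dx
    return total


# Hypothetical query-to-ligand similarity scores for one query compound.
query_scores = [0.62, 0.58, 0.66, 0.60, 0.64]
q_mean, q_std = fit_gaussian_ml(query_scores)

# Hypothetical mixture parameters for two target classes: (weight, mean, std).
class_a = [(0.7, 0.60, 0.08), (0.3, 0.35, 0.10)]
class_b = [(1.0, 0.30, 0.08)]


def query_pdf(x):
    return gaussian_pdf(x, q_mean, q_std)


kl_a = kl_divergence(query_pdf, lambda x: mixture_pdf(x, class_a))
kl_b = kl_divergence(query_pdf, lambda x: mixture_pdf(x, class_b))
print(f"KL(query || class A) = {kl_a:.3f}")
print(f"KL(query || class B) = {kl_b:.3f}")
```

In this toy setting the query distribution lies closer to class A than to class B, so its divergence from class A is the smaller of the two. How such values would be combined with the probability and the feasibility index is not covered by this sketch.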
Project description:BACKGROUND: Tools that explore large compound databases in search of analogs of query molecules provide strategically important support in drug discovery by helping to identify available analogs of any given reference or hit compound through ligand-based virtual screening (LBVS). We recently showed that large databases can be formatted for very fast searching with various 2D-fingerprints using the city-block distance as the similarity measure, in particular a 2D-atom pair fingerprint (APfp) and the related category extended atom pair fingerprint (Xfp), which efficiently encode molecular shape and pharmacophores but do not perceive stereochemistry. Here we investigated related 3D-atom pair fingerprints to enable rapid stereoselective searches in the ZINC database (23.2 million 3D structures). RESULTS: Molecular fingerprints counting atom pairs at increasing through-space distance intervals were designed using either all atoms (16-bit 3DAPfp) or different atom categories (80-bit 3DXfp). These 3D-fingerprints retrieved molecular shape and pharmacophore analogs (defined by OpenEye ROCS scoring functions) of 110,000 compounds from the Cambridge Structural Database with equal or better accuracy than the 2D-fingerprints APfp and Xfp, and showed comparable performance in recovering actives from decoys in the DUD database. LBVS by 3DXfp or 3DAPfp similarity was stereoselective and gave very different analogs when starting from different diastereomers of the same chiral drug. Results also differed from LBVS with the parent 2D-fingerprints Xfp or APfp. 3D- and 2D-fingerprints also gave very different results in LBVS of folded molecules, where through-space distances between atom pairs are much shorter than topological distances. CONCLUSIONS: 3DAPfp and 3DXfp are suitable for stereoselective searches for shape and pharmacophore analogs of query molecules in large databases.
Web-browsers for searching ZINC by 3DAPfp and 3DXfp similarity are accessible at www.gdb.unibe.ch and should provide useful assistance to drug discovery projects. Graphical abstract: Atom pair fingerprints based on through-space distances (3DAPfp) provide better shape encoding than atom pair fingerprints based on topological distances (APfp), as measured by the recovery of ROCS shape analogs by fp similarity.
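The idea of counting atom pairs per through-space distance interval and comparing the counts by city-block distance can be sketched as follows. This is an illustrative toy, not the published 3DAPfp: the uniform 1.0-unit bins, the coordinates, and the function names are assumptions, and the real fingerprint's distance intervals and any normalization are not reproduced here.

```python
import math
from itertools import combinations


def atom_pair_fingerprint_3d(coords, n_bins=16, bin_width=1.0):
    """Count atom pairs per through-space distance bin.

    Uniform bins of width bin_width are a simplifying assumption;
    distances beyond the last bin are collected in the last bin.
    """
    fp = [0] * n_bins
    for a, b in combinations(coords, 2):
        d = math.dist(a, b)  # Euclidean (through-space) distance
        idx = min(int(d / bin_width), n_bins - 1)
        fp[idx] += 1
    return fp


def city_block_distance(fp_a, fp_b):
    """City-block (Manhattan) distance between two count vectors."""
    return sum(abs(x - y) for x, y in zip(fp_a, fp_b))


# Hypothetical four-atom chain in an extended and a folded arrangement.
extended = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (4.5, 0.0, 0.0)]
folded = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0), (0.0, 1.5, 0.0)]

fp_extended = atom_pair_fingerprint_3d(extended)
fp_folded = atom_pair_fingerprint_3d(folded)
print(fp_extended)
print(fp_folded)
print(city_block_distance(fp_extended, fp_folded))
```

Both arrangements have the same connectivity, so a topological atom pair count would be identical for them, while the through-space counts differ. That mirrors the point made above about folded molecules.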
Project description:The wwLigCSRre web server performs ligand-based screening using a 3D molecular similarity engine. Its aim is to provide a versatile online facility to assist the exploration of the chemical similarity of families of compounds, or to propose scaffold hops from a query compound. The service allows the user to screen several chemically diversified focused banks, such as Kinase-, CNS-, GPCR-, Ion-channel-, Antibacterial-, Anticancer- and Analgesic-focused libraries. The server also provides the possibility to screen the DrugBank and DSSTOX/Carcinogenic compounds databases. User banks can also be downloaded. The 3D similarity search combines both geometrical (3D) and physicochemical information. Starting from one 3D ligand molecule as query, the screening of such databases can yield novel compound scaffolds as hits or help to optimize previously identified hit molecules in a SAR (structure-activity relationship) project. wwLigCSRre can be accessed at http://bioserv.rpbs.univ-paris-diderot.fr/wwLigCSRre.html.
Project description:OBJECTIVES: Uncertainty around clinical heterogeneity and outcomes for patients with JDM represents a major burden of disease and a challenge for clinical management. We sought to identify novel classes of patients having similar temporal patterns in disease activity and relate them to baseline clinical features. METHODS: Data were obtained for n = 519 patients, including baseline demographic and clinical features, baseline and follow-up records of physician's global assessment of disease (PGA), and skin disease activity (modified DAS). Growth mixture models (GMMs) were fitted to identify classes of patients with similar trajectories of these variables. Baseline predictors of class membership were identified using Lasso regression. RESULTS: GMM analysis of PGA identified two classes of patients. Patients in class 1 (89%) tended to improve, while patients in class 2 (11%) had more persistent disease. Lasso regression identified abnormal respiration, lipodystrophy and time since diagnosis as baseline predictors of class 2 membership, with estimated odds ratios, controlling for the other two variables, of 1.91 for presence of abnormal respiration, 1.92 for lipodystrophy and 1.32 for time since diagnosis. GMM analysis of modified DAS identified three classes of patients. Patients in classes 1 (16%) and 2 (12%) had higher levels of modified DAS at diagnosis that improved or remained high, respectively. Patients in class 3 (72%) began with lower DAS levels that improved more quickly. Higher proportions of patients in PGA class 2 were in DAS class 2 (19%, compared with 16 and 10%). CONCLUSION: GMM analysis identified novel JDM phenotypes based on longitudinal PGA and modified DAS.
Project description:We describe and apply a scaffold-focused virtual screen based upon scaffold trees to the mitotic kinase TTK (MPS1). Using level 1 of the scaffold tree, we perform both 2D and 3D similarity searches between a query scaffold and a level 1 scaffold library derived from a 2 million compound library; 98 compounds from 27 unique top-ranked level 1 scaffolds are selected for biochemical screening. We show that this scaffold-focused virtual screen prospectively identifies eight confirmed active compounds that are structurally differentiated from the query compound. In comparison, 100 compounds were selected for biochemical screening using a virtual screen based upon whole molecule similarity resulting in 12 confirmed active compounds that are structurally similar to the query compound. We elucidated the binding mode for four of the eight confirmed scaffold hops to TTK by determining their protein-ligand crystal structures; each represents a ligand-efficient scaffold for inhibitor design.
Project description:Population heterogeneity in growth trajectories can be detected with growth mixture modeling (GMM). It is common that researchers compute composite scores of repeated measures and use them as multiple indicators of growth factors (baseline performance and growth) assuming measurement invariance between latent classes. Considering that the assumption of measurement invariance does not always hold, we investigate the impact of measurement noninvariance on class enumeration and parameter recovery in GMM through a Monte Carlo simulation study (Study 1). In Study 2, we examine the class enumeration and parameter recovery of the second-order growth mixture modeling (SOGMM) that incorporates measurement models at the first order level. Thus, SOGMM estimates growth trajectory parameters with reliable sources of variance, that is, common factor variance of repeated measures and allows heterogeneity in measurement parameters between latent classes. The class enumeration rates are examined with information criteria such as AIC, BIC, sample-size adjusted BIC, and hierarchical BIC under various simulation conditions. The results of Study 1 showed that the parameter estimates of baseline performance and growth factor means were biased to the degree of measurement noninvariance even when the correct number of latent classes was extracted. In Study 2, the class enumeration accuracy of SOGMM depended on information criteria, class separation, and sample size. The estimates of baseline performance and growth factor mean differences between classes were generally unbiased but the size of measurement noninvariance was underestimated. Overall, SOGMM is advantageous in that it yields unbiased estimates of growth trajectory parameters and more accurate class enumeration compared to GMM by incorporating measurement models.
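Class enumeration by information criteria, as mentioned above, can be sketched in a few lines. The example uses the standard definitions of AIC, BIC, and the sample-size adjusted BIC; the hierarchical BIC is omitted. The log-likelihoods, parameter counts, and sample size are invented for illustration and do not come from the simulation study.

```python
import math


def information_criteria(log_likelihood, n_params, n_obs):
    """Return AIC, BIC and sample-size adjusted BIC (lower is better)."""
    deviance = -2.0 * log_likelihood
    return {
        "AIC": deviance + 2.0 * n_params,
        "BIC": deviance + n_params * math.log(n_obs),
        "SABIC": deviance + n_params * math.log((n_obs + 2.0) / 24.0),
    }


def select_n_classes(fits, n_obs, criterion="BIC"):
    """Pick the number of classes that minimizes the chosen criterion.

    fits maps number of classes -> (log_likelihood, n_params).
    """
    scores = {
        k: information_criteria(ll, p, n_obs)[criterion]
        for k, (ll, p) in fits.items()
    }
    return min(scores, key=scores.get)


# Hypothetical fitted models with 1, 2 and 3 latent classes.
fits = {1: (-2100.0, 5), 2: (-2040.0, 9), 3: (-2035.0, 13)}
n_obs = 500

for name in ("AIC", "BIC", "SABIC"):
    print(name, "selects", select_n_classes(fits, n_obs, name), "classes")
```

With these made-up numbers, AIC prefers three classes while BIC and the adjusted BIC prefer two, which illustrates why enumeration accuracy can depend on the criterion used.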
Project description:BACKGROUND: Virtual screening methods are now well established as effective to identify hit and lead candidates and are fully integrated in most drug discovery programs. Ligand-based approaches make use of physico-chemical, structural and energetic properties of known active compounds to search large chemical libraries for related and novel chemotypes. While 2D-similarity search tools are known to be fast and efficient, the use of 3D-similarity search methods can be very valuable to many research projects, as integration of "3D knowledge" can facilitate the identification of not only related molecules but also of chemicals possessing scaffolds distant from that of the query, and is therefore more conducive to scaffold hopping. To date, very few methods performing this task are easily available to the scientific community. RESULTS: We introduce a new approach (LigCSRre) to the 3D ligand similarity search of drug candidates. It combines a 3D maximum common substructure search algorithm independent of atom order with a tunable description of atomic compatibilities to prune the search and increase its physico-chemical relevance. We show, on 47 experimentally validated active compounds across five protein targets having different specificities, that for single compound search, the approach is able to recover on average 52% of the co-actives in the top 1% of the ranked list, which is better than gold standards of the field. Moreover, the combination of several runs on a single protein target using different query active compounds shows a remarkable improvement in enrichment. Such results demonstrate LigCSRre as a valuable tool for ligand-based screening. CONCLUSION: LigCSRre constitutes a new efficient and generic approach to the 3D similarity screening of small compounds, whose flexible design opens the door to many enhancements.
The program is freely available to academics for non-profit research at: http://bioserv.rpbs.univ-paris-diderot.fr/LigCSRre.html.
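The "co-actives recovered in the top 1% of the ranked list" metric quoted above can be computed as in the sketch below. The compound identifiers, the ranking, and the function name are hypothetical, and the rounding of the cutoff is an assumption because the abstract does not specify it.

```python
def recovery_in_top_fraction(ranked_ids, active_ids, fraction=0.01):
    """Fraction of known actives found in the top part of a ranked list.

    ranked_ids is ordered from most to least similar to the query.
    The cutoff is rounded to the nearest integer and is at least 1.
    """
    actives = set(active_ids)
    n_top = max(1, round(len(ranked_ids) * fraction))
    found = sum(1 for cid in ranked_ids[:n_top] if cid in actives)
    return found / len(actives)


# Hypothetical screen: 1000 ranked compounds, five known co-actives.
ranked = [f"c{i}" for i in range(1000)]
co_actives = ["c0", "c3", "c7", "c50", "c400"]

recovery = recovery_in_top_fraction(ranked, co_actives, fraction=0.01)
print(f"{recovery:.0%} of co-actives recovered in the top 1%")
```

Here the top 1% covers the first 10 compounds, which contain three of the five co-actives.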
Project description:OBJECTIVE: To assess the performance of content-based image retrieval (CBIR) of chest CT for diffuse interstitial lung disease (DILD). MATERIALS AND METHODS: The database comprised 246 pairs of chest CTs (initial and follow-up CTs within two years) from 246 patients with usual interstitial pneumonia (UIP, n = 100), nonspecific interstitial pneumonia (NSIP, n = 101), and cryptogenic organizing pneumonia (COP, n = 45). Sixty cases (30 UIP, 20 NSIP, and 10 COP) were selected as the queries. The CBIR retrieved five CTs similar to a query from the database by comparing six image patterns (honeycombing, reticular opacity, emphysema, ground-glass opacity, consolidation and normal lung) of DILD, which were automatically quantified and classified by a convolutional neural network. We assessed the rates of retrieving the same pairs of query CTs, and the number of CTs with the same disease class as query CTs in top 1-5 retrievals. Chest radiologists evaluated the similarity between retrieved CTs and queries using a 5-point grading system (5, almost identical; 4, same disease; 3, likelihood of same disease is half; 2, likely different; and 1, different disease). RESULTS: The rate of retrieving the same pairs of query CTs in top 1 retrieval was 61.7% (37/60) and in top 1-5 retrievals was 81.7% (49/60). The CBIR retrieved the same pairs of query CTs more often in UIP than in NSIP and COP (p = 0.008 and 0.002). On average, it retrieved 4.17 of five similar CTs from the same disease class. Radiologists rated 71.3% to 73.0% of the retrieved CTs with a similarity score of 4 or 5. CONCLUSION: The proposed CBIR system showed good performance for retrieving chest CTs showing similar patterns for DILD.
Project description:Histone deacetylases (HDACs) are part of a vast family of enzymes with crucial roles in numerous biological processes, largely through their repressive influence on transcription, with serious implications in a variety of human diseases. Among different isoforms, human HDAC2 in particular draws attention as a promising target for the treatment of cancer and memory deficits associated with neurodegenerative diseases. The challenge now is to obtain a compound that is structurally novel and truly selective for HDAC2, because most current HDAC2 inhibitors do not show isoform selectivity and suffer from metabolic instability. In order to identify novel and isoform-selective inhibitors of human HDAC2, we designed a shape-based hybrid query from multiple scaffolds of known chemical classes and validated it to be a more effective approach to discovering diverse scaffolds than a single-molecule query. The hybrid query-based screening rendered a hit compound with the N-benzylaniline scaffold, which showed moderate inhibitory activity against HDAC2 and whose chemical structure is diverse compared to known HDAC2 inhibitors. Notably, this compound shows selectivity against HDAC6, a Class II enzyme, and thus has the potential to be developed further into class- and isoform-selective inhibitors. Our present study supplies a useful approach to identifying novel HDAC2 inhibitors, and can be extended to inquiries into other important biomedical targets as well.
Project description:BACKGROUND: Gene duplication provides raw material for the generation of new functions, but most duplicates are rapidly lost due to the initial redundancy in gene function. How gene function diversifies following duplication is largely unclear. Previous studies analyzed the diversification of duplicates by characterizing their coding sequence divergence. However, functional divergence can also be attributed to changes in regulatory properties, such as protein localization or expression, which require only minor changes in gene sequence. RESULTS: We developed a novel method to compare expression profiles from different organisms and applied it to analyze the expression divergence of yeast duplicated genes. The expression profiles of Saccharomyces cerevisiae duplicate pairs were compared with those of their pre-duplication orthologs in Candida albicans. Duplicate pairs were classified into two classes, corresponding to symmetric versus asymmetric rates of expression divergence. The latter class includes 43 duplicate pairs in which only one copy has a significant expression similarity to the C. albicans ortholog. These may present cases of regulatory neofunctionalization, as supported also by their dispensability and variability. CONCLUSION: Duplicated genes may diversify through regulatory neofunctionalization. Notably, the asymmetry of gene sequence evolution and the asymmetry of gene expression evolution are only weakly correlated, underscoring the importance of expression analysis to elucidate the evolution of novel functions.
Project description:Virtually every molecular biologist has searched a protein or DNA sequence database to find sequences that are evolutionarily related to a given query. Pairwise sequence comparison methods--i.e., measures of similarity between query and target sequences--provide the engine for sequence database search and have been the subject of 30 years of computational research. For the difficult problem of detecting remote evolutionary relationships between protein sequences, the most successful pairwise comparison methods involve building local models (e.g., profile hidden Markov models) of protein sequences. However, recent work in massive data domains like web search and natural language processing demonstrates the advantage of exploiting the global structure of the data space. Motivated by this work, we present a large-scale algorithm called ProtEmbed, which learns an embedding of protein sequences into a low-dimensional "semantic space." Evolutionarily related proteins are embedded in close proximity, and additional pieces of evidence, such as 3D structural similarity or class labels, can be incorporated into the learning process. We find that ProtEmbed achieves superior accuracy to widely used pairwise sequence methods like PSI-BLAST and HHSearch for remote homology detection; it also outperforms our previous RankProp algorithm, which incorporates global structure in the form of a protein similarity network. Finally, the ProtEmbed embedding space can be visualized, both at the global level and local to a given query, yielding intuition about the structure of protein sequence space.