Determining chemical reactivity driving biological activity from SMILES transformations: the bonding mechanism of anti-HIV pyrimidines.
ABSTRACT: Assessing the molecular mechanism of a chemical-biological interaction and bonding stands as the ultimate goal of any modern quantitative structure-activity relationship (QSAR) study. To this end, the present work employs the main chemical reactivity structural descriptors (electronegativity, chemical hardness, chemical power, electrophilicity) to unfold the variational QSAR through their min-max correspondence principles, as applied to the Simplified Molecular Input Line Entry System (SMILES) transformation of selected uracil derivatives with anti-HIV potential, with the aim of establishing the main stages whereby the given compounds may inhibit HIV infection. The bonding can be completely described by explicitly considering, by means of basic indices and chemical reactivity principles, two forms of SMILES structures of the pyrimidines: the Longest SMILES Molecular Chain (LoSMoC) and the Branching SMILES (BraS). These are the effective forms involved in the anti-HIV activity mechanism and, according to the present work, also necessary intermediates in molecular pathways targeting/docking biological sites of interest.
Project description:Variational quantitative binding-conformational analysis for a series of anti-HIV pyrimidine-based ligands is advanced at the individual molecular level. This was achieved by employing ligand-receptor docking algorithms for each molecule in the 1,3-disubstituted uracil derivative series that was studied. Such computational algorithms were employed for analyzing both genuine molecular cases and their simplified molecular input line entry system (SMILES) transformations, which were created via the controlled breaking of chemical bonds, so as to generate the longest SMILES molecular chain (LoSMoC) and Branching SMILES (BraS) conformations. The study identified the most active anti-HIV molecules and analyzed their special and relevant bonding fragments (chemical alerts), together with the recorded energetic and geometric docking results (i.e., binding and affinity energies, and the surface area and volume of bonding, respectively). Clear computational evidence was also produced concerning the ligand-receptor pocket binding efficacies of the LoSMoC and BraS conformation types, thus confirming their earlier presence (as suggested by variational quantitative structure-activity relationship, variational-QSAR) as active intermediates for the molecule-to-cell transduction process.
Project description:BACKGROUND: There are two line notations of chemical structures that have established themselves in the field: the SMILES string and the InChI string. The InChI aims to provide a unique, or canonical, identifier for chemical structures, while SMILES strings are widely used for storage and interchange of chemical structures, but no standard exists to generate a canonical SMILES string. RESULTS: I describe how to use the InChI canonicalisation to derive a canonical SMILES string in a straightforward way, either incorporating the InChI normalisations (Inchified SMILES) or not (Universal SMILES). This is the first description of a method to generate canonical SMILES that takes stereochemistry into account. When tested on the 1.1 million compounds in the ChEMBL database, and a 1 million compound subset of the PubChem Substance database, no canonicalisation failures were found with Inchified SMILES. Using Universal SMILES, 99.79% of the ChEMBL database was canonicalised successfully and 99.77% of the PubChem subset. CONCLUSIONS: The InChI canonicalisation algorithm can successfully be used as the basis for a common standard for canonical SMILES. While challenges remain, such as the development of a standard aromatic model for SMILES, the ability to create the same SMILES using different toolkits will mean that for the first time it will be possible to easily compare the chemical models used by different toolkits.
Project description:Recurrent Neural Networks (RNNs) trained with a set of molecules represented as unique (canonical) SMILES strings have shown the capacity to create large chemical spaces of valid and meaningful structures. Herein we perform an extensive benchmark on models trained with subsets of GDB-13 of different sizes (1 million, 10,000 and 1000), with different SMILES variants (canonical, randomized and DeepSMILES), with two different recurrent cell types (LSTM and GRU) and with different hyperparameter combinations. To guide the benchmarks, new metrics were developed that define how well a model has generalized the training set. The generated chemical space is evaluated with respect to its uniformity, closedness and completeness. Results show that models that use LSTM cells trained with 1 million randomized SMILES, a non-unique molecular string representation, are able to generalize to larger chemical spaces than the other approaches and they represent the target chemical space more accurately. Specifically, a model was trained with randomized SMILES that was able to generate almost all molecules from GDB-13 with a quasi-uniform probability. Models trained with smaller samples show an even bigger improvement when trained with randomized SMILES. Additionally, models were trained on molecules obtained from ChEMBL and illustrate again that training with randomized SMILES leads to models having a better representation of the drug-like chemical space. Namely, the model trained with randomized SMILES was able to generate at least double the number of unique molecules with the same distribution of properties compared to one trained with canonical SMILES.
Project description:A smile is the most frequent facial expression, but not all smiles are equal. A social-functional account holds that smiles of reward, affiliation, and dominance serve basic social functions, including rewarding behavior, bonding socially, and negotiating hierarchy. Here, we characterize the facial-expression patterns associated with these three types of smiles. Specifically, we modeled the facial expressions using a data-driven approach and showed that reward smiles are symmetrical and accompanied by eyebrow raising, affiliative smiles involve lip pressing, and dominance smiles are asymmetrical and contain nose wrinkling and upper-lip raising. A Bayesian-classifier analysis and a detection task revealed that the three smile types are highly distinct. Finally, social judgments made by a separate participant group showed that the different smile types convey different social messages. Our results provide the first detailed description of the physical form and social messages conveyed by these three types of functional smiles and document the versatility of these facial expressions.
Project description:Recurrent neural networks have been widely used to generate millions of de novo molecules in defined chemical spaces. Reported deep generative models are exclusively based on LSTM and/or GRU units and frequently trained using canonical SMILES. In this study, we introduce Generative Examination Networks (GEN) as a new approach to train deep generative networks for SMILES generation. In our GENs, we have used an architecture based on multiple concatenated bidirectional RNN units to enhance the validity of generated SMILES. GENs autonomously learn the target space in a few epochs and are stopped early using an independent online examination mechanism, measuring the quality of the generated set. Herein we have used online statistical quality control (SQC) on the percentage of valid molecular SMILES as examination measure to select the earliest available stable model weights. Very high levels of valid SMILES (95–98%) can be generated using multiple parallel encoding layers in combination with SMILES augmentation using unrestricted SMILES randomization. Our trained models combine an excellent novelty rate (85–90%) while generating SMILES with strong conservation of the property space (95–99%). In GENs, both the generative network and the examination mechanism are open to other architectures and quality criteria.
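As a rough illustration of the examination idea in the entry above, the sketch below picks the earliest epoch at which the percentage of valid SMILES looks stable. The abstract does not specify the SQC rule, so the sliding-window criterion, the function name `earliest_stable_epoch`, and the parameters `window`, `min_mean` and `max_spread` are all assumptions made for this example rather than the authors' method.

```python
def earliest_stable_epoch(validity_pct, window=3, min_mean=95.0, max_spread=2.0):
    """Return the index of the first epoch that closes a 'stable' window.

    validity_pct: per-epoch percentage of valid generated SMILES.
    A window counts as stable (hypothetical rule) when its mean is at
    least `min_mean` and its max-min spread is at most `max_spread`.
    Returns None if no window qualifies.
    """
    for end in range(window, len(validity_pct) + 1):
        recent = validity_pct[end - window:end]
        mean = sum(recent) / window
        spread = max(recent) - min(recent)
        if mean >= min_mean and spread <= max_spread:
            return end - 1  # last epoch inside the stable window
    return None


# Illustrative validity history (made-up numbers, not results from the study).
history = [40.0, 72.5, 90.1, 95.2, 96.0, 96.4, 97.1]
stable_epoch = earliest_stable_epoch(history)
```

In a training loop, the weights saved at `stable_epoch` would be the ones kept; any other quality criterion could replace the validity percentage without changing the structure of the check.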
Project description:Computer descriptions of chemical molecular connectivity are necessary for searching chemical databases and for predicting chemical properties from molecular structure. In this article, the ongoing work to describe the chemical connectivity of entries contained in the Crystallography Open Database (COD) in SMILES format is reported. This collection of SMILES is publicly available for chemical (substructure) search or for any other purpose on an open-access basis, as is the COD itself. The conventions that have been followed for the representation of compounds that do not fit into the valence bond theory are outlined for the most frequently found cases. The procedure for getting the SMILES out of the CIF files starts with checking whether the atoms in the asymmetric unit are a chemically acceptable image of the compound. When they are not (molecule in a symmetry element, disorder, polymeric species, etc.), the previously published cif_molecule program is used to get such an image in many cases. The program package Open Babel is then applied to get SMILES strings from the CIF files (either those directly taken from the COD or those produced by cif_molecule when applicable). The results are then checked and/or fixed by a human editor, in a computer-aided task that at present still consumes a great deal of human time. Even if the procedure still needs to be improved to make it more automatic (and hence faster), it has already yielded more than 160,000 curated chemical structures, and the purpose of this article is to announce the existence of this work to the chemical community as well as to spread the use of its results.
Project description:Molecular generative models trained with small sets of molecules represented as SMILES strings can generate large regions of the chemical space. Unfortunately, due to the sequential nature of SMILES strings, these models are not able to generate molecules given a scaffold (i.e., partially-built molecules with explicit attachment points). Herein we report a new SMILES-based molecular generative architecture that generates molecules from scaffolds and can be trained from any arbitrary molecular set. This approach is possible thanks to a new molecular set pre-processing algorithm that exhaustively slices all possible combinations of acyclic bonds of every molecule, combinatorically obtaining a large number of scaffolds with their respective decorations. Moreover, it serves as a data augmentation technique and can be readily coupled with randomized SMILES to obtain even better results with small sets. Two examples showcasing the potential of the architecture in medicinal and synthetic chemistry are described: First, models were trained with a training set obtained from a small set of Dopamine Receptor D2 (DRD2) active modulators and were able to meaningfully decorate a wide range of scaffolds and obtain molecular series predicted active on DRD2. Second, a larger set of drug-like molecules from ChEMBL was selectively sliced using synthetic chemistry constraints (RECAP rules). In this case, the resulting scaffolds with decorations were filtered only to allow those that included fragment-like decorations. This filtering process allowed models trained with this dataset to selectively decorate diverse scaffolds with fragments that were generally predicted to be synthesizable and attachable to the scaffold using known synthetic approaches. In both cases, the models were already able to decorate molecules using specific knowledge without the need to add it with other techniques, such as reinforcement learning. 
We envision that this architecture will become a useful addition to the already existent architectures for de novo molecular generation.
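The pre-processing idea described in the entry above, cutting every combination of acyclic bonds, can be illustrated on a toy molecular graph. This is a minimal sketch under stated assumptions, not the published algorithm: atoms are plain integer indices, the caller supplies which bonds are acyclic (no ring perception is done here), and the names `fragments_after_cut` and `enumerate_slicings` are hypothetical. It also omits the step that decides which fragment is the scaffold and which are decorations.

```python
from itertools import combinations


def fragments_after_cut(n_atoms, bonds, cut):
    """Connected components (sorted atom-index tuples) after removing `cut` bonds."""
    adjacency = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        if (a, b) not in cut:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, parts = set(), []
    for start in range(n_atoms):
        if start in seen:
            continue
        # Depth-first walk collecting one connected component.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        parts.append(tuple(sorted(component)))
    return parts


def enumerate_slicings(n_atoms, bonds, acyclic_bonds, max_cuts=None):
    """Yield (cut, fragments) for every non-empty combination of acyclic bonds."""
    limit = len(acyclic_bonds) if max_cuts is None else min(max_cuts, len(acyclic_bonds))
    for k in range(1, limit + 1):
        for cut in combinations(acyclic_bonds, k):
            yield cut, fragments_after_cut(n_atoms, bonds, set(cut))


# Toy graph: a two-atom chain (0-1) attached to a three-membered ring (2, 3, 4),
# which carries one more substituent atom (5).
toy_bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (2, 4), (4, 5)]
toy_acyclic = [(0, 1), (1, 2), (4, 5)]
slicings = list(enumerate_slicings(6, toy_bonds, toy_acyclic))
```

Because every acyclic bond is a bridge, cutting k of them always yields k + 1 fragments, which is why the number of scaffold/decoration combinations grows quickly even for small training sets.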
Project description:Chemical autoencoders are attractive models as they combine chemical space navigation with possibilities for de novo molecule generation in areas of interest. This enables them to produce focused chemical libraries around a single lead compound for employment early in a drug discovery project. Here, it is shown that the choice of chemical representation, such as strings from the simplified molecular-input line-entry system (SMILES), has a large influence on the properties of the latent space. It is further explored to what extent translating between different chemical representations influences the latent space similarity to the SMILES strings or circular fingerprints. By employing SMILES enumeration for either the encoder or decoder, it is found that the decoder has the largest influence on the properties of the latent space. Training a sequence-to-sequence heteroencoder based on recurrent neural networks (RNNs) with long short-term memory (LSTM) cells to predict different enumerated SMILES strings from the same canonical SMILES string gives the largest similarity between latent space distance and molecular similarity measured as circular fingerprint similarity. Using the output from the code layer in quantitative structure-activity relationship (QSAR) modelling of five molecular datasets shows that heteroencoder-derived vectors markedly outperform autoencoder-derived vectors as well as models built using ECFP4 fingerprints, underlining the increased chemical relevance of the latent space. However, the use of enumeration during training of the decoder leads to a marked increase in the rate of decoding to different molecules than encoded, a tendency that can be counteracted with more complex network architectures.
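For readers unfamiliar with fingerprint similarity, the sketch below computes a Tanimoto coefficient between two fingerprints given as sets of on-bit indices. The abstract above does not name the similarity measure used, so the choice of Tanimoto, the function name `tanimoto`, and the convention of returning 0.0 for two empty fingerprints are assumptions of this example.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient |A & B| / |A | B| for two collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Both fingerprints empty: similarity is undefined; this sketch returns 0.0.
        return 0.0
    return len(a & b) / len(union)


# Two made-up fingerprints sharing bits 3 and 4 out of five distinct bits.
similarity = tanimoto({1, 2, 3, 4}, {3, 4, 5})
```

A comparison of the kind described in the abstract would then relate such pairwise similarities to distances between the corresponding latent vectors.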
Project description:The data have been obtained from the Sigma-2 Receptor Selective Ligands Database (S2RSLDB) and refined according to the QSAR requirements. These data provide information about a set of 548 Sigma-2 (σ2) receptor ligands selective over Sigma-1 (σ1) receptor. The development of the QSAR model has been undertaken with the use of CORAL software using SMILES, molecular graphs and hybrid descriptors (SMILES and graph together). Data here reported include the regression for σ2 receptor pKi QSAR models. The QSAR model was also employed to predict the σ2 receptor pKi values of the FDA approved drugs that are herewith included.
Project description:The data have been obtained from the Heme Oxygenase Database (HemeOxDB) and refined according to the 2D-QSAR requirements. These data provide information about a set of more than 380 Heme Oxygenase-1 (HO-1) inhibitors. The development of the 2D-QSAR model has been undertaken with the use of CORAL software using SMILES, molecular graphs and hybrid descriptors (SMILES and graph together). The 2D-QSAR model regressions for HO-1 half maximal inhibitory concentration (IC50) expressed as pIC50 (pIC50 = -log IC50) are here included. The 2D-QSAR model was also employed to predict the HO-1 pIC50 values of the FDA approved drugs that are herewith reported.
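The pIC50 definition quoted in the entry above can be written as a short conversion. This is a minimal sketch that assumes a base-10 logarithm and an IC50 expressed in mol/L, which is the usual convention but is not stated in the abstract; the function names are hypothetical.

```python
import math


def pic50_from_ic50(ic50_molar):
    """Return pIC50 = -log10(IC50), with IC50 assumed to be in mol/L."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return -math.log10(ic50_molar)


def pic50_from_nanomolar(ic50_nm):
    """Convenience wrapper for IC50 values reported in nmol/L."""
    return pic50_from_ic50(ic50_nm * 1e-9)


# An IC50 of 1 micromolar corresponds to a pIC50 of about 6.
example = pic50_from_ic50(1e-6)
```

The same form applies to pKi values, with Ki in place of IC50. A higher value means a more potent compound, which is why QSAR regressions are usually fitted on the logarithmic scale.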