Farseer-NMR: automatic treatment, analysis and plotting of large, multi-variable NMR data.
ABSTRACT: We present Farseer-NMR ( https://git.io/vAueU ), a software package to treat, evaluate and combine NMR spectroscopic data from sets of protein-derived peaklists covering a range of experimental conditions. The combined advances in NMR and molecular biology enable the study of complex biomolecular systems such as flexible proteins or large multibody complexes, which display a strong and functionally relevant response to their environmental conditions, e.g. the presence of ligands, site-directed mutations, post translational modifications, molecular crowders or the chemical composition of the solution. These advances have created a growing need to analyse those systems' responses to multiple variables. The combined analysis of NMR peaklists from large and multivariable datasets has become a new bottleneck in the NMR analysis pipeline, whereby information-rich NMR-derived parameters have to be manually generated, which can be tedious, repetitive and prone to human error, or even unfeasible for very large datasets. There is a persistent gap in the development and distribution of software focused on peaklist treatment, analysis and representation, and specifically able to handle large multivariable datasets, which are becoming more commonplace. In this regard, Farseer-NMR aims to close this longstanding gap in the automated NMR user pipeline and, altogether, reduce the time burden of analysis of large sets of peaklists from days/weeks to seconds/minutes. We have implemented some of the most common, as well as new, routines for calculation of NMR parameters and several publication-quality plotting templates to improve NMR data representation. Farseer-NMR has been written entirely in Python and its modular code base enables facile extension.
Project description:Sparsely labeled NMR samples provide opportunities to study larger biomolecular assemblies than is traditionally done by NMR. This requires new computational tools that can handle the sparsity and ambiguity in the NMR datasets. The MELD (modeling employing limited data) Bayesian approach was assessed to be the best performing in predicting structures from sparsely labeled NMR data in the 13th edition of the Critical Assessment of Structure Prediction (CASP) event-and limitations of the methodology were also noted. In this report, we evaluate the nature and difficulty in modeling unassigned sparsely labeled NMR datasets and report on an improved methodological pipeline leading to higher-accuracy predictions. We benchmark our methodology against the NMR datasets provided by CASP 13.
Project description:Automated segmentation of cellular electron microscopy (EM) datasets remains a challenge. Supervised deep learning (DL) methods that rely on region-of-interest (ROI) annotations yield models that fail to generalize to unrelated datasets. Newer unsupervised DL algorithms require relevant pre-training images, however, pre-training on currently available EM datasets is computationally expensive and shows little value for unseen biological contexts, as these datasets are large and homogeneous. To address this issue, we present CEM500K, a nimble 25 GB dataset of 0.5 × 10<sup>6</sup> unique 2D cellular EM images curated from nearly 600 three-dimensional (3D) and 10,000 two-dimensional (2D) images from >100 unrelated imaging projects. We show that models pre-trained on CEM500K learn features that are biologically relevant and resilient to meaningful image augmentations. Critically, we evaluate transfer learning from these pre-trained models on six publicly available and one newly derived benchmark segmentation task and report state-of-the-art results on each. We release the CEM500K dataset, pre-trained models and curation pipeline for model building and further expansion by the EM community. Data and code are available at https://www.ebi.ac.uk/pdbe/emdb/empiar/entry/10592/ and https://git.io/JLLTz.
Project description:NMR-I-TASSER, an adaption of the I-TASSER algorithm combining NMR data for protein structure determination, recently joined the second round of the CASD-NMR experiment. Unlike many molecular dynamics-based methods, NMR-I-TASSER takes a molecular replacement-like approach to the problem by first threading the target through the PDB to identify structural templates which are then used for iterative NOE assignments and fragment structure assembly refinements. The employment of multiple templates allows NMR-I-TASSER to sample different topologies while convergence to a single structure is not required. Retroactive and blind tests of the CASD-NMR targets from Rounds 1 and 2 demonstrate that even without using NOE peak lists I-TASSER can generate correct structure topology with 15 of 20 targets having a TM-score above 0.5. With the addition of NOE-based distance restraints, NMR-I-TASSER significantly improved the I-TASSER models with all models having the TM-score above 0.5. The average RMSD was reduced from 5.29 to 2.14 Å in Round 1 and 3.18 to 1.71 Å in Round 2. There is no obvious difference in the modeling results with using raw and refined peak lists, indicating robustness of the pipeline to the NOE assignment errors. Overall, despite the low-resolution modeling the current NMR-I-TASSER pipeline provides a coarse-grained structure folding approach complementary to traditional molecular dynamics simulations, which can produce fast near-native frameworks for atomic-level structural refinement.
Project description:As part of efforts to develop improved methods for NMR protein sample preparation and structure determination, the Northeast Structural Genomics Consortium (NESG) has implemented an NMR screening pipeline for protein target selection, construct optimization, and buffer optimization, incorporating efficient microscale NMR screening of proteins using a micro-cryoprobe. The process is feasible because the newest generation probe requires only small amounts of protein, typically 30-200 microg in 8-35 microl volume. Extensive automation has been made possible by the combination of database tools, mechanization of key process steps, and the use of a micro-cryoprobe that gives excellent data while requiring little optimization and manual setup. In this perspective, we describe the overall process used by the NESG for screening NMR samples as part of a sample optimization process, assessing optimal construct design and solution conditions, as well as for determining protein rotational correlation times in order to assess protein oligomerization states. Database infrastructure has been developed to allow for flexible implementation of new screening protocols and harvesting of the resulting output. The NESG micro NMR screening pipeline has also been used for detergent screening of membrane proteins. Descriptions of the individual steps in the NESG NMR sample design, production, and screening pipeline are presented in the format of a standard operating procedure.
Project description:Industry 4.0 is all about interconnectivity, sensor-enhanced process control, and data-driven systems. Process analytical technology (PAT) such as online nuclear magnetic resonance (NMR) spectroscopy is gaining in importance, as it increasingly contributes to automation and digitalization in production. In many cases up to now, however, a classical evaluation of process data and their transformation into knowledge is not possible or not economical due to the insufficiently large datasets available. When developing an automated method applicable in process control, sometimes only the basic data of a limited number of batch tests from typical product and process development campaigns are available. However, these datasets are not large enough for training machine-supported procedures. In this work, to overcome this limitation, a new procedure was developed, which allows physically motivated multiplication of the available reference data in order to obtain a sufficiently large dataset for training machine learning algorithms. The underlying example chemical synthesis was measured and analyzed with both application-relevant low-field NMR and high-field NMR spectroscopy as reference method. Artificial neural networks (ANNs) have the potential to infer valuable process information already from relatively limited input data. However, in order to predict the concentration at complex conditions (many reactants and wide concentration ranges), larger ANNs and, therefore, a larger training dataset are required. We demonstrate that a moderately complex problem with four reactants can be addressed using ANNs in combination with the presented PAT method (low-field NMR) and with the proposed approach to generate meaningful training data. Graphical abstract.
Project description:Projection-reconstruction NMR (PR-NMR) has attracted growing attention as a method for collecting multidimensional NMR data rapidly. The PR-NMR procedure involves measuring lower-dimensional projections of a higher-dimensional spectrum, which are then used for the mathematical reconstruction of the full spectrum. We describe here the program PR-CALC, for the reconstruction of NMR spectra from projection data. This program implements a number of reconstruction algorithms, highly optimized to achieve maximal performance, and manages the reconstruction process automatically, producing either full spectra or subsets, such as regions or slices, as requested. The ability to obtain subsets allows large spectra to be analyzed by reconstructing and examining only those subsets containing peaks, offering considerable savings in processing time and storage space. PR-CALC is straightforward to use, and integrates directly into the conventional pipeline for data processing and analysis. It was written in standard C+ + and should run on any platform. The organization is flexible, and permits easy extension of capabilities, as well as reuse in new software. PR-CALC should facilitate the widespread utilization of PR-NMR in biomedical research.
Project description:Fragment-based drug discovery or FBDD is one of the main methods used by industry and academia for identifying drug-like candidates in early stages of drug discovery. NMR has a significant impact at any stage of the drug discovery process, from primary identification of small molecules to the elucidation of binding modes for guiding optimisations. The essence of NMR as an analytical tool, however, requires the processing and analysis of relatively large amounts of single data items, e.g. spectra, which can be daunting when managed manually. One bottleneck in FBDD by NMR is a lack of adequate and well-integrated resources for NMR data analysis that are freely available to the community. Thus, scientists typically resort to manually inspecting large datasets and relying predominantly on subjective interpretations. In this manuscript, we present CcpNmr AnalysisScreen, a software package that provides computational tools for automated analysis of FBDD data by NMR. We outline how the quality of collected spectra can be evaluated quickly, and how robust workflows can be optimised for reliable and rapid hit identification. With an intuitive graphical user interface and powerful algorithms, AnalysisScreen enables easy analysis of the large datasets needed in the early process of drug discovery by NMR.
Project description:Biomolecular NMR structures are now routinely used in biology, chemistry, and bioinformatics. Methods and metrics for assessing the accuracy and precision of protein NMR structures are beginning to be standardized across the biological NMR community. These include both knowledge-based assessment metrics, parameterized from the database of protein structures, and model versus data assessment metrics. On line servers are available that provide comprehensive protein structure quality assessment reports, and efforts are in progress by the world-wide Protein Data Bank (wwPDB) to develop a biomolecular NMR structure quality assessment pipeline as part of the structure deposition process. These quality assessment metrics and standards will aid NMR spectroscopists in determining more accurate structures, and increase the value and utility of these structures for the broad scientific community.