Project description: Modern high-throughput screening methods allow researchers to generate large datasets that potentially contain important biological information. However, picking relevant hits from such screens and generating testable hypotheses often requires training in bioinformatics and the skills to efficiently perform database mining. There are currently no tools available to the general public that allow users to cross-reference their screen datasets with published screen datasets. To this end, we developed CrossCheck, an online platform for high-throughput screen data analysis. CrossCheck is a centralized database that allows effortless comparison of the user-entered list of gene symbols with 16,231 published datasets. These datasets include published data from genome-wide RNAi and CRISPR screens, interactome proteomics and phosphoproteomics screens, cancer mutation databases, low-throughput studies of major cell signaling mediators, such as kinases, E3 ubiquitin ligases and phosphatases, and gene ontological information. Moreover, CrossCheck includes a novel database of predicted protein kinase substrates, which was developed using proteome-wide consensus motif searches. CrossCheck dramatically simplifies high-throughput screen data analysis and enables researchers to dig deep into the published literature and streamline data-driven hypothesis generation. CrossCheck is freely accessible as a web-based application at http://proteinguru.com/crosscheck.
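The core cross-referencing operation described above amounts to set intersection over gene symbols. A minimal sketch of the principle (the gene lists and dataset names below are invented examples, not CrossCheck's actual implementation or data):

```python
# Illustrative sketch of the cross-referencing step, not CrossCheck's implementation.
# All gene lists and dataset names below are invented examples.
user_hits = {"TP53", "ATM", "CHEK2", "BRCA1", "MDM2"}

published_datasets = {
    "RNAi_screen_A": {"ATM", "ATR", "CHEK1", "CHEK2"},
    "CRISPR_screen_B": {"TP53", "MDM2", "MDM4"},
}

def cross_reference(user_genes, datasets):
    """For each published dataset, report which user genes it also contains."""
    return {name: sorted(user_genes & genes) for name, genes in datasets.items()}

overlaps = cross_reference(user_hits, published_datasets)
```

In practice each published screen is reduced to such a gene-symbol set, so one query scales to thousands of datasets with a single dictionary comprehension.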
Project description: Background: Increases in physical activity through active travel have the potential to have large beneficial effects on populations, through both better health outcomes and reduced motorized traffic. However, accurately identifying travel mode in large datasets is problematic. Here we provide an open-source tool to quantify time spent stationary and in four travel modes (walking, cycling, train, motorised vehicle) from accelerometer-measured physical activity data, combined with GPS and GIS data. Methods: The Examining Neighbourhood Activities in Built Living Environments in London (ENABLE London) study evaluates the effect of the built environment on health behaviours, including physical activity. Participants wore accelerometers and GPS receivers on the hip for 7 days. We time-matched accelerometer and GPS data, and then extracted data from the commutes of 326 adult participants, using stated commute times and modes, which were manually checked to confirm stated travel mode. This yielded examples of five travel modes: walking, cycling, motorised vehicle, train and stationary. We used this example data to train a gradient boosted tree, a form of supervised machine learning algorithm, on each data point (131,537 points), rather than on whole journeys. Accuracy during training was assessed using five-fold cross-validation. We also manually identified the travel behaviour of 21 participants from ENABLE London (402,749 points) and 10 participants from a separate study (STAMP-2; 210,936 points), none of whom were included in the training data. We compared our predictions against this manual identification to further test accuracy and generalisability. Results: Applying the algorithm, we correctly identified travel mode 97.3% of the time in cross-validation (mean sensitivity 96.3%, mean active travel sensitivity 94.6%).
We showed 96.0% agreement between manual identification and prediction of 21 individuals' travel modes (mean sensitivity 92.3%, mean active travel sensitivity 84.9%) and 96.5% agreement between the STAMP-2 study and predictions (mean sensitivity 85.5%, mean active travel sensitivity 78.9%). Conclusion: We present a generalisable tool that identifies time spent stationary and time spent walking with very high precision, time spent in trains or vehicles with good precision, and time spent cycling with moderate precision. In studies where both accelerometer and GPS data are available, this tool complements analyses of physical activity, showing whether differences in physical activity may be explained by differences in travel mode. All code necessary to replicate, fit and predict to other datasets is provided to facilitate use by other researchers.
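The accuracy and mean sensitivity figures reported above are standard multi-class metrics, where per-class sensitivity is the recall for that travel mode. A minimal sketch of how they are computed per data point, using made-up labels rather than the study's data:

```python
from collections import Counter

def mode_metrics(true_modes, pred_modes):
    """Overall accuracy and per-class sensitivity (recall) for travel-mode predictions."""
    accuracy = sum(t == p for t, p in zip(true_modes, pred_modes)) / len(true_modes)
    tp, total = Counter(), Counter()
    for t, p in zip(true_modes, pred_modes):
        total[t] += 1          # points truly in this class
        if t == p:
            tp[t] += 1         # of those, correctly predicted
    sensitivity = {mode: tp[mode] / total[mode] for mode in total}
    return accuracy, sensitivity

# Hypothetical point-level labels covering the five classes
true_modes = ["walk", "walk", "cycle", "train", "vehicle", "stationary", "cycle", "walk"]
pred_modes = ["walk", "walk", "cycle", "train", "vehicle", "stationary", "vehicle", "walk"]
acc, sens = mode_metrics(true_modes, pred_modes)
mean_sensitivity = sum(sens.values()) / len(sens)
```

Averaging sensitivity over classes, as done here, weights rare modes (e.g. cycling) equally with common ones, which is why mean sensitivity can be lower than raw agreement.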
Project description: Meaningful efforts in computer-aided drug design (CADD) require accurate molecular mechanical force fields to quantitatively characterize protein-ligand interactions, ligand hydration free energies, and other ligand physical properties. Atomic models of new compounds are commonly generated by analogy from the predefined tabulated parameters of a given force field. Two widely used approaches following this strategy are the General Amber Force Field (GAFF) and the CHARMM General Force Field (CGenFF). An important limitation of using pretabulated parameter values is that they may be inadequate in the context of a specific molecule. To resolve this issue, we previously introduced the General Automated Atomic Model Parameterization (GAAMP) for automatically generating the parameters of atomic models of small molecules, using the results from ab initio quantum mechanical (QM) calculations as target data. The GAAMP protocol uses QM data to optimize the bond, valence angle, and dihedral angle internal parameters, and atomic partial charges. However, since the treatment of van der Waals interactions based on QM is challenging and may often be unreliable, the Lennard-Jones 6-12 parameters are kept unchanged from the initial atom-type assignments (GAFF or CGenFF), which limits the accuracy that can be achieved by these models. To address this issue, a new set of Lennard-Jones 6-12 parameters was systematically optimized to reproduce experimental neat liquid densities and enthalpies of vaporization for a large set of 430 compounds, covering a wide range of chemical functionalities. Calculations of the hydration free energy indicate that optimal accuracy for these models is achieved when the molecule-water van der Waals dispersion is rescaled by a factor of 1.115. The final optimized model yields an average unsigned error of 0.79 kcal/mol in the hydration free energies.
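The headline figure of merit here, the average unsigned error in hydration free energies, is simply the mean absolute deviation between calculated and experimental values. A minimal sketch with hypothetical free energies (not the study's 430-compound benchmark):

```python
def mean_unsigned_error(calculated, experimental):
    """Average of |calculated - experimental| over a compound set (kcal/mol here)."""
    return sum(abs(c - e) for c, e in zip(calculated, experimental)) / len(calculated)

# Hypothetical hydration free energies (kcal/mol) for three compounds,
# standing in for the study's larger benchmark set.
calc = [-4.1, -6.8, -2.0]
expt = [-3.6, -7.5, -2.3]
mue = mean_unsigned_error(calc, expt)  # unsigned errors 0.5, 0.7, 0.3 -> mean 0.5
```

The reported 0.79 kcal/mol is this quantity evaluated with the optimized Lennard-Jones set and the 1.115 dispersion rescaling described above.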
Project description: Comparison of ligand poses generated by protein-ligand docking programs has often been carried out with the assumption of direct atomic correspondence between ligand structures. However, this correspondence is not necessarily chemically relevant for symmetric molecules and can lead to an artificial inflation of ligand pose distance metrics, particularly those that depend on receptor superposition (rather than ligand superposition), such as docking root mean square deviation (RMSD). Several of the commonly used RMSD calculation algorithms that correct for molecular symmetry do not take into account the bonding structure of molecules and can therefore result in non-physical atomic mappings. Here, we present DockRMSD, a docking pose distance calculator that converts the symmetry correction to a graph isomorphism searching problem, in which the optimal atomic mapping and RMSD calculation are performed by an exhaustive and fast matching search of all isomorphisms of the ligand structure graph. We show through evaluation of docking poses generated by AutoDock Vina on the CSAR Hi-Q set that DockRMSD is capable of deterministically identifying the minimum symmetry-corrected RMSD and is able to do so without significant loss of computational efficiency compared to other methods. The open-source DockRMSD program can be conveniently integrated with various docking pipelines to assist with accurate atomic mapping and RMSD calculations, which can therefore help improve docking performance, especially for ligand molecules with complicated structural symmetry.
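The symmetry correction can be illustrated by brute force: enumerate all atom mappings that preserve elements and bonds (self-isomorphisms of the ligand graph) and keep the mapping with the lowest RMSD. The sketch below uses a toy CO2-like molecule; DockRMSD itself uses a pruned exhaustive graph-isomorphism search rather than this raw permutation enumeration, which is exponential in the worst case:

```python
import itertools
import math

def rmsd(coords_a, coords_b, mapping):
    """RMSD of pose B vs. pose A under an atom mapping, without superposition
    (docking RMSD keeps both poses in the fixed receptor frame)."""
    sq = 0.0
    for i, j in enumerate(mapping):
        sq += sum((a - b) ** 2 for a, b in zip(coords_a[i], coords_b[j]))
    return math.sqrt(sq / len(coords_a))

def symmetry_corrected_rmsd(elements, bonds, coords_a, coords_b):
    """Minimum RMSD over all element- and bond-preserving atom mappings
    (self-isomorphisms of the ligand graph). Brute force, for illustration only."""
    n = len(elements)
    bond_set = {frozenset(b) for b in bonds}
    best = float("inf")
    for perm in itertools.permutations(range(n)):
        if any(elements[i] != elements[perm[i]] for i in range(n)):
            continue  # mapping must send each atom to one of the same element
        if any(frozenset((perm[i], perm[j])) not in bond_set for i, j in bond_set):
            continue  # mapping must preserve the bonding graph
        best = min(best, rmsd(coords_a, coords_b, perm))
    return best

# Toy symmetric molecule (O=C=O): pose B lists the two oxygens in swapped order.
elements = ["O", "C", "O"]
bonds = [(0, 1), (1, 2)]
pose_a = [(-1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pose_b = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]

naive = rmsd(pose_a, pose_b, (0, 1, 2))  # inflated by the naive identity mapping
corrected = symmetry_corrected_rmsd(elements, bonds, pose_a, pose_b)  # 0.0
```

The identical poses score a nonzero "distance" under the naive mapping but exactly zero once the equivalent oxygens are allowed to swap, which is precisely the artificial inflation the abstract describes.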
Project description: Passive acoustic monitoring is used widely in ecology, biodiversity, and conservation studies. Data sets collected via acoustic monitoring are often extremely large and built to be processed automatically using artificial intelligence and machine learning models, which aim to replicate the work of domain experts. These models, being supervised learning algorithms, need to be trained on high-quality annotations produced by experts. Since the experts are often resource-limited, a cost-effective process for annotating audio is needed to get maximal use out of the data. We present an open-source interactive audio data annotation tool, NEAL (Nature+Energy Audio Labeller). Built using R and the associated Shiny framework, the tool provides a reactive environment where users can quickly annotate audio files and adjust settings that automatically change the corresponding elements of the user interface. The app has been designed with the goal of having both expert birders and citizen scientists contribute to acoustic annotation projects. The popularity and flexibility of R programming in bioacoustics means that the Shiny app can be modified for other bird labelling data sets, or even adapted to generic audio labelling tasks. We demonstrate the app by labelling data collected from wind farm sites across Ireland.
Project description: Our understanding of complex living systems is limited by our capacity to perform experiments in high throughput. While robotic systems have automated many traditional hand-pipetting protocols, software limitations have precluded more advanced maneuvers required to manipulate, maintain, and monitor hundreds of experiments in parallel. Here, we present Pyhamilton, an open-source Python platform that can execute complex pipetting patterns required for custom high-throughput experiments such as the simulation of metapopulation dynamics. With an integrated plate reader, we maintain nearly 500 remotely monitored bacterial cultures in log-phase growth for days without user intervention by taking regular density measurements to adjust the robotic method in real time. Using these capabilities, we systematically optimize bioreactor protein production by monitoring the fluorescent protein expression and growth rates of a hundred different continuous culture conditions in triplicate to comprehensively sample the carbon, nitrogen, and phosphorus fitness landscape. Our results demonstrate that flexible software can empower existing hardware to enable new types and scales of experiments, advancing areas from biomanufacturing to fundamental biology.
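The feedback principle described, measuring density and diluting cultures back toward a setpoint to hold them in log phase, can be sketched in a few lines. This is a hypothetical illustration of the control logic only, not Pyhamilton's API; the setpoint, volumes, and well names are invented:

```python
def dilution_volume(od, od_setpoint, well_volume, max_transfer):
    """Volume (uL) of culture to replace with fresh media to return to the setpoint."""
    if od <= od_setpoint:
        return 0.0  # culture still below target density; leave it growing
    # Replacing a fraction (1 - setpoint/od) of the well dilutes OD back to the setpoint.
    volume = well_volume * (1.0 - od_setpoint / od)
    return min(volume, max_transfer)  # cap at what the robot can move per cycle

def control_step(readings, od_setpoint=0.4, well_volume=1000.0, max_transfer=300.0):
    """One feedback cycle: map plate-reader ODs to per-well dilution volumes."""
    return {well: dilution_volume(od, od_setpoint, well_volume, max_transfer)
            for well, od in readings.items()}

# Hypothetical plate-reader measurements for three wells
plan = control_step({"A1": 0.5, "A2": 0.3, "A3": 0.9})
```

Running such a cycle on every plate-reader pass is what lets hundreds of cultures stay in log phase without user intervention.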
Project description: Background: High throughput sequencing technologies have been increasingly used in basic genetic research as well as in clinical applications. More and more variants underlying Mendelian and complex diseases are being discovered and documented using these technologies. However, identifying and obtaining a short list of candidate disease-causing variants after variant calling remains challenging for most users, especially those without computational skills. Results: We developed GenESysV (Genome Exploration System for Variants) as a scalable, intuitive and user-friendly open-source tool. It can be used in any high throughput sequencing or genotyping project for storing, managing, prioritizing and efficiently retrieving variants of interest. GenESysV is designed for use by researchers from a wide range of disciplines and computational skills, including wet-lab scientists, clinicians, and bioinformaticians. Conclusions: GenESysV is the first tool able to handle genomic variant datasets ranging in size from a few to thousands of samples while still maintaining fast data importation and good query performance. It has a very intuitive graphical user interface and can also be used in studies where secured data access is an important concern. We believe this tool will benefit the human disease research community by speeding up the discovery of genetic variants underlying human genetic disorders.
Project description: The IDAT file format is used to store BeadArray data from the myriad of genome-wide profiling platforms on offer from Illumina Inc. This proprietary format is output directly from the scanner and stores summary intensities for each probe type on an array in a compact manner. A lack of open-source tools to process IDAT files has hampered their uptake by the research community beyond the standard step of using the vendor's software to extract the data they contain in a human-readable text format. To fill this void, we have developed the illuminaio package that parses IDAT files from any BeadArray platform, including the decryption of files from Illumina's gene expression arrays. illuminaio provides the first open-source package for this task, and will promote wider uptake of the IDAT format as a standard for sharing Illumina BeadArray data in public databases, in the same way that the CEL file serves as the standard for the Affymetrix platform.
Project description: Microfluidic devices are an enabling technology for many labs, facilitating a wide range of applications spanning high-throughput encapsulation, molecular separations, and long-term cell culture. In many cases, however, their utility is limited by a 'world-to-chip' barrier that makes it difficult to serially interface samples with these devices. As a result, many researchers are forced to rely on low-throughput, manual approaches for managing device input and output (IO) of samples, reagents, and effluent. Here, we present a hardware-software platform for automated microfluidic IO (micrIO). The platform, which is uniquely compatible with positive-pressure microfluidics, comprises an 'AutoSipper' for input and a 'Fraction Collector' for output. To facilitate widespread adoption, both are open-source builds constructed from components that are readily purchased online or fabricated from included design files. The software control library, written in Python, allows the platform to be integrated with existing experimental setups and to coordinate IO with other functions such as valve actuation and assay imaging. We demonstrate these capabilities by coupling both the AutoSipper and Fraction Collector to two microfluidic devices: a simple, valved inlet manifold and a microfluidic droplet generator that produces beads with distinct spectral codes. Analysis of the collected materials in each case establishes the ability of the platform to draw from and output to specific wells of multiwell plates with negligible cross-contamination between samples.
Project description: Ligand-based virtual screening is a widespread method in modern drug design. It allows for the rapid screening of large compound databases in order to identify similar structures. Here we report VSFlow, an open-source command line tool which supports substructure-, fingerprint- and shape-based virtual screening. Most of the implemented features fully rely on the RDKit cheminformatics framework. VSFlow accepts a wide range of input file formats and is highly customizable. Additionally, a quick visualization of the screening results as PDF and/or PyMOL files is supported.
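Fingerprint-based screening of this kind boils down to ranking library compounds by similarity to a query fingerprint, typically with the Tanimoto coefficient. A minimal pure-Python sketch using sets of 'on' bit indices (VSFlow itself relies on RDKit fingerprints; the compounds and bit sets below are invented):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between fingerprints given as sets of 'on' bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)  # |A&B| / |A|B union|

def screen(query_fp, library, threshold=0.5):
    """Rank library compounds by similarity to the query; keep hits >= threshold."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    return sorted((h for h in scored if h[1] >= threshold),
                  key=lambda h: h[1], reverse=True)

# Invented fingerprints: tiny bit sets standing in for real hashed fingerprints
query = {1, 2, 3, 4}
library = {
    "cpd_identical": {1, 2, 3, 4},
    "cpd_related": {1, 2, 5, 6},
    "cpd_unrelated": {7, 8},
}
hits = screen(query, library)  # only the identical compound passes the 0.5 cutoff
```

The threshold and ranking step is what makes rapid screening of large databases practical: only compounds above the similarity cutoff are carried forward.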