Project description:NMR-I-TASSER, an adaption of the I-TASSER algorithm combining NMR data for protein structure determination, recently joined the second round of the CASD-NMR experiment. Unlike many molecular dynamics-based methods, NMR-I-TASSER takes a molecular replacement-like approach to the problem by first threading the target through the PDB to identify structural templates which are then used for iterative NOE assignments and fragment structure assembly refinements. The employment of multiple templates allows NMR-I-TASSER to sample different topologies while convergence to a single structure is not required. Retroactive and blind tests of the CASD-NMR targets from Rounds 1 and 2 demonstrate that even without using NOE peak lists I-TASSER can generate correct structure topology with 15 of 20 targets having a TM-score above 0.5. With the addition of NOE-based distance restraints, NMR-I-TASSER significantly improved the I-TASSER models with all models having the TM-score above 0.5. The average RMSD was reduced from 5.29 to 2.14 Å in Round 1 and 3.18 to 1.71 Å in Round 2. There is no obvious difference in the modeling results with using raw and refined peak lists, indicating robustness of the pipeline to the NOE assignment errors. Overall, despite the low-resolution modeling the current NMR-I-TASSER pipeline provides a coarse-grained structure folding approach complementary to traditional molecular dynamics simulations, which can produce fast near-native frameworks for atomic-level structural refinement.
Project description:Gene expression and SNPs data hold great potential for a new understanding of disease prognosis, drug sensitivity, and toxicity evaluations. Cluster analysis is used to analyze data that do not contain any specific subgroups. The goal is to use the data itself to recognize meaningful and informative subgroups. In addition, cluster investigation helps data reduction purposes, exposes hidden patterns, and generates hypotheses regarding the relationship between genes and phenotypes. Cluster analysis could also be used to identify bio-markers and yield computational predictive models. The methods used to analyze microarrays data can profoundly influence the interpretation of the results. Therefore, a basic understanding of these computational tools is necessary for optimal experimental design and meaningful data analysis. This manuscript provides an analysis protocol to effectively analyze gene expression data sets through the K-means and DBSCAN algorithms. The general protocol enables analyzing omics data to identify subsets of features with low redundancy and high robustness, speeding up the identification of new bio-markers through pathway enrichment analysis. In addition, to demonstrate the effectiveness of our clustering analysis protocol, we analyze a real data set from the GEO database. Finally, the manuscript provides some best practice and tips to overcome some issues in the analysis of omics data sets through unsupervised learning.
Project description:Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R(2)) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.
Project description:BackgroundMulti-dimensional scaling (MDS) is aimed to represent high dimensional data in a low dimensional space with preservation of the similarities between data points. This reduction in dimensionality is crucial for analyzing and revealing the genuine structure hidden in the data. For noisy data, dimension reduction can effectively reduce the effect of noise on the embedded structure. For large data set, dimension reduction can effectively reduce information retrieval complexity. Thus, MDS techniques are used in many applications of data mining and gene network research. However, although there have been a number of studies that applied MDS techniques to genomics research, the number of analyzed data points was restricted by the high computational complexity of MDS. In general, a non-metric MDS method is faster than a metric MDS, but it does not preserve the true relationships. The computational complexity of most metric MDS methods is over O(N2), so that it is difficult to process a data set of a large number of genes N, such as in the case of whole genome microarray data.ResultsWe developed a new rapid metric MDS method with a low computational complexity, making metric MDS applicable for large data sets. Computer simulation showed that the new method of split-and-combine MDS (SC-MDS) is fast, accurate and efficient. Our empirical studies using microarray data on the yeast cell cycle showed that the performance of K-means in the reduced dimensional space is similar to or slightly better than that of K-means in the original space, but about three times faster to obtain the clustering results. Our clustering results using SC-MDS are more stable than those in the original space. Hence, the proposed SC-MDS is useful for analyzing whole genome data.ConclusionOur new method reduces the computational complexity from O(N3) to O(N) when the dimension of the feature space is far less than the number of genes N, and it successfully reconstructs the low dimensional representation as does the classical MDS. Its performance depends on the grouping method and the minimal number of the intersection points between groups. Feasible methods for grouping methods are suggested; each group must contain both neighboring and far apart data points. Our method can represent high dimensional large data set in a low dimensional space not only efficiently but also effectively.
Project description:Large-scale genetic and genomic data are increasingly available and the major bottleneck in their analysis is a lack of sufficiently scalable computational tools. To address this problem in the context of complex traits analysis, we present DISSECT. DISSECT is a new and freely available software that is able to exploit the distributed-memory parallel computational architectures of compute clusters, to perform a wide range of genomic and epidemiologic analyses, which currently can only be carried out on reduced sample sizes or under restricted conditions. We demonstrate the usefulness of our new tool by addressing the challenge of predicting phenotypes from genotype data in human populations using mixed-linear model analysis. We analyse simulated traits from 470,000 individuals genotyped for 590,004 SNPs in ∼4 h using the combined computational power of 8,400 processor cores. We find that prediction accuracies in excess of 80% of the theoretical maximum could be achieved with large sample sizes.
Project description:Matched molecular pairs (MMPs) are widely used in medicinal chemistry to study changes in compound properties including biological activity, which are associated with well-defined structural modifications. Herein we describe up-to-date versions of three MMP-based data sets that have originated from in-house research projects. These data sets include activity cliffs, structure-activity relationship (SAR) transfer series, and second generation MMPs based upon retrosynthetic rules. The data sets have in common that they have been derived from compounds included in the ChEMBL database (release 17) for which high-confidence activity data are available. Thus, the activity data associated with MMP-based activity cliffs, SAR transfer series, and retrosynthetic MMPs cover the entire spectrum of current pharmaceutical targets. Our data sets are made freely available to the scientific community.
Project description:Nmrglue, an open source Python package for working with multidimensional NMR data, is described. When used in combination with other Python scientific libraries, nmrglue provides a highly flexible and robust environment for spectral processing, analysis and visualization and includes a number of common utilities such as linear prediction, peak picking and lineshape fitting. The package also enables existing NMR software programs to be readily tied together, currently facilitating the reading, writing and conversion of data stored in Bruker, Agilent/Varian, NMRPipe, Sparky, SIMPSON, and Rowland NMR Toolkit file formats. In addition to standard applications, the versatility offered by nmrglue makes the package particularly suitable for tasks that include manipulating raw spectrometer data files, automated quantitative analysis of multidimensional NMR spectra with irregular lineshapes such as those frequently encountered in the context of biomacromolecular solid-state NMR, and rapid implementation and development of unconventional data processing methods such as covariance NMR and other non-Fourier approaches. Detailed documentation, install files and source code for nmrglue are freely available at http://nmrglue.com. The source code can be redistributed and modified under the New BSD license.
Project description:BackgroundClustering of protein sequences is of key importance in predicting the structure and function of newly sequenced proteins and is also of use for their annotation. With the advent of multiple high-throughput sequencing technologies, new protein sequences are becoming available at an extraordinary rate. The rapid growth rate has impeded deployment of existing protein clustering/annotation tools which depend largely on pairwise sequence alignment.ResultsIn this paper, we propose an alignment-free clustering approach, coreClust, for annotating protein sequences using detected conserved regions. The proposed algorithm uses Min-Wise Independent Hashing for identifying similar conserved regions. Min-Wise Independent Hashing works by generating a (w,c)-sketch for each document and comparing these sketches. Our algorithm fits well within the MapReduce framework, permitting scalability. We show that coreClust generates results comparable to existing known methods. In particular, we show that the clusters generated by our algorithm capture the subfamilies of the Pfam domain families for which the sequences in a cluster have a similar domain architecture. We show that for a data set of 90,000 sequences (about 250,000 domain regions), the clusters generated by our algorithm give a 75% average weighted F1 score, our accuracy metric, when compared to the clusters generated by a semi-exhaustive pairwise alignment algorithm.ConclusionsThe new clustering algorithm can be used to generate meaningful clusters of conserved regions. It is a scalable method that when paired with our prior work, NADDA for detecting conserved regions, provides a complete end-to-end pipeline for annotating protein sequences.
Project description:This paper presents data of medical choices determining physicians' profit and patients' health benefit under three levels of market competition: monopoly, duopoly, and quadropoly. The data was collected from 136 German university students in an incentivized laboratory experiment. The designed experimental parameters and the formula for computing the payoff matrices of the games are described in this paper as well. This data was analyzed by Ge and Godager [5] who employed quantal response equilibrium choice models to investigate the relationship between market competition and determinism in behavior under a quantal response equilibrium paradigm. This data contributes to future investigation on alternative game theoretic equilibrium concepts and the development of empirical methods for studying strategic choice behavior.