Memory-efficient RNA energy landscape exploration.
ABSTRACT: MOTIVATION: Energy landscapes provide a valuable means for studying the folding dynamics of short RNA molecules in detail by modeling all possible structures and their transitions. Higher abstraction levels based on a macro-state decomposition of the landscape enable the study of larger systems; however, they are still restricted by huge memory requirements of exact approaches. RESULTS: We present a highly parallelizable local enumeration scheme that enables the computation of exact macro-state transition models with highly reduced memory requirements. The approach is evaluated on RNA secondary structure landscapes using a gradient basin definition for macro-states. Furthermore, we demonstrate the need for exact transition models by comparing two barrier-based approaches, and perform a detailed investigation of gradient basins in RNA energy landscapes. AVAILABILITY AND IMPLEMENTATION: Source code is part of the C++ Energy Landscape Library available at http://www.bioinf.uni-freiburg.de/Software/.
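The gradient basin notion of macro-states mentioned above can be illustrated on a toy landscape. This is a minimal sketch, not the Energy Landscape Library's implementation: the state names, the `energy` values and the `neighbors` adjacency are hypothetical, and each state is simply assigned to the local minimum reached by repeatedly moving to its lowest-energy neighbor.

```python
from collections import defaultdict

# Hypothetical toy landscape: state -> energy, plus an undirected neighborhood.
energy = {"a": 0.0, "b": 1.5, "c": 2.0, "d": 0.5, "e": 3.0}
neighbors = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d", "e"],
    "d": ["c"],
    "e": ["c"],
}

def gradient_step(state):
    """Return the lowest-energy neighbor if it is strictly lower, else None."""
    # Ties are broken by state name so that the walk is deterministic.
    best = min(neighbors[state], key=lambda s: (energy[s], s))
    return best if energy[best] < energy[state] else None

def local_minimum(state):
    """Follow gradient steps until no neighbor has lower energy."""
    while (nxt := gradient_step(state)) is not None:
        state = nxt
    return state

def gradient_basins():
    """Group all states by the local minimum their gradient walk ends in."""
    basins = defaultdict(list)
    for state in energy:
        basins[local_minimum(state)].append(state)
    return dict(basins)

basins = gradient_basins()
```

In this toy example the five states fall into two basins, one around `a` and one around `d`. A real RNA landscape would enumerate secondary structures and their neighbors instead of using a fixed dictionary.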
Project description: MOTIVATION: During the last few years, several new small regulatory RNAs (sRNAs) have been discovered in bacteria. Most of them act as post-transcriptional regulators by base pairing to a target mRNA, causing translational repression or activation, or mRNA degradation. Numerous sRNAs have already been identified, but the number of experimentally verified targets is considerably lower. Consequently, computational target prediction is in great demand. Many existing target prediction programs neglect the accessibility of target sites and the existence of a seed, while other approaches are either specialized to certain types of RNAs or too slow for genome-wide searches. RESULTS: We introduce IntaRNA, a new general and fast approach to the prediction of RNA-RNA interactions incorporating accessibility of target sites as well as the existence of a user-definable seed. We successfully applied IntaRNA to the prediction of bacterial sRNA targets and determined the exact locations of the interactions with a higher accuracy than competing programs. AVAILABILITY: http://www.bioinf.uni-freiburg.de/Software/
Project description: We present GraphProt, a computational framework for learning sequence- and structure-binding preferences of RNA-binding proteins (RBPs) from high-throughput experimental data. We benchmark GraphProt, demonstrating that the modeled binding preferences conform to the literature, and showcase the biological relevance and two applications of GraphProt models. First, estimated binding affinities correlate with experimental measurements. Second, predicted Ago2 targets display higher levels of expression upon Ago2 knockdown, whereas control targets do not. Computational binding models, such as those provided by GraphProt, are essential for predicting RBP binding sites and affinities in all tissues. GraphProt is freely available at http://www.bioinf.uni-freiburg.de/Software/GraphProt.
Project description: MOTIVATION: State-of-the-art experimental data for determining binding specificities of peptide recognition modules (PRMs) are obtained by high-throughput approaches like peptide arrays. Most prediction tools applicable to this kind of data are based on an initial multiple alignment of the peptide ligands. Building an initial alignment can be error-prone, especially in the case of the proline-rich peptides bound by the SH3 domains. RESULTS: Here, we present a machine-learning approach based on an efficient graph-kernel technique to predict the specificity of a large set of 70 human SH3 domains, which are an important class of PRMs. The graph-kernel strategy allows us to (i) integrate several types of physico-chemical information for each amino acid, (ii) consider high-order correlations between these features and (iii) eliminate the need for an initial peptide alignment. We build specialized models for each human SH3 domain and achieve competitive predictive performance of 0.73 area under precision-recall curve, compared with 0.27 area under precision-recall curve for state-of-the-art methods based on position weight matrices. We show that better models can be obtained when we use information on the noninteracting peptides (negative examples), which is currently not used by the state-of-the-art approaches based on position weight matrices. To this end, we analyze two strategies to identify subsets of high-confidence negative data. The techniques introduced here are more general and hence can also be used for any other protein domains that interact with short peptides (i.e. other PRMs). AVAILABILITY: The program with the predictive models can be found at http://www.bioinf.uni-freiburg.de/Software/SH3PepInt/SH3PepInt.tar.gz. We also provide a genome-wide prediction for all 70 human SH3 domains, which can be found under http://www.bioinf.uni-freiburg.de/Software/SH3PepInt/Genome-Wide-Predictions.tar.gz.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Project description: Clustering RNA sequences with common secondary structure is an essential step towards studying RNA function. Whereas structural RNA alignment strategies typically identify common structure for orthologous structured RNAs, clustering seeks to group paralogous RNAs based on structural similarities. However, existing approaches for clustering paralogous RNAs do not take the compensatory base pair changes obtained from structure conservation in orthologous sequences into account. Here, we present RNAscClust, the implementation of a new algorithm to cluster a set of structured RNAs taking their respective structural conservation into account. For a set of multiple structural alignments of RNA sequences, each containing a paralog sequence included in a structural alignment of its orthologs, RNAscClust computes minimum free-energy structures for each sequence using conserved base pairs as prior information for the folding. The paralogs are then clustered using a graph kernel-based strategy, which identifies common structural features. We show that the clustering accuracy clearly benefits from an increasing degree of compensatory base pair changes in the alignments. RNAscClust is available at http://www.bioinf.uni-freiburg.de/Software/RNAscClust. Supplementary data are available at Bioinformatics online.
Project description: MOTIVATION: Specific functions of ribonucleic acid (RNA) molecules are often associated with different motifs in the RNA structure. The key feature that forms such an RNA motif is the combination of sequence and structure properties. In this article, we introduce a new RNA sequence-structure comparison method which maintains exact matching substructures. Existing common substructures are treated as a whole unit while variability is allowed between such structural motifs. Based on a fast detectable set of overlapping and crossing substructure matches for two nested RNA secondary structures, our method ExpaRNA (exact pattern of alignment of RNA) computes the longest collinear sequence of substructures common to two RNAs in O(H·nm) time and O(nm) space, where H ≪ n·m for real RNA structures. Applied to different RNAs, our method correctly identifies sequence-structure similarities between two RNAs. RESULTS: We have compared ExpaRNA with two other alignment methods that work with given RNA structures, namely RNAforester and RNA_align. The results are in good agreement, but can be obtained in a fraction of the running time, in particular for larger RNAs. We have also used ExpaRNA to speed up state-of-the-art Sankoff-style alignment tools like LocARNA, and observe a tradeoff between quality and speed. However, we get a speedup of 4.25 even in the highest quality setting, where the quality of the produced alignment is comparable to that of LocARNA alone. AVAILABILITY: The presented algorithm is implemented in the program ExpaRNA, which is available from our website (http://www.bioinf.uni-freiburg.de/Software).
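The idea of selecting a longest collinear sequence of substructure matches can be sketched as a weighted chaining problem. The code below is a simplified illustration under stated assumptions, not ExpaRNA's algorithm: the `Match` type and its fields are hypothetical, matches are treated as plain interval pairs, the nested structure of the RNAs is ignored, and the dynamic program is quadratic in the number of matches rather than achieving the bounds quoted in the abstract.

```python
from typing import List, NamedTuple

class Match(NamedTuple):
    """Hypothetical substructure match: closed intervals in RNA 1 and RNA 2."""
    start1: int
    end1: int
    start2: int
    end2: int
    size: int  # number of matched positions

def precedes(a: Match, b: Match) -> bool:
    """True if a ends strictly before b starts in both sequences."""
    return a.end1 < b.start1 and a.end2 < b.start2

def best_collinear_chain(matches: List[Match]) -> List[Match]:
    """Return a maximum-total-size chain of pairwise collinear matches."""
    # Sorting by end position guarantees every predecessor comes first.
    ordered = sorted(matches, key=lambda m: (m.end1, m.end2))
    if not ordered:
        return []
    score = [m.size for m in ordered]   # best chain size ending at each match
    back = [-1] * len(ordered)          # predecessor index, -1 for none
    for j, mj in enumerate(ordered):
        for i in range(j):
            if precedes(ordered[i], mj) and score[i] + mj.size > score[j]:
                score[j] = score[i] + mj.size
                back[j] = i
    # Trace back from the best-scoring end point.
    j = max(range(len(ordered)), key=score.__getitem__)
    chain = []
    while j != -1:
        chain.append(ordered[j])
        j = back[j]
    return chain[::-1]

# Toy input: the second match overlaps the first and crosses the third.
example = [
    Match(0, 3, 0, 3, 4),
    Match(2, 6, 5, 9, 5),
    Match(5, 8, 4, 7, 4),
    Match(10, 12, 11, 13, 3),
]
chain = best_collinear_chain(example)
```

On the toy input the overlapping second match is dropped and the remaining three matches form the chain.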
Project description: Quantitatively understanding the robustness, adaptivity and efficiency of cell cycle dynamics under the influence of noise is a fundamental but difficult question to answer for most eukaryotic organisms. Using a simplified budding yeast cell cycle model perturbed by intrinsic noise, we systematically explore these issues from an energy landscape point of view by constructing an energy landscape for the considered system based on large deviation theory. Analysis shows that the cell cycle trajectory is sharply confined by the ambient energy barrier, and the landscape along this trajectory exhibits a generally flat shape. We explain the evolution of the system on this flat path by incorporating its non-gradient nature. Furthermore, we illustrate how this global landscape changes in response to external signals, observing a nice transformation of the landscapes as the excitable system approaches a limit cycle system when nutrients are sufficient, as well as the formation of additional energy wells when the DNA replication checkpoint is activated. By taking into account the finite volume effect, we find additional pits along the flat cycle path in the landscape associated with the checkpoint mechanism of the cell cycle. The difference between the landscapes induced by intrinsic and extrinsic noise is also discussed. In our opinion, this meticulous structure of the energy landscape for our simplified model is of general interest to other cell cycle dynamics, and the proposed methods can be applied to study similar biological systems.
Project description: Src homology 2 (SH2) domains are the largest family of the peptide-recognition modules (PRMs) that bind to phosphotyrosine-containing peptides. Knowledge about binding partners of SH2 domains is key for a deeper understanding of different cellular processes. Given the high binding specificity of SH2, in-silico ligand peptide prediction is of great interest. Currently, however, only a few approaches have been published for the prediction of SH2-peptide interactions. Their main shortcomings range from limited coverage, to restrictive modeling assumptions (they are mainly based on position-specific scoring matrices and do not take into consideration complex amino acid inter-dependencies) and high computational complexity. We propose a simple yet effective machine learning approach for a large set of known human SH2 domains. We used comprehensive data from micro-array and peptide-array experiments on 51 human SH2 domains. In order to deal with the high data imbalance problem and the high signal-to-noise ratio, we cast the problem in a semi-supervised setting. We report competitive predictive performance with respect to the state of the art. Specifically, we obtain 0.83 AUC ROC and 0.93 AUC PR in comparison to 0.71 AUC ROC and 0.87 AUC PR previously achieved by the SMALI approach, which is based on position-specific scoring matrices (PSSMs). Our work provides three main contributions. First, we showed that better models can be obtained when the information on the non-interacting peptides (negative examples) is also used. Second, we improve performance when considering high-order correlations between the ligand positions, employing regularization techniques to effectively avoid overfitting issues. Third, we developed an approach to tackle the data imbalance problem using a semi-supervised strategy. Finally, we performed a genome-wide prediction of human SH2-peptide binding, uncovering several findings of biological relevance.
We make our models and genome-wide predictions, for all the 51 SH2-domains, freely available to the scientific community under the following URLs: http://www.bioinf.uni-freiburg.de/Software/SH2PepInt/SH2PepInt.tar.gz and http://www.bioinf.uni-freiburg.de/Software/SH2PepInt/Genome-wide-predictions.tar.gz, respectively.
Project description: Intensification of agricultural landscapes represents a major threat to biodiversity conservation and also affects several ecosystem services. The natural and semi-natural remnants available in the agricultural matrix represent important sites for small mammals and rodents, which are fundamental for sustaining various ecosystem functions and trophic chains. We studied the populations of two small mammals (Apodemus agrarius, A. sylvaticus) to evaluate the effects of landscape and habitat features on species abundance along a gradient of agricultural landscape intensification. The study was performed in Friuli Venezia Giulia (north-eastern Italy) over 19 months, in 19 wood remnants. Species abundance was determined using Capture-Mark-Recapture (CMR) techniques. In the same plots, the main ecological parameters of the habitat (at microhabitat and patch scale) and landscape were considered. Abundance of A. agrarius increased in landscapes with a high extent of permanent crops (i.e., orchards and poplar plantations) and a low content of undecomposed litter in the wood understory. Instead, A. sylvaticus, a more generalist species, showed an opposite, albeit less strong, relationship with the same variables. Neither species was affected by any landscape structural feature (e.g., patch shape, isolation). Our findings showed that microhabitat features and landscape composition, rather than wood and landscape structure, affect populations’ abundance and species interaction. The opposite response of the two study species was probably due to their specific ecological requirements. In this light, conservation management of agricultural landscapes should consider the ecological needs of species at both landscape and habitat levels, by rebalancing composition patterns in the context of ecological intensification, and promoting sustainable forest patch management.
Project description: INFO-RNA is a new web server for designing RNA sequences that fold into a user-given secondary structure. Furthermore, constraints on the sequence can be specified, e.g. one can restrict sequence positions to a fixed nucleotide or to a set of nucleotides. Moreover, the user can allow violations of the constraints at some positions, which can be advantageous in complicated cases. The INFO-RNA web server allows biologists to design RNA sequences in an automatic manner. It is clearly and intuitively arranged and easy to use. The procedure is fast, as most applications are completed within seconds, and it performs better and faster than other existing tools. The INFO-RNA web server is freely available at http://www.bioinf.uni-freiburg.de/Software/INFO-RNA/
Project description: Identifying sets of metastable conformations is a major research topic in RNA energy landscape analysis, and recently several methods have been proposed for finding local minima in landscapes spawned by RNA secondary structures. An important and time-critical component of such methods is steepest, or gradient, descent in attraction basins of local minima. We analyse the speed-up achievable by randomised descent in attraction basins in the context of large sample sets whose size is on the order of ~10^6. While the gain for each individual sample might be marginal, the overall run-time improvement can be significant. Moreover, for the two nongradient methods we analysed for partial energy landscapes induced by ten different RNA sequences, we found that the number of observed local minima is on average larger by 7.3% and 3.5%, respectively. The run-time improvement is approximately 16.6% and 6.8% on average over the ten partial energy landscapes. For the large sample size we selected for descent procedures, the coverage of local minima is very high up to energy values of the region where the samples were randomly selected from the partial energy landscapes; that is, the difference to the total set of local minima is mainly due to the upper area of the energy landscapes.
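One way to picture a randomised descent is a walk that accepts the first lower-energy neighbor it finds when scanning the neighborhood in random order, so that not every neighbor has to be evaluated at each step. The abstract does not specify the randomisation schemes that were analysed, so the following is only a sketch under that assumption; the `ENERGY` values and the `neighbors` definition are hypothetical toy data, not an RNA landscape.

```python
import random

# Hypothetical toy landscape: states 0..9 on a line, neighbors within distance 2.
ENERGY = [4.0, 2.0, 3.0, 1.0, 5.0, 6.0, 2.5, 0.5, 3.5, 4.5]

def neighbors(i):
    """States at distance 1 or 2 from i that lie inside the landscape."""
    return [j for j in (i - 2, i - 1, i + 1, i + 2) if 0 <= j < len(ENERGY)]

def is_local_minimum(i):
    """True if no neighbor of i has strictly lower energy."""
    return all(ENERGY[j] >= ENERGY[i] for j in neighbors(i))

def randomised_descent(start, rng):
    """Walk downhill, accepting the first lower-energy neighbor found
    when scanning the neighborhood in random order."""
    state = start
    while True:
        candidates = neighbors(state)
        rng.shuffle(candidates)
        for j in candidates:
            if ENERGY[j] < ENERGY[state]:
                state = j
                break
        else:
            return state  # no lower neighbor: a local minimum is reached

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
minima = {randomised_descent(s, rng) for s in range(len(ENERGY))}
```

Because the energy strictly decreases with every accepted move, each walk terminates in a local minimum, although not necessarily the one a steepest descent from the same start would reach.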