ABSTRACT: BACKGROUND: The center string (or closest string) problem is a classic computer science problem with important applications in computational biology. Given k input strings and a distance threshold d, we search for a string within Hamming distance at most d to each input string. This problem is NP complete. RESULTS: In this paper, we focus on exact methods for the problem that are also swift in application. We first introduce data reduction techniques that allow us to infer that certain instances have no solution, or that a center string must satisfy certain conditions. We describe how to use this information to speed up two previously published search tree algorithms. Then, we describe a novel iterative search strategy that is efficient in practice, where some of our reduction techniques can also be applied. Finally, we present results of an evaluation study for two different data sets from a biological application. CONCLUSIONS: We find that the running time for computing the optimal center string is dominated by the subroutine calls for d = dopt -1 and d = dopt. Our data reduction is very effective for both, either rejecting unsolvable instances or solving trivial positions. We find that this speeds up computations considerably.
Project description:Supertree methods enable the reconstruction of large phylogenies. The supertree problem can be formalized in different ways in order to cope with contradictory information in the input. Some supertree methods are based on encoding the input trees in a matrix; other methods try to find minimum cuts in some graph. Recently, we introduced Bad Clade Deletion (BCD) supertrees which combines the graph-based computation of minimum cuts with optimizing a global objective function on the matrix representation of the input trees. The BCD supertree method has guaranteed polynomial running time and is very swift in practice. The quality of reconstructed supertrees was superior to matrix representation with parsimony (MRP) and usually on par with SuperFine for simulated data; but particularly for biological data, quality of BCD supertrees could not keep up with SuperFine supertrees. Here, we present a beam search extension for the BCD algorithm that keeps alive a constant number of partial solutions in each top-down iteration phase. The guaranteed worst-case running time of the new algorithm is still polynomial in the size of the input. We present an exact and a randomized subroutine to generate suboptimal partial solutions. Both beam search approaches consistently improve supertree quality on all evaluated datasets when keeping 25 suboptimal solutions alive. Supertree quality of the BCD Beam Search algorithm is on par with MRP and SuperFine even for biological data. This is the best performance of a polynomial-time supertree algorithm reported so far.
Project description:When facing an unsolvable problem, dogs exhibit spontaneous human-oriented behaviours (e.g. looking at the human partner, gaze alternations between the human and the target) sooner and for longer than domestic cats and hand-raised wolves. These behaviours have been interpreted as interspecific communicative acts aimed to initiate interaction. Here, we compare the emergence of human-oriented behaviours (e.g. orientation towards humans, orientation alternations, vocalizations) in similarly raised family dogs and miniature pigs utilising an unsolvable task paradigm which consists of Baseline (no task), Solvable and Unsolvable phases. Relative to the Baseline phase in which both species showed human-oriented behaviours to a similar extent, during the Unsolvable phase dogs showed more and pigs showed less such behaviours. Species-predispositions in communicative behaviour may explain why dogs have a higher inclination than pigs to initiate interspecific interactions with humans in problem-solving contexts.
Project description:Due to our long history of living in close association with horses, these animals are suggested to have enhanced skills in understanding and communicating with humans. Today, horses have become important to humans for sport and leisure and their understanding of human behaviour and their human-directed behaviour are therefore of great importance. In this study, we investigated 22 horses in a human contact-seeking experiment where they were presented with an unsolvable problem and a detour experiment with a human demonstrator. The unsolvable problem consisted of pieces of carrot in a closed bucket and the detour resembled the shape of V. Additionally, personality traits of the participating horses were assessed. Interestingly, the full-sized horses (N?=?11) showed more human-related behaviours when presented with an unsolvable problem compared to before the carrots were made unreachable (p?=?0.033), while the ponies (N?=?11) did not. However, neither the full-sized horses nor the ponies were significantly more successful in the detour after human demonstrations than in control trials. When comparing the two experiments, we found the task-oriented behaviour in the detour test to positively correlate with human proximity and eye contact-seeking behaviour towards humans during the unsolvable problem in the contact-seeking test. Interestingly, again this was only true for the full-sized horses (p?<?0.05) and not for the ponies. From the horse personality questionnaire results, the traits excitability and anxiousness revealed strong negative correlations with human-directed behaviour in the contact-seeking experiment (p?<?0.05). Hence, size (full-sized horse/pony) and personality influenced the human-related behaviours of the horses and we suggest a future focus on these aspects to deepen our understanding of human-horse communication.
Project description:BACKGROUND:Given a peptide as a string of amino acids, the masses of all its prefixes and suffixes can be found by a trivial linear scan through the amino acid masses. The inverse problem is the idealde novopeptide sequencing problem: Given all prefix and suffix masses, determine the string of amino acids. In biological reality, the given masses are measured in a lab experiment, and measurements by necessity are noisy. The (real, noisy) de novo peptide sequencing problem therefore has a noisy input: a few of the prefix and suffix masses of the peptide are missing and a few other masses are given in addition. For this setting, we ask for an amino acid string that explains the given masses as accurately as possible. RESULTS:Past approaches interpreted accuracy by searching for a string that explains as many masses as possible. We feel, however, that it is not only bad to not explain a mass that appears, but also to explain a mass that does not appear. We propose to minimize the symmetric difference between the set of given masses and the set of masses that the string explains. For this new optimization problem, we propose an efficient algorithm that computes both the best and the k best solutions. Proof-of-concept experiments on measurements of synthesized peptides show that our approach leads to better results compared to finding a string that explains as many given masses as possible. CONCLUSIONS:We conclude that considering the symmetric difference as optimization goal can improve the identification rates for de novo peptide sequencing. A preliminary version of this work has been presented at WABI 2016.
Project description:Motif search is a fundamental problem in bioinformatics with an important application in locating transcription factor binding sites (TFBSs) in DNA sequences. The exact algorithms can report all (l, d) motifs and find the best one under a specific objective function. However, it is still a challenging task to identify weak motifs, since either a large amount of memory or execution time is required by current exact algorithms. A new exact algorithm, PairMotif, is proposed for planted (l, d) motif search (PMS) in this paper. To effectively reduce both candidate motifs and scanned l-mers, multiple pairs of l-mers with relatively large distances are selected from input sequences to restrict the search space. Comparisons with several recently proposed algorithms show that PairMotif requires less storage space and runs faster on most PMS instances. Particularly, among the algorithms compared, only PairMotif can solve the weak instance (27, 9) within 10 hours. Moreover, the performance of PairMotif is stable over the sequence length, which allows it to identify motifs in longer sequences. For the real biological data, experimental results demonstrate the validity of the proposed algorithm.
Project description:<h4>Background</h4>To infer a species phylogeny from unlinked genes, phylogenetic inference methods must confront the biological processes that create incongruence between gene trees and the species phylogeny. Intra-specific gene variation in ancestral species can result in deep coalescence, also known as incomplete lineage sorting, which creates incongruence between gene trees and the species tree. One approach to account for deep coalescence in phylogenetic analyses is the deep coalescence problem, which takes a collection of gene trees and seeks the species tree that implies the fewest deep coalescence events. Although this approach is promising for phylogenetics, the consensus properties of this problem are mostly unknown and analyses of large data sets may be computationally prohibitive.<h4>Results</h4>We prove that the deep coalescence consensus tree problem satisfies the highly desirable Pareto property for clusters (clades). That is, in all instances, each cluster that is present in all of the input gene trees, called a consensus cluster, will also be found in every optimal solution. Moreover, we introduce a new divide and conquer method for the deep coalescence problem based on the Pareto property. This method refines the strict consensus of the input gene trees, thereby, in practice, often greatly reducing the complexity of the tree search and guaranteeing that the estimated species tree will satisfy the Pareto property.<h4>Conclusions</h4>Analyses of both simulated and empirical data sets demonstrate that the divide and conquer method can greatly improve upon the speed of heuristics that do not consider the Pareto consensus property, while also guaranteeing that the proposed solution fulfills the Pareto property. The divide and conquer method extends the utility of the deep coalescence problem to data sets with enormous numbers of taxa.
Project description:Population pharmacokinetic analysis is used to estimate pharmacokinetic parameters and their variability from concentration data. Due to data sparseness issues, available datasets often do not allow the estimation of all parameters of the suitable model. The PRIOR subroutine in NONMEM supports the estimation of some or all parameters with values from previous models, as an alternative to fixing them or adding data to the dataset. From a literature review, the best practices were compiled to provide a practical guidance for the use of the PRIOR subroutine in NONMEM. Thirty-three articles reported the use of the PRIOR subroutine in NONMEM, mostly in special populations. This approach allowed fast, stable and satisfying modelling. The guidance provides general advice on how to select the most appropriate reference model when there are several previous models available, and to implement and weight the selected parameter values in the PRIOR function. On the model built with PRIOR, the similarity of estimates with the ones of the reference model and the sensitivity of the model to the PRIOR values should be checked. Covariates could be implemented a priori (from the reference model) or a posteriori, only on parameters estimated without prior (search for new covariates).
Project description:Computational methods attempting to identify instances of cis-regulatory modules (CRMs) in the genome face a challenging problem of searching for potentially interacting transcription factor binding sites while knowledge of the specific interactions involved remains limited. Without a comprehensive comparison of their performance, the reliability and accuracy of these tools remains unclear. Faced with a large number of different tools that address this problem, we summarized and categorized them based on search strategy and input data requirements. Twelve representative methods were chosen and applied to predict CRMs from the Drosophila CRM database REDfly, and across the human ENCODE regions. Our results show that the optimal choice of method varies depending on species and composition of the sequences in question. When discriminating CRMs from non-coding regions, those methods considering evolutionary conservation have a stronger predictive power than methods designed to be run on a single genome. Different CRM representations and search strategies rely on different CRM properties, and different methods can complement one another. For example, some favour homotypical clusters of binding sites, while others perform best on short CRMs. Furthermore, most methods appear to be sensitive to the composition and structure of the genome to which they are applied. We analyze the principal features that distinguish the methods that performed well, identify weaknesses leading to poor performance, and provide a guide for users. We also propose key considerations for the development and evaluation of future CRM-prediction methods.
Project description:<h4>Background</h4>The problem of finding a Shortest Common Supersequence (SCS) of a set of sequences is an important problem with applications in many areas. It is a key problem in biological sequences analysis. The SCS problem is well-known to be NP-complete. Many heuristic algorithms have been proposed. Some heuristics work well on a few long sequences (as in sequence comparison applications); others work well on many short sequences (as in oligo-array synthesis). Unfortunately, most do not work well on large SCS instances where there are many, long sequences.<h4>Results</h4>In this paper, we present a Deposition and Reduction (DR) algorithm for solving large SCS instances of biological sequences. There are two processes in our DR algorithm: deposition process, and reduction process. The deposition process is responsible for generating a small set of common supersequences; and the reduction process shortens these common supersequences by removing some characters while preserving the common supersequence property. Our evaluation on simulated data and real DNA and protein sequences show that our algorithm consistently produces the best results compared to many well-known heuristic algorithms, and especially on large instances.<h4>Conclusion</h4>Our DR algorithm provides a partial answer to the open problem of designing efficient heuristic algorithm for SCS problem on many long sequences. Our algorithm has a bounded approximation ratio. The algorithm is efficient, both in running time and space complexity and our evaluation shows that it is practical even for SCS problems on many long sequences.
Project description:Continuous time quantum walks provide an important framework for designing new algorithms and modelling quantum transport and state transfer problems. Often, the graph representing the structure of a problem contains certain symmetries that confine the dynamics to a smaller subspace of the full Hilbert space. In this work, we use invariant subspace methods, that can be computed systematically using the Lanczos algorithm, to obtain the reduced set of states that encompass the dynamics of the problem at hand without the specific knowledge of underlying symmetries. First, we apply this method to obtain new instances of graphs where the spatial quantum search algorithm is optimal: complete graphs with broken links and complete bipartite graphs, in particular, the star graph. These examples show that regularity and high-connectivity are not needed to achieve optimal spatial search. We also show that this method considerably simplifies the calculation of quantum transport efficiencies. Furthermore, we observe improved efficiencies by removing a few links from highly symmetric graphs. Finally, we show that this reduction method also allows us to obtain an upper bound for the fidelity of a single qubit transfer on an XY spin network.