Project description:Copy number variation (CNV) is a well-known type of genomic mutation that is associated with the development of human cancers. Detecting CNVs in the human genome is a crucial step in the pipeline from mutation analysis to cancer diagnosis and treatment. Next-generation sequencing (NGS) data provides an unprecedented opportunity for CNV detection at base-level resolution, and many methods have been developed for CNV detection using NGS data. However, due to the intrinsic complexity of CNV structures and of NGS data itself, accurate detection of CNVs still faces many challenges. In this paper, we present an alternative method, called KNNCNV (K-Nearest Neighbor based CNV detection), for the detection of CNVs using NGS data. Compared to current methods, KNNCNV has two distinctive features: 1) it assigns an outlier score to each genome segment based solely on its first k nearest-neighbor distances, which is not only easy to extend to other data types but also improves the power of discovering CNVs, especially local CNVs that are likely to be masked by their surrounding regions; 2) it employs the variational Bayesian Gaussian mixture model (VBGMM) to transform these scores into a series of binary labels without a user-defined threshold. To evaluate the performance of KNNCNV, we conduct experiments on both simulated and real sequencing data and make comparisons with peer methods. The experimental results show that KNNCNV achieves a higher F1-score than the peer methods.
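As a rough illustration of the first feature above, the following sketch scores each segment by the mean distance to its k nearest neighbors. The abstract does not give the exact score definition, so the mean-of-k-distances rule, the function name `knn_outlier_scores`, and the read-depth values are all assumptions made for this example. The VBGMM labeling step is not shown.

```python
def knn_outlier_scores(values, k=3):
    """Mean absolute distance from each value to its k nearest neighbors.

    Brute-force 1-D sketch; the scoring rule is an assumption, not
    necessarily the one used by KNNCNV.
    """
    if not 0 < k < len(values):
        raise ValueError("k must be between 1 and len(values) - 1")
    scores = []
    for i, v in enumerate(values):
        # Distances to every other segment, smallest first
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Hypothetical per-segment read depths; the sixth value mimics a local gain
read_depth = [30.0, 31.0, 29.5, 30.5, 30.2, 58.0, 29.8, 30.9]
scores = knn_outlier_scores(read_depth, k=3)

# Index of the segment with the largest outlier score
top = max(range(len(scores)), key=scores.__getitem__)
```

In a fuller pipeline the scores would then be passed to a mixture model to obtain binary labels, which is the role the abstract assigns to the VBGMM.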
Project description:Cooperativity is a hallmark of protein folding, but the thermodynamic origins of cooperativity are difficult to quantify. Tandem repeat proteins provide a unique experimental system to quantify cooperativity due to their internal symmetry and their tolerance of deletion, extension, and in some cases fragmentation into single repeats. Analysis of repeat proteins of different lengths with nearest-neighbor Ising models provides values for repeat folding (ΔGi) and inter-repeat coupling (ΔGi-1,i). In this article, we review the architecture of repeat proteins and classify them in terms of ΔGi and ΔGi-1,i; this classification scheme groups repeat proteins according to their degree of cooperativity. We then present various statistical thermodynamic models, based on the 1D-Ising model, for analysis of different classes of repeat proteins. We use these models to analyze data for highly and moderately cooperative and noncooperative repeat proteins and relate their fitted parameters to overall structural features.
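To make the role of ΔGi and ΔGi-1,i concrete, here is a minimal sketch of a homopolymer nearest-neighbor Ising partition function, assuming every repeat shares one folding free energy and one coupling free energy. The function names, the kJ/mol units, and the numerical values are hypothetical choices for this example. The recursion is checked against explicit enumeration of all 2^n folded/unfolded states.

```python
import math
from itertools import product

R = 8.314e-3  # gas constant in kJ/(mol K)


def boltzmann_weights(dG_i, dG_c, T):
    """Statistical weights for repeat folding (kappa) and coupling (tau)."""
    return math.exp(-dG_i / (R * T)), math.exp(-dG_c / (R * T))


def partition_function(n, dG_i, dG_c, T=298.15):
    """Transfer-matrix style recursion over n identical repeats."""
    kappa, tau = boltzmann_weights(dG_i, dG_c, T)
    # Summed weights of partial chains whose last repeat is folded / unfolded
    folded, unfolded = 0.0, 1.0
    for _ in range(n):
        folded, unfolded = kappa * (tau * folded + unfolded), folded + unfolded
    return folded + unfolded


def partition_function_brute(n, dG_i, dG_c, T=298.15):
    """Same quantity by enumerating every folded (1) / unfolded (0) state."""
    kappa, tau = boltzmann_weights(dG_i, dG_c, T)
    total = 0.0
    for state in product((0, 1), repeat=n):
        n_folded = sum(state)
        n_pairs = sum(a * b for a, b in zip(state, state[1:]))
        total += kappa ** n_folded * tau ** n_pairs
    return total


# Hypothetical values: unfavorable repeat folding, favorable coupling
q = partition_function(5, dG_i=5.0, dG_c=-12.0)
```

With an unfavorable ΔGi and a strongly favorable ΔGi-1,i, the fully folded state dominates only once several repeats are present, which is one way to picture the cooperativity the abstract discusses.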
Project description:Rationale: Advanced algorithmic solutions are necessary to process the ever-increasing amounts of mass spectrometry data that are being generated. In this study, we describe the falcon spectrum clustering tool for efficient clustering of millions of MS/MS spectra. Methods: falcon succeeds in efficiently clustering large amounts of mass spectral data using advanced techniques for fast spectrum similarity searching. First, high-resolution spectra are binned and converted to low-dimensional vectors using feature hashing. Next, the spectrum vectors are used to construct nearest neighbor indexes for fast similarity searching. The nearest neighbor indexes are used to efficiently compute a sparse pairwise distance matrix without having to exhaustively perform all pairwise spectrum comparisons within the relevant precursor mass tolerance. Finally, density-based clustering is performed to group similar spectra into clusters. Results: Several state-of-the-art spectrum clustering tools were evaluated using a large draft human proteome data set consisting of 25 million spectra, indicating that alternative tools produce clustering results with different characteristics. Notably, falcon generates larger highly pure clusters than alternative tools, leading to a larger reduction in data volume without the loss of relevant information for more efficient downstream processing. Conclusions: falcon is a highly efficient spectrum clustering tool, which is publicly available as open source under the permissive BSD license at https://github.com/bittremieux/falcon.
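The first step described above (binning followed by feature hashing) can be sketched as follows. This is an illustration under stated assumptions, not falcon's implementation: the bin width, the vector dimension, the use of `zlib.crc32` as a stand-in hash function, and the function names are all hypothetical.

```python
import math
import zlib


def hash_spectrum(peaks, bin_width=0.05, dim=64):
    """Map (m/z, intensity) peaks to a unit-length vector of fixed size.

    Each m/z value is binned, the bin index is hashed to one of `dim`
    slots, and intensities that land in the same slot are summed.
    """
    vec = [0.0] * dim
    for mz, intensity in peaks:
        bin_idx = int(mz / bin_width)
        # crc32 is a deterministic stand-in for the real hash function
        slot = zlib.crc32(str(bin_idx).encode()) % dim
        vec[slot] += intensity
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(u, v):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(a * b for a, b in zip(u, v))


# Two hypothetical spectra sharing two of their three peaks
spec_a = hash_spectrum([(100.02, 10.0), (250.11, 5.0), (380.40, 2.0)])
spec_b = hash_spectrum([(100.02, 9.0), (250.11, 6.0), (512.30, 1.0)])
similarity = cosine(spec_a, spec_b)
```

Because every spectrum becomes a vector of the same small dimension, the vectors can be fed to a nearest neighbor index, which is the next step the abstract describes.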
Project description:In this paper, we propose a framework for efficiently accelerating Nearest Neighbor Particle (NNP) calculations in a movable particle-based system by leveraging the dynamic changes in disk sectors. The NNP region based on particles and disk sectors is determined by the following three conditions: 1) The position of the disk resides within the range of neighbor particles. 2) The position of a neighbor particle exists within a disk sector. 3) A neighbor particle exists between the two vectors that form the disk sector. When all of these conditions are satisfied, we assume that there is a particle within the disk sector. We automatically update the inspection range of NNP, which is the disk sector, based on the movement of particles. To calculate the dynamic changes in the disk sector, we control the direction, length, and angle of the disk based on the positions and velocities of particles. Ultimately, we accelerate the computation of NNP by utilizing the particles located within the calculated disk sector. The proposed acceleration method can be implemented simply, as it operates on the particles within the disk sector using closed-form expressions, without explicit data structures such as trees. For movable particles in particular, unlike the conventional adaptive tree approach that requires continuous data structure updates, the proposed method can be used efficiently in applications requiring NNP, because it rapidly calculates collision areas using closed-form expressions that are adjusted according to the particles' motion. In experiments conducted across diverse scenes, our method was 2 to 20 times faster than hash tables or k-d trees. Furthermore, its scalability was demonstrated through its application in various scenarios (particle-based fluids, splash and foam, isoline tracking, turbulent flow, collision handling).
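The three conditions above amount to a closed-form point-in-sector test. A minimal 2-D sketch follows, assuming the sector is described by an origin, a central direction, a radius, and a half-angle. This parameterization and the function names are assumptions for illustration, and the paper's rule for updating the sector from particle velocities is not reproduced here.

```python
import math


def in_sector(origin, direction, radius, half_angle, point):
    """Return True if `point` lies inside the disk sector.

    The sector is centered on `direction`, spans `half_angle` radians to
    either side of it, and extends `radius` from `origin`.
    """
    dx, dy = point[0] - origin[0], point[1] - origin[1]
    dist = math.hypot(dx, dy)
    if dist > radius:
        return False  # outside the disk
    if dist == 0.0:
        return True  # the point coincides with the sector origin
    dir_len = math.hypot(direction[0], direction[1])
    if dir_len == 0.0:
        return False  # no direction, so no sector to test against
    # Angle test via the dot product: lies between the two bounding vectors
    cos_angle = (dx * direction[0] + dy * direction[1]) / (dist * dir_len)
    return cos_angle >= math.cos(half_angle)


def neighbors_in_sector(particles, origin, direction, radius, half_angle):
    """Filter a list of 2-D particle positions down to those in the sector."""
    return [p for p in particles
            if in_sector(origin, direction, radius, half_angle, p)]


# Hypothetical particle positions around a sector pointing along +x
particles = [(1.0, 0.5), (1.0, 2.0), (-1.0, 0.0), (0.5, 1.0)]
hits = neighbors_in_sector(particles, (0.0, 0.0), (1.0, 0.0), 2.0, math.pi / 4)
```

The test uses only a distance and a dot product per particle, which is consistent with the abstract's point that no tree-like structure has to be built or updated.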
Project description:The linear "Ising" model, which has been around for nearly a century, treats the behavior of linear arrays of repetitive, interacting subunits. Linear "repeat-proteins" have only been described in the last decade or so, and their folding energies have only been characterized very recently. Owing to their repetitive structures, linear repeat-proteins are particularly well suited for analysis by the nearest-neighbor Ising formalism. After briefly describing the historical origins and applications of the Ising model to biopolymers, and introducing repeat protein structure, this chapter will focus on the application of the linear Ising model to repeat proteins. When applied to homopolymers, the model can be represented and applied in a fairly simplified form. When applied to heteropolymers, where differences in energies among individual subunits (i.e. repeats) must be included, some (but not all) of this simplicity is lost. Derivations of the linear Ising model for both homopolymer and heteropolymer repeat-proteins will be presented. With the increased complexity required for analysis of heteropolymeric repeat proteins, the ability to resolve different energy terms from experimental data can be compromised. Thus, a simple matrix approach will be developed to help inform on the degree to which different thermodynamic parameters can be extracted from a particular set of unfolding curves. Finally, we will describe the application of these models to analyze repeat-protein folding equilibria, focusing on simplified repeat proteins based on "consensus" sequence information.
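For orientation, one common way to write the linear Ising partition function for a heteropolymer of n repeats is the transfer-matrix form below. The symbols are notation assumed here rather than taken from the text above: κ_i is the statistical weight for folding repeat i, τ_{i-1,i} is the weight for the interface between folded repeats i-1 and i, and ΔG_i and ΔG_{i-1,i} are the corresponding free energies.

```latex
\kappa_i = e^{-\Delta G_i / RT}, \qquad
\tau_{i-1,i} = e^{-\Delta G_{i-1,i} / RT}

W_i = \begin{pmatrix} \kappa_i \tau_{i-1,i} & 1 \\ \kappa_i & 1 \end{pmatrix},
\qquad
q = \begin{pmatrix} 0 & 1 \end{pmatrix}
    W_1 W_2 \cdots W_n
    \begin{pmatrix} 1 \\ 1 \end{pmatrix}
```

For two repeats this gives q = 1 + κ_1 + κ_2 + κ_1 κ_2 τ_{1,2}, one term per folded/unfolded configuration. When all repeats are identical, the product collapses to W^n, which is the simplified homopolymer form.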
Project description:Motivation: Folding during transcription can have an important influence on the structure and function of RNA molecules, as regions closer to the 5' end can fold into metastable structures before potentially stronger interactions with the 3' end become available. Thermodynamic RNA folding models are not suitable to predict structures that result from cotranscriptional folding, as they can only calculate properties of the equilibrium distribution. Other software packages that simulate the kinetic process of RNA folding during transcription exist, but they are mostly applicable to short sequences. Results: We present a new algorithm that tracks changes to the RNA secondary structure ensemble during transcription. At every transcription step, new representative local minima are identified, a neighborhood relation is defined and transition rates are estimated for kinetic simulations. After every simulation, a part of the ensemble is removed and the remainder is used to search for new representative structures. The presented algorithm is deterministic (up to numeric instabilities of simulations), fast (in comparison with existing methods), and it is capable of folding RNAs much longer than 200 nucleotides. Availability and implementation: This software is open-source and available at https://github.com/ViennaRNA/drtransformer. Supplementary information: Supplementary data are available at Bioinformatics online.
Project description:This data study describes an assessment of daily rainfall nearest-neighbor patterns (DRBBP) in Iran. The article presents spatial nearest-neighbor patterns of daily rainfall for Iran, derived from 170 stations and 31195 rainfall points, by comparing ordinary kriging techniques based on the forecast models. For the nearest-neighbor patterns, the 1975-2014 rainfall data series was used to estimate the point data of daily rainfall. Analysis of the statistical properties indicated an increase in dispersed variability patterns of daily rainfall in Iran. Dispersed patterns were selected as the best nearest-neighbor models of daily rainfall variability. The results will help climatologists and hydrologists with model assessment and planning of the natural environment in Iran.
Project description:By replacing lost or dysfunctional myocardium, tissue regeneration is a promising approach to treat heart failure. However, the challenge of detecting bona fide heart regeneration limits the validation of potential regenerative factors. One method to detect new cardiomyocytes is multicolor lineage tracing with clonal analysis. Clonal analysis experiments can be difficult to undertake, because labeling conditions that are too sparse lack sensitivity for rare events such as cardiomyocyte proliferation, and diffuse labeling limits the ability to resolve clones. Presented here is a protocol to undertake clonal analysis of the neonatal mouse heart by using statistical modeling of nearest neighbor distributions to resolve cardiomyocyte clones. This approach enables resolution of clones over a range of labeling conditions and provides a robust analytical approach for quantifying cardiomyocyte proliferation and regeneration. This protocol can be adapted to other tissues and can be broadly used to study tissue regeneration.
Project description:We propose a novel local nearest neighbor distance (LNND) descriptor for anomaly detection in crowded scenes. Compared with the low-level feature descriptors commonly used in previous work, the LNND descriptor has two major advantages. First, it efficiently incorporates spatial and temporal contextual information around the video event, which is important for detecting anomalous interactions among multiple events, whereas most existing feature descriptors contain only the information of a single event. Second, it is a compact representation whose dimensionality is typically much lower than that of the low-level feature descriptor. Therefore, using the LNND descriptor in an anomaly detection method trained offline not only saves computation time and storage but also avoids the drawbacks of high-dimensional feature descriptors. We validate the effectiveness of the LNND descriptor through extensive experiments on different benchmark datasets. Experimental results show that the LNND-based method performs promisingly against state-of-the-art methods. Notably, the LNND-based approach requires fewer intermediate processing steps and no subsequent processing such as smoothing, yet achieves comparable or even better performance.
Project description:We consider two types of spatial symmetry, namely, symmetry in the mixed or shared nearest neighbor (NN) structures. We use Pielou's and Dixon's symmetry tests, which are defined using contingency tables based on the NN relationships between the data points. We generalize these tests to multiple classes and demonstrate that both the asymptotic and exact versions of Pielou's first type of symmetry test are extremely conservative in rejecting symmetry in the mixed NN structure and hence should be avoided, or else only the Monte Carlo randomized version should be used. Under random labeling (RL), we derive the asymptotic distribution for Dixon's symmetry test and also observe that the usual independence test seems to be appropriate for Pielou's second type of test. Moreover, we apply variants of Fisher's exact test on the shared NN contingency table for Pielou's second test and determine the most appropriate version for our setting. We also consider pairwise and one-versus-rest type tests in post hoc analysis after a significant overall symmetry test. We investigate the asymptotic properties of the tests, prove their consistency under appropriate null hypotheses, and assess their finite-sample performance through extensive Monte Carlo simulations. The methods are illustrated on a real-life ecological data set.
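The tests above all start from a contingency table of NN relationships. The sketch below builds such a table by brute force, counting (class of point, class of its nearest neighbor) pairs. The function name and the toy coordinates are hypothetical, ties are broken by taking the first point found, and no test statistic is computed.

```python
import math
from collections import Counter


def nn_contingency_table(points, labels):
    """Count (label of point, label of its nearest neighbor) pairs."""
    table = Counter()
    for i, p in enumerate(points):
        # Index of the closest other point; ties go to the lowest index
        j = min((k for k in range(len(points)) if k != i),
                key=lambda k: math.dist(p, points[k]))
        table[(labels[i], labels[j])] += 1
    return table


# Toy data: two classes laid out on a line
points = [(0.0, 0.0), (1.0, 0.0), (1.4, 0.0), (10.0, 0.0), (10.5, 0.0)]
labels = ["A", "A", "B", "B", "B"]
table = nn_contingency_table(points, labels)
```

In this toy table the off-diagonal counts for (A, B) and (B, A) are both 1. Comparing such off-diagonal cells is the kind of question a symmetry test in the mixed NN structure addresses.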