Uncovering and testing the fuzzy clusters based on lumped Markov chain in complex network.
ABSTRACT: Identifying clusters, namely groups of nodes with comparatively strong internal connectivity, is a fundamental task for understanding the structure and function of a network. By means of a lumped Markov chain model of a random walker, we propose two novel ways of inferring the lumped Markov transition matrix. Furthermore, some useful results are derived from an analysis of the properties of the lumped Markov process. To find the best partition of a complex network, we derive a novel framework, based on the optimal lumped Markovian dynamics, that includes two algorithms for network partition. The algorithms are constructed to minimize the objective function under this framework. Simulation experiments demonstrate that our algorithms can efficiently determine the probabilities with which a node belongs to different clusters during the learning process, and that they naturally support fuzzy partitions. Moreover, they are successfully applied to real-world networks, including the social interactions between members of a karate club.
Project description:Identifying communities (or clusters), namely groups of nodes with comparatively strong internal connectivity, is a fundamental task for deeply understanding the structure and function of a network. Yet, there is a lack of formal criteria for defining communities and for testing their significance. We propose a sharp definition that is based on a quality threshold. By means of a lumped Markov chain model of a random walker, a quality measure called "persistence probability" is associated with a cluster, which is then defined as an "α-community" if such a probability is not smaller than α. Consistently, a partition composed of α-communities is an "α-partition." These definitions turn out to be very effective for finding and testing communities. If a set of candidate partitions is available, setting the desired α-level allows one to immediately select the α-partition with the finest decomposition. Simultaneously, the persistence probabilities quantify the quality of each single community. Given its ability to assess each cluster individually, this approach can also disclose single well-defined communities even in networks that overall do not possess a definite clustered structure.
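As a rough illustration of the persistence probability described above, the following Python sketch computes, for each cluster of a given partition, the probability that a random walker currently in the cluster is still in it after one step, and then checks the partition against a threshold (written `alpha` here). The function names, the NumPy-based representation, and the restriction to an undirected graph without isolated nodes (so that the stationary distribution is proportional to node strength) are assumptions made for this example, not details taken from the abstract.

```python
import numpy as np

def persistence_probabilities(adjacency, labels):
    """Return {cluster label: persistence probability}.

    Assumes an undirected (symmetric) adjacency matrix with no isolated
    nodes, so the random walk's stationary distribution is proportional
    to node strength.
    """
    A = np.asarray(adjacency, dtype=float)
    labels = np.asarray(labels)
    strength = A.sum(axis=1)          # degree (or weighted strength) of each node
    P = A / strength[:, None]         # row-stochastic transition matrix
    pi = strength / strength.sum()    # stationary distribution of the walker
    result = {}
    for c in np.unique(labels):
        members = labels == c
        # Stationary probability of being in c now and still in c after one step.
        stay = (pi[members, None] * P[np.ix_(members, members)]).sum()
        result[c.item()] = stay / pi[members].sum()
    return result

def is_alpha_partition(adjacency, labels, alpha):
    """True if every cluster's persistence probability is at least alpha."""
    probs = persistence_probabilities(adjacency, labels)
    return all(u >= alpha for u in probs.values())

# Two triangles joined by a single edge (2-3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
labels = [0, 0, 0, 1, 1, 1]
u = persistence_probabilities(A, labels)
```

In this toy graph each triangle has total strength 7, of which 6 stays inside, so both clusters have persistence probability 6/7 and the partition passes a threshold of 0.8 but not 0.9.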
Project description:Given a large and complex network, we would like to find the best partition of this network into a small number of clusters. This question has been addressed in many different ways. Here we propose a strategy along the lines of optimal prediction for the Markov chains associated with the dynamics on these networks. We develop the necessary ingredients for such an optimal partition strategy, and we compare our strategy with the previous ones. We show that when the Markov chain is lumpable, we recover the partition with respect to which the chain is lumpable. We also discuss the case of well-clustered networks. Finally, we illustrate our strategy on several examples.
Project description:In order to find the origins of electromagnetic noise in the time domain, we formulate a system of lumped parameter circuits and multiconductor transmission lines (MTL). We present a discretized approach to treat any lumped parameter circuits and MTL systems, and the boundary conditions between these systems, where the lumped parameter circuits are described by coupled differential equations, and the MTL systems by coupled partial-differential equations. The introduction of the time-domain impedance and the element matrices enables us to perform a time-domain analysis that includes dependent sources and the coupling devices in the framework of the circuit theory. For three-line systems, we are able to calculate the coupling of the normal, common, and antenna modes, and to find out methods to reduce the noise.
Project description:We present a framework to simulate SIR processes on networks using weighted shortest paths. Our framework maps the SIR dynamics to weights assigned to the edges of the network, which can be done for Markovian and non-Markovian processes alike. The weights represent the propagation time between the adjacent nodes for a particular realization. We simulate the dynamics by constructing an ensemble of such realizations, which can be done by using a Markov Chain Monte Carlo method or by direct sampling. The former provides a runtime advantage when realizations from all possible sources are computed as the weighted shortest paths can be re-calculated more efficiently. We apply our framework to three empirical networks and analyze the expected propagation time between all pairs of nodes. Furthermore, we have employed our framework to perform efficient source detection and to improve strategies for time-critical vaccination.
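The mapping from SIR dynamics to edge weights can be sketched as follows for the simplest Markovian case. This is an illustrative example under stated assumptions, not the authors' implementation: infectious periods and transmission times are assumed exponential with hypothetical rates `gamma` and `beta`, an edge whose transmission time exceeds the source node's infectious period gets infinite weight, and realizations are drawn by direct sampling rather than by Markov chain Monte Carlo. The helper names `sample_realization` and `infection_times` are invented for this sketch.

```python
import heapq
import math
import random

def sample_realization(neighbors, beta, gamma, rng):
    """Draw one realization: a propagation time for every directed edge."""
    weights = {}
    for i, nbrs in neighbors.items():
        recovery = rng.expovariate(gamma)   # infectious period of node i (assumed exponential)
        for j in nbrs:
            t = rng.expovariate(beta)       # time node i would need to infect j
            # Transmission only happens if it occurs before i recovers.
            weights[(i, j)] = t if t < recovery else math.inf
    return weights

def infection_times(neighbors, weights, source):
    """Dijkstra's algorithm: earliest infection time of each reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist.get(i, math.inf):
            continue                        # stale heap entry
        for j in neighbors[i]:
            nd = d + weights[(i, j)]
            if nd < dist.get(j, math.inf):  # infinite weights never pass this test
                dist[j] = nd
                heapq.heappush(heap, (nd, j))
    return dist

# Deterministic check of the shortest-path step on a hand-built realization.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
fixed = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): math.inf,
         (2, 1): 2.0, (0, 2): 5.0, (2, 0): 5.0}
times = infection_times(neighbors, fixed, source=0)

# One randomly sampled realization (seeded for reproducibility).
sampled = sample_realization(neighbors, beta=1.0, gamma=0.5, rng=random.Random(0))
```

Nodes missing from the returned dictionary are never infected in that realization. Averaging `infection_times` over many sampled realizations would give an estimate of the expected propagation time between node pairs.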
Project description:In the post-genomic era, genome-scale metabolic networks (GEMs) have emerged as invaluable tools to understand metabolic capabilities of organisms. Different parts of these metabolic networks are defined as subsystems/pathways, which are sets of functional roles to implement a specific biological process or structural complex, such as glycolysis and TCA cycle. Subsystem/pathway definition is also employed to delineate the biosynthetic routes that produce biomass building blocks. In databases, such as MetaCyc and SEED, these representations are composed of linear routes from precursors to target biomass building blocks. However, this approach cannot capture the nested, complex nature of GEMs. Here we implemented an algorithm, lumpGEM, which generates biosynthetic subnetworks composed of reactions that can synthesize a target metabolite from a set of defined core precursor metabolites. lumpGEM captures balanced subnetworks, which account for the fate of all metabolites along the synthesis routes, thus encapsulating reactions from various subsystems/pathways to balance these metabolites in the metabolic network. Moreover, lumpGEM collapses these subnetworks into elementally balanced lumped reactions that specify the cost of all precursor metabolites and cofactors. It also generates alternative subnetworks and lumped reactions for the same metabolite, accounting for the flexibility of organisms. lumpGEM is applicable to any GEM and any target metabolite defined in the network. Lumped reactions generated by lumpGEM can also be used to generate properly balanced reduced core metabolic models.
Project description:There has been increasing interest in applying Bayesian nonparametric methods in large samples and high dimensions. As Markov chain Monte Carlo (MCMC) algorithms are often infeasible, there is a pressing need for much faster algorithms. This article proposes a fast approach for inference in Dirichlet process mixture (DPM) models. Viewing the partitioning of subjects into clusters as a model selection problem, we propose a sequential greedy search algorithm for selecting the partition. Then, when conjugate priors are chosen, the resulting posterior conditionally on the selected partition is available in closed form. This approach allows testing of parametric models versus nonparametric alternatives based on Bayes factors. We evaluate the approach using simulation studies and compare it with four other fast nonparametric methods in the literature. We apply the proposed approach to three datasets including one from a large epidemiologic study. Matlab codes for the simulation and data analyses using the proposed approach are available online in the supplemental materials.
Project description:The stochastic block model is able to generate random graphs with different types of network partitions, ranging from the traditional assortative structures to the disassortative structures. Since the stochastic block model does not specify which mixing pattern is desired, the inference algorithms discover the locally most likely nodes' partition, regardless of its type. Here we introduce a new model constraining nodes' internal degree ratios in the objective function to guide the inference algorithms to converge to the desired type of structure in the observed network data. We show experimentally that given the regularized model, the inference algorithms, such as Markov chain Monte Carlo, reliably and quickly find the assortative or disassortative structure as directed by the value of a single parameter. In contrast, when the sought-after assortative community structure is not strong in the observed network, the traditional inference algorithms using the degree-corrected stochastic block model tend to converge to undesired disassortative partitions.
Project description:Discrete Markovian models can be used to characterize patterns in sequences of values and have many applications in biological sequence analysis, including gene prediction, CpG island detection, alignment, and protein profiling. We present ToPS, a computational framework that can be used to implement different applications in bioinformatics analysis by combining eight kinds of models: (i) independent and identically distributed process; (ii) variable-length Markov chain; (iii) inhomogeneous Markov chain; (iv) hidden Markov model; (v) profile hidden Markov model; (vi) pair hidden Markov model; (vii) generalized hidden Markov model (GHMM); and (viii) similarity-based sequence weighting. The framework includes functionality for training, simulation and decoding of the models. Additionally, it provides two methods to help with parameter setting: the Akaike and Bayesian information criteria (AIC and BIC). The models can be used stand-alone, combined in Bayesian classifiers, or included in more complex, multi-model, probabilistic architectures using GHMMs. In particular, the framework provides a novel, flexible implementation of decoding in GHMMs that detects when the architecture can be traversed efficiently.
Project description:Real-world complex networks are composed of non-random quantitative interactions. Identifying communities of nodes that tend to interact more with each other than the network as a whole is a key research focus across multiple disciplines, yet many community detection algorithms only use information about the presence or absence of interactions between nodes. Weighted modularity is a potential method for evaluating the quality of community partitions in quantitative networks. In this framework, the optimal community partition of a network can be found by searching for the partition that maximizes modularity. Attempting to find the partition that maximizes modularity is a computationally hard problem requiring the use of algorithms. QuanBiMo is an algorithm that has been proposed to maximize weighted modularity in bipartite networks. This paper introduces two new algorithms, LPAwb+ and DIRTLPAwb+, for maximizing weighted modularity in bipartite networks. LPAwb+ and DIRTLPAwb+ robustly identify partitions with high modularity scores. DIRTLPAwb+ consistently matched or outperformed QuanBiMo, while the speed of LPAwb+ makes it an attractive choice for detecting the modularity of larger networks. Searching for modules using weighted data (rather than binary data) provides a different and potentially insightful method for evaluating network partitions.
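To make the objective concrete, the sketch below evaluates a weighted bipartite modularity score for a given assignment of row and column nodes to modules. It uses the commonly cited weighted form of Barber's bipartite modularity, Q = (1/M) Σ (W_ij − r_i c_j / M) δ(g_i, h_j), where M is the total interaction weight and r_i, c_j are the row and column marginal totals. The function name and array layout are assumptions for illustration. The code only scores a partition; it does not implement LPAwb+, DIRTLPAwb+, or QuanBiMo.

```python
import numpy as np

def weighted_bipartite_modularity(W, row_modules, col_modules):
    """Score a bipartite partition of a weighted incidence matrix W.

    W[i, j] is the interaction weight between row node i and column node j;
    row_modules[i] and col_modules[j] are the module labels of those nodes.
    """
    W = np.asarray(W, dtype=float)
    total = W.sum()                                   # total interaction weight M
    # Expected weight of each pair under the null model that preserves marginals.
    expected = np.outer(W.sum(axis=1), W.sum(axis=0)) / total
    # same[i, j] is True when row i and column j share a module.
    same = np.asarray(row_modules)[:, None] == np.asarray(col_modules)[None, :]
    return float(((W - expected) * same).sum() / total)

W = [[2, 0],
     [0, 2]]
q_split = weighted_bipartite_modularity(W, [0, 1], [0, 1])   # two modules
q_merged = weighted_bipartite_modularity(W, [0, 0], [0, 0])  # everything in one module
```

For this two-by-two example the two-module assignment scores 0.5, while placing every node in a single module scores 0, which is why a maximization algorithm would prefer the split.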
Project description:BACKGROUND: Genome scale data on protein interactions are generally represented as large networks, or graphs, where hundreds or thousands of proteins are linked to one another. Since proteins tend to function in groups, or complexes, an important goal has been to reliably identify protein complexes from these graphs. This task is commonly executed using clustering procedures, which aim at detecting densely connected regions within the interaction graphs. There exists a wealth of clustering algorithms, some of which have been applied to this problem. One of the most successful clustering procedures in this context has been the Markov Cluster algorithm (MCL), which was recently shown to outperform a number of other procedures, some of which were specifically designed for partitioning protein interaction graphs. A novel promising clustering procedure termed Affinity Propagation (AP) was recently shown to be particularly effective, and much faster than other methods for a variety of problems, but has not yet been applied to partition protein interaction graphs. RESULTS: In this work we compare the performance of the Affinity Propagation (AP) and Markov Clustering (MCL) procedures. To this end we derive an unweighted network of protein-protein interactions from a set of 408 protein complexes from S. cerevisiae hand-curated in-house, and evaluate the performance of the two clustering algorithms in recalling the annotated complexes. In doing so the parameter space of each algorithm is sampled in order to select optimal values for these parameters, and the robustness of the algorithms is assessed by quantifying the level of complex recall as interactions are randomly added to or removed from the network to simulate noise. To evaluate the performance on a weighted protein interaction graph, we also apply the two algorithms to the consolidated protein interaction network of S. cerevisiae, derived from genome scale purification experiments and to versions of this network in which varying proportions of the links have been randomly shuffled. CONCLUSION: Our analysis shows that the MCL procedure is significantly more tolerant to noise and behaves more robustly than the AP algorithm. The advantage of MCL over AP is dramatic for unweighted protein interaction graphs, as AP displays severe convergence problems on the majority of the unweighted graph versions that we tested, whereas MCL continues to identify meaningful clusters, albeit fewer of them, as the level of noise in the graph increases. MCL thus remains the method of choice for identifying protein complexes from binary interaction networks.