A linear programming approach for estimating the structure of a sparse linear genetic network from transcript profiling data.
ABSTRACT: BACKGROUND: A genetic network can be represented as a directed graph in which a node corresponds to a gene and a directed edge specifies the direction of influence of one gene on another. The reconstruction of such networks from transcript profiling data remains an important yet challenging endeavor. A transcript profile specifies the abundances of many genes in a biological sample of interest. Prevailing strategies for learning the structure of a genetic network from high-dimensional transcript profiling data assume sparsity and linearity. Many methods consider relatively small directed graphs, inferring graphs with up to a few hundred nodes. This work examines large undirected graph representations of genetic networks, graphs with many thousands of nodes where an undirected edge between two nodes does not indicate the direction of influence, and the problem of estimating the structure of such a sparse linear genetic network (SLGN) from transcript profiling data. RESULTS: The structure learning task is cast as a sparse linear regression problem, which is then posed as a LASSO (l1-constrained fitting) problem and finally solved by formulating a Linear Program (LP). A bound on the Generalization Error of this approach is given in terms of the Leave-One-Out Error. The accuracy and utility of LP-SLGNs are assessed quantitatively and qualitatively using simulated and real data. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) initiative provides gold standard data sets and evaluation metrics that enable and facilitate the comparison of algorithms for deducing the structure of networks. The structures of LP-SLGNs estimated from the INSILICO1, INSILICO2 and INSILICO3 simulated DREAM2 data sets are comparable to those proposed by the first and/or second ranked teams in the DREAM2 competition.
The structures of LP-SLGNs estimated from two published Saccharomyces cerevisiae cell cycle transcript profiling data sets capture known regulatory associations. In each S. cerevisiae LP-SLGN, the number of nodes with a particular degree follows an approximate power law, suggesting that its degree distribution is similar to that observed in real-world networks. Inspection of these LP-SLGNs suggests biological hypotheses amenable to experimental verification. CONCLUSION: A statistically robust and computationally efficient LP-based method for estimating the topology of a large sparse undirected graph from high-dimensional data yields representations of genetic networks that are biologically plausible and useful abstractions of the structures of real genetic networks. Analysis of the statistical and topological properties of learned LP-SLGNs may have practical value; for example, genes with high random walk betweenness, a measure of the centrality of a node in a graph, are good candidates for intervention studies and hence integrated computational-experimental investigations designed to infer more realistic and sophisticated probabilistic directed graphical model representations of genetic networks. The LP-based solutions of the sparse linear regression problem described here may provide a method for learning the structure of transcription factor networks from transcript profiling and transcription factor binding motif data.
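As a rough illustration of how an l1-constrained fit can be written as a linear program, the sketch below assumes an absolute-error loss; the abstract does not state the loss or the exact constraints used, so this is one generic formulation rather than the authors' own. All symbols are assumptions introduced here: y_i is the abundance of one target gene in sample i, x_i is the vector of abundances of the other p genes in that sample, w holds the regression weights (nonzero entries would correspond to edges), t is the l1 budget, and ξ_i and u_j are auxiliary variables that bound the residuals and the weight magnitudes.

```latex
\begin{aligned}
\min_{w,\,\xi,\,u}\quad & \sum_{i=1}^{n} \xi_i \\
\text{s.t.}\quad & -\xi_i \le y_i - x_i^{\top} w \le \xi_i, \qquad i = 1,\dots,n, \\
& -u_j \le w_j \le u_j, \qquad j = 1,\dots,p, \\
& \sum_{j=1}^{p} u_j \le t .
\end{aligned}
```

At an optimum each ξ_i equals |y_i - x_i^T w|, and the last two constraint families enforce ||w||_1 <= t, so the objective and every constraint are linear in (w, ξ, u).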
Project description:In directed graphs, relationships are asymmetric and these asymmetries contain essential structural information about the graph. Directed relationships lead to a new type of clustering that is not feasible in undirected graphs. We propose a spectral co-clustering algorithm called di-sim for asymmetry discovery and directional clustering. A Stochastic co-Blockmodel is introduced to show favorable properties of di-sim. To account for the sparse and highly heterogeneous nature of directed networks, di-sim uses the regularized graph Laplacian and projects the rows of the eigenvector matrix onto the sphere. A nodewise asymmetry score and di-sim are used to analyze the clustering asymmetries in the networks of Enron emails, political blogs, and the Caenorhabditis elegans chemical connectome. In each example, a subset of nodes have clustering asymmetries; these nodes send edges to one cluster, but receive edges from another cluster. Such nodes yield insightful information (e.g., communication bottlenecks) about directed networks, but are missed if the analysis ignores edge direction.
Project description:Biological network data, such as metabolic, signaling or physical interaction graphs of proteins, are increasingly available in public repositories for important species. Tools for the quantitative analysis of these networks are being developed today. Protein network-based drug target identification methods usually return protein hubs with large degrees in the networks as potentially important targets. Some known, important protein targets, however, are not hubs at all, and perturbing protein hubs in these networks may have several unwanted physiological effects, due to their interaction with numerous partners. Here, we show a novel method applicable in networks with directed edges (such as metabolic networks) that compensates for the low degree (non-hub) vertices in the network, and identifies important nodes, regardless of their hub properties. Our method computes the PageRank for the nodes of the network, and divides the PageRank by the in-degree (i.e., the number of incoming edges) of the node. This quotient is the same in all nodes in an undirected graph (even for large- and low-degree nodes, that is, for hubs and non-hubs as well), but may differ significantly from node to node in directed graphs. We suggest assigning importance to non-hub nodes with a large PageRank/in-degree quotient. Consequently, our method gives high scores to nodes with large PageRank relative to their degrees; therefore, important non-hub nodes can easily be identified in large networks. We demonstrate that these relatively high PageRank scores have biological relevance: the method correctly finds numerous already validated drug targets in distinct organisms (Mycobacterium tuberculosis, Plasmodium falciparum and MRSA Staphylococcus aureus), and consequently, it may suggest new possible protein targets as well.
Additionally, our scoring method was not chosen arbitrarily: its value for all nodes of all undirected graphs is constant; therefore its high value captures importance in the directed edge structure of the graph.
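The PageRank/in-degree quotient described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the dictionary-of-successors graph format, the damping value, the iteration count and the handling of dangling nodes are all assumptions made here. Nodes with no incoming edges are left out of the result because their quotient is undefined.

```python
def pagerank(out_edges, damping=0.85, iterations=200):
    """Power-iteration PageRank on a graph given as {node: [successors]}.

    Every node must appear as a key. Hypothetical helper for illustration.
    """
    nodes = list(out_edges)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Teleportation share received by every node.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out_edges[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank uniformly (an assumption).
                share = damping * rank[v] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank


def pagerank_per_in_degree(out_edges, damping=0.85, iterations=200):
    """Return {node: PageRank / in-degree} for nodes with in-degree > 0."""
    rank = pagerank(out_edges, damping, iterations)
    in_degree = {v: 0 for v in out_edges}
    for targets in out_edges.values():
        for t in targets:
            in_degree[t] += 1
    return {v: rank[v] / in_degree[v] for v in out_edges if in_degree[v] > 0}


# Undirected toy graph: every edge is listed in both directions.
undirected = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
# Directed toy graph: node "d" has no incoming edge and is skipped.
directed = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["c"]}

constant_scores = pagerank_per_in_degree(undirected, damping=1.0)
varying_scores = pagerank_per_in_degree(directed)
```

With `damping=1.0` (a plain random walk, chosen here so the property holds exactly) the stationary probability of a node in a connected, non-bipartite undirected graph is proportional to its degree, so `constant_scores` holds the same value for every node. With a uniform teleportation term the quotient is in general only approximately constant on undirected graphs. On the directed toy graph the scores differ from node to node.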
Project description:We introduce a framework for the discovery of dominant relationship patterns in complex networks, by compressing the networks into power graphs with overlapping power nodes. When paired with enrichment analysis of node classification terms, the most compressible sets of edges provide a highly informative sketch of the dominant relationship patterns that define the network. In addition, this procedure also gives rise to a novel, link-based definition of overlapping node communities in which nodes are defined by their relationships with sets of other nodes, rather than through connections within the community. We show that this completely general approach can be applied to undirected, directed, and bipartite networks, yielding valuable insights into the large-scale structure of real-world networks, including social networks and food webs. Our approach therefore provides a novel way in which network architecture can be studied, defined and classified.
Project description:With protein or gene interaction systems as the background, this paper proposes an evolving model of biological undirected networks that is consistent with some plausible mechanisms in biology. By introducing a rule of preferential duplication of a node inversely proportional to the degree of existing nodes, and a node age attribute (the older, the more influence) by which the probability of a node receiving re-wiring links is chosen, the model networks generated under certain parameter conditions reproduce a series of statistical topological characteristics of real biological graphs, including the scale-free feature, the small-world effect, hierarchical modularity, limited structural robustness, and disassortativity of degree-degree correlation.
Project description:BACKGROUND: Metabolic networks reflect the relationships between metabolites (biomolecules) and enzymes (proteins), and are of particular interest since they describe all chemical reactions of an organism. Metabolic networks are constructed from the genome sequence of an organism, and the graphs can be used to study fluxes through the reactions, or to relate the graph structure to environmental characteristics and phenotypes. About ten years ago, Takemoto et al. (2007) stated that the structure of prokaryotic metabolic networks, represented as undirected graphs, is correlated with their living environment. Although metabolic networks are naturally directed graphs, they are still usually analysed as undirected graphs. RESULTS: We implemented a pipeline to reconstruct metabolic networks from genome data and confirmed some of the results of Takemoto et al. (2007) with today's data using up-to-date databases. However, Takemoto et al. (2007) used only a fraction of all available enzymes from the genome, and when all the enzymes are taken into account we fail to reproduce the main results. Therefore, we introduce three robust measures on directed representations of graphs, which lead to similar results regardless of the method of network reconstruction. We show that the size of the largest strongly connected component, the flow hierarchy and the Laplacian spectrum are strongly correlated with the environmental conditions. CONCLUSIONS: We found a significant negative correlation between the size of the largest strongly connected component (a cycle) and the optimal growth temperature of the considered prokaryotes. This relationship holds true for the spectrum, high temperature being associated with lower eigenvalues. The flow hierarchy shows a negative correlation with optimal growth temperature. This suggests that the dynamical properties of the network are dependent on environmental factors.
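One of the three directed measures named above, the size of the largest strongly connected component, can be computed with a standard two-pass depth-first search (Kosaraju's algorithm). The sketch below is illustrative only: the function name and the dictionary-of-successors input format are assumptions made here, and the abstract does not say which algorithm or software the authors used.

```python
def largest_scc_size(out_edges):
    """Size of the largest strongly connected component of a directed graph.

    out_edges maps each node to a list of successors; nodes that only
    appear as successors are added automatically. Hypothetical helper.
    """
    nodes = set(out_edges)
    for targets in out_edges.values():
        nodes.update(targets)
    forward = {v: list(out_edges.get(v, [])) for v in nodes}
    reverse = {v: [] for v in nodes}
    for v, targets in forward.items():
        for t in targets:
            reverse[t].append(v)

    # Pass 1: iterative DFS on the forward graph, recording finish order.
    visited = set()
    order = []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(forward[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(forward[nxt])))
                    break
            else:
                # All successors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finish order;
    # each sweep collects exactly one strongly connected component.
    assigned = set()
    best = 0
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        size = 0
        todo = [start]
        while todo:
            node = todo.pop()
            size += 1
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        best = max(best, size)
    return best


# Toy reaction graph: A, B and C form a cycle; D and E hang off it.
toy_network = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": ["E"]}
toy_largest = largest_scc_size(toy_network)
```

In the toy graph the cycle A, B, C is the only component with more than one node, so it is the largest. The search is written iteratively so that large graphs do not hit Python's recursion limit.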
Project description:Population structure can be modeled by evolutionary graphs, which can have a substantial influence on the fate of mutants. Individuals are located on the nodes of these graphs, competing to take over the graph via the links. Applications for this framework range from the ecology of river systems and cancer initiation in colonic crypts to biotechnological search for optimal mutations. In all these applications, both the probability of fixation and the associated time are of interest. We study this problem for all undirected and unweighted graphs up to a certain size. We devise a genetic algorithm to find graphs with high or low fixation probability and short or long fixation time and study their structure searching for common themes. Our work unravels structural properties that maximize or minimize fixation probability and time, which allows us to contribute to a first map of the universe of evolutionary graphs.
Project description:Locating sources of diffusion and spreading from minimum data is a significant problem in network science with great applied value to society. However, a general theoretical framework dealing with optimal source localization is lacking. Combining the controllability theory for complex networks and compressive sensing, we develop a framework with high efficiency and robustness for optimal source localization in arbitrary weighted networks with arbitrary distribution of sources. We offer a minimum output analysis to quantify the source locatability through a minimal number of messenger nodes that produce sufficient measurement for fully locating the sources. When the minimum messenger nodes are discerned, the problem of optimal source localization becomes one of sparse signal reconstruction, which can be solved using compressive sensing. Application of our framework to model and empirical networks demonstrates that sources in homogeneous and denser networks are more readily located. A surprising finding is that, for a connected undirected network with random link weights and weak noise, a single messenger node is sufficient for locating any number of sources. The framework deepens our understanding of the network source localization problem and offers efficient tools with broad applications.
Project description:Exponential random graph models (ERGMs) are widely used for modeling social networks observed at one point in time. However, the computational difficulty of ERGM parameter estimation has limited the practical application of this class of models to relatively small networks, up to a few thousand nodes at most, with usually only a few hundred nodes or fewer. In the case of undirected networks, snowball sampling can be used to find ERGM parameter estimates of larger networks via network samples, and recently published improvements in ERGM network distribution sampling and ERGM estimation algorithms have allowed ERGM parameter estimates of undirected networks with over one hundred thousand nodes to be made. However, the implementations of these algorithms to date have been limited in their scalability, and also restricted to undirected networks. Here we describe an implementation of the recently published Equilibrium Expectation (EE) algorithm for ERGM parameter estimation of large directed networks. We test it on some simulated networks, and demonstrate its application to an online social network with over 1.6 million nodes.
Project description:Coordinated patterns of cortical morphology have been described as structural graphs and previous research has demonstrated that properties of such graphs are altered in Alzheimer's disease (AD). However, it remains unknown how these alterations are related to cognitive deficits in individuals, as such graphs are restricted to group-level analysis. In the present study we investigated this question in single-subject grey matter networks. This new method extracts large-scale structural graphs where nodes represent small cortical regions that are connected by edges when they show statistical similarity. Using this method, unweighted and undirected networks were extracted from T1 weighted structural magnetic resonance imaging scans of 38 AD patients (19 female, average age 72±4 years) and 38 controls (19 females, average age 72±4 years). Group comparisons of standard graph properties were performed after correcting for grey matter volumetric measurements and were correlated to scores of general cognitive functioning. AD networks were characterised by a more random topology as indicated by a decreased small world coefficient (p = 3.53×10^-5), decreased normalized clustering coefficient (p = 7.25×10^-6) and decreased normalized path length (p = 1.91×10^-7). Reduced normalized path length explained significantly (p = 0.004) more variance in measurements of general cognitive decline (32%) in comparison to volumetric measurements (9%). Altered path length of the parahippocampal gyrus, hippocampus, fusiform gyrus and precuneus showed the strongest relationship with cognitive decline. The present results suggest that single-subject grey matter graphs provide a concise quantification of cortical structure that has clinical value, which might be of particular importance for disease prognosis. These findings contribute to a better understanding of structural alterations and cognitive dysfunction in AD.
Project description:Systems genetic studies have been used to identify genetic loci that affect transcript abundances and clinical traits such as body weight. The pairwise correlations between gene expression traits and/or clinical traits can be used to define undirected trait networks. Several authors have argued that genetic markers (e.g., expression quantitative trait loci, eQTLs) can serve as causal anchors for orienting the edges of a trait network. The availability of hundreds of thousands of genetic markers poses new challenges: how to relate (anchor) traits to multiple genetic markers, how to score the genetic evidence in favor of an edge orientation, and how to weigh the information from multiple markers. We develop and implement Network Edge Orienting (NEO) methods and software that address the challenges of inferring unconfounded and directed gene networks from microarray-derived gene expression data by integrating mRNA levels with genetic marker data and Structural Equation Model (SEM) comparisons. The NEO software implements several manual and automatic methods for incorporating genetic information to anchor traits. The networks are oriented by considering each edge separately, thus reducing error propagation. To summarize the genetic evidence in favor of a given edge orientation, we propose Local SEM-based Edge Orienting (LEO) scores that compare the fit of several competing causal graphs. SEM fitting indices allow the user to assess local and overall model fit. The NEO software allows the user to carry out a robustness analysis with regard to genetic marker selection. We demonstrate the utility of NEO by recovering known causal relationships in the sterol homeostasis pathway using liver gene expression data from an F2 mouse cross.
Further, we use NEO to study the relationship between a disease gene and a biologically important gene co-expression module in liver tissue. The NEO software can be used to orient the edges of gene co-expression networks or quantitative trait networks if the edges can be anchored to genetic marker data. R software tutorials, data, and supplementary material can be downloaded from: http://www.genetics.ucla.edu/labs/horvath/aten/NEO.