Surveying the manifold divergence of an entire protein class for statistical clues to underlying biochemical mechanisms.
ABSTRACT: Certain residues have no known function yet are co-conserved across distantly related protein families and diverse organisms, suggesting that they perform critical roles associated with as-yet-unidentified molecular properties and mechanisms. This raises the question of how to obtain additional clues regarding these mysterious biochemical phenomena with a view to formulating experimentally testable hypotheses. One approach is to access the implicit biochemical information encoded within the vast amount of genomic sequence data now becoming available. Here, a new Gibbs sampling strategy is formulated and implemented that can partition hundreds of thousands of sequences within a major protein class into multiple, functionally-divergent categories based on those pattern residues that best discriminate between categories. The sampler precisely defines the partition and pattern for each category by explicitly modeling unrelated, non-functional and related-yet-divergent proteins that would otherwise obscure the analysis. To aid biological interpretation, auxiliary routines can characterize pattern residues within available crystal structures and identify those structures most likely to shed light on the roles of pattern residues. This approach can be used to define and annotate automatically subgroup-specific conserved domain profiles based on statistically-rigorous empirical criteria rather than on the subjective and labor-intensive process of manual curation. Incorporating such profiles into domain database search sites (such as the NCBI BLAST site) will provide biologists with previously inaccessible molecular information useful for hypothesis generation and experimental design. Analyses of P-loop GTPases and of AAA+ ATPases illustrate the sampler's ability to obtain such information.
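The idea of pattern residues that best discriminate between categories can be illustrated with a much simpler stand-in than the Gibbs sampler described above. The sketch below is not that sampler: it assumes the sequences are already aligned and already split into a foreground and a background set, and it ranks alignment columns by a pseudocount-smoothed log-odds score. The function name, the pseudocount and the toy sequences are all hypothetical.

```python
import math
from collections import Counter


def discriminating_columns(foreground, background, pseudocount=1.0, alphabet_size=20):
    """Rank alignment columns by how strongly the most common foreground
    residue is enriched relative to the background (log-odds score)."""
    scores = []
    for col in range(len(foreground[0])):
        fg_counts = Counter(seq[col] for seq in foreground)
        bg_counts = Counter(seq[col] for seq in background)
        residue, fg_count = fg_counts.most_common(1)[0]
        # Smoothed frequencies of that residue in each set (assumed scoring scheme).
        p_fg = (fg_count + pseudocount) / (len(foreground) + pseudocount * alphabet_size)
        p_bg = (bg_counts[residue] + pseudocount) / (len(background) + pseudocount * alphabet_size)
        scores.append((math.log(p_fg / p_bg), col, residue))
    return sorted(scores, reverse=True)


# Toy aligned sequences (hypothetical): column 1 is 'D' only in the foreground.
foreground = ["GDKT", "GDKS", "GDRT"]
background = ["GAKT", "GSKS", "GTRT", "GNKT"]
ranked = discriminating_columns(foreground, background)
best_score, best_column, best_residue = ranked[0]
```

In this toy input the top-ranked column is the one conserved in the foreground and absent from the background. A full sampler would additionally have to infer the partition itself rather than take it as given.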
Project description:Certain residues within proteins are highly conserved across very distantly related organisms, yet their (presumably critical) structural or mechanistic roles are completely unknown. To obtain clues regarding such residues within Arf and Arf-like (Arf/Arl) GTPases, which function as on/off switches regulating vesicle trafficking, phospholipid metabolism and cytoskeletal remodeling, I apply a new sampling procedure for comparative sequence analysis, termed multiple category Bayesian Partitioning with Pattern Selection (mcBPPS). The mcBPPS sampler classified sequences within the entire P-loop GTPase class into multiple categories by identifying those evolutionarily-divergent residues most likely to be responsible for functional specialization. Here I focus on categories of residues that most distinguish various Arf/Arl GTPases from other GTPases. This identified residues whose specific roles have been previously proposed (and in some cases corroborated experimentally) and that thus serve as positive controls, as well as several categories of co-conserved residues whose possible roles are first hinted at here. For example, Arf/Arl/Sar GTPases are most distinguished from other GTPases by a conserved aspartate residue within the phosphate binding loop (P-loop) and by co-conserved residues nearby that, together, can form a network of salt-bridge and hydrogen bond interactions centered on the GTPase active site. Residues corresponding to an N-[VI] motif that is conserved within Arf/Arl GTPases may play a role in the interswitch toggle characteristic of the Arf family, whereas other, co-conserved residues may modulate the flexibility of the guanine binding loop.
Arl8 GTPases conserve residues that strikingly diverge from those typically found in other Arf/Arl GTPases and that form structural interactions suggestive of a novel interswitch toggle mechanism. This analysis suggests specific mutagenesis experiments to explore mechanisms underlying GTP hydrolysis, nucleotide exchange and interswitch toggling within Arf/Arl GTPases. More generally, it illustrates how the mcBPPS sampler can complement traditional evolutionary analyses by providing an objective, quantitative and statistically rigorous way to explore protein functional divergence in molecular detail. Because the sampler classifies the input sequences at the same time, it can be used to generate subgroup profiles, in which functionally-divergent categories of residues are annotated automatically.
Project description:Visual categorization is the brain computation that reduces high-dimensional information in the visual environment into a smaller set of meaningful categories. An important problem in visual neuroscience is to identify the visual information that the brain must represent and then use to categorize visual inputs. Here we introduce a new mathematical formalism, termed space-by-time manifold decomposition, that describes this information as a low-dimensional manifold separable in space and time. We use this decomposition to characterize the representations used by observers to categorize the six classic facial expressions of emotion (happy, surprise, fear, disgust, anger, and sad). By means of a Generative Face Grammar, we presented random dynamic facial movements on each experimental trial and used subjective human perception to identify the facial movements that correlate with each emotion category. When the random movements projected onto the categorization manifold region corresponding to one of the emotion categories, observers categorized the stimulus accordingly; otherwise they selected "other." Using this information, we determined both the Action Unit and temporal components whose linear combinations lead to reliable categorization of each emotion. In a validation experiment, we confirmed the psychological validity of the resulting space-by-time manifold representation. Finally, we demonstrated the importance of temporal sequencing for accurate emotion categorization and identified the temporal dynamics of Action Unit components that cause typical confusions between specific emotions (e.g., fear and surprise) as well as those resolving these confusions.
Project description:Haplotypic sequences contain significantly more information than genotypes of genetic markers and are critical for studying disease association and genome evolution. Current methods for obtaining haplotypic sequences require the physical separation of alleles before sequencing, are time-consuming and are not scalable for large surveys of genetic variation. We have developed a novel method for acquiring haplotypic sequences from long PCR products using simple, high-throughput techniques. This method applies modified shotgun sequencing protocols to sequence both alleles concurrently, with read-pair information allowing the two alleles to be separated during sequence assembly. Although the haplotypic sequences can be assembled manually from the resultant data using pre-existing sequence assembly software, we have devised a novel heuristic algorithm to automate assembly and remove human error. We validated the approach on two long PCR products amplified from the human genome and confirmed the accuracy of our sequences against full-length clones of the same alleles. This method presents a simple high-throughput means to obtain full haplotypic sequences potentially up to 20 kb in length and is suitable for surveying genetic variation even in poorly-characterized genomes, as it requires no prior information on sequence variation.
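How read-pair information can separate two alleles is easiest to see in a toy setting. The sketch below is not the heuristic assembly algorithm mentioned in the abstract. It assumes heterozygous sites are already known and numbered, that each read pair is reduced to a mapping from site index to observed allele (0 or 1), and that phase is propagated by a simple majority vote between linked sites. All names and data are hypothetical.

```python
from collections import defaultdict


def phase_sites(fragments, n_sites):
    """Assign each heterozygous site an allele on haplotype A (0 or 1),
    propagating phase from site 0 through sites linked by read pairs."""
    # votes[(i, j)] > 0: sites i and j usually show the same allele label
    # on one fragment; < 0: they usually show opposite labels.
    votes = defaultdict(int)
    for frag in fragments:
        sites = sorted(frag)
        for a in range(len(sites)):
            for b in range(a + 1, len(sites)):
                i, j = sites[a], sites[b]
                votes[(i, j)] += 1 if frag[i] == frag[j] else -1

    haplotype = {0: 0}  # arbitrary anchor: haplotype A carries allele 0 at site 0
    frontier = [0]
    while frontier:
        current = frontier.pop()
        for (i, j), vote in votes.items():
            if vote == 0:
                continue  # uninformative link
            if i == current and j not in haplotype:
                other = j
            elif j == current and i not in haplotype:
                other = i
            else:
                continue
            haplotype[other] = haplotype[current] if vote > 0 else 1 - haplotype[current]
            frontier.append(other)
    # Sites never linked to site 0 stay unphased (None).
    return [haplotype.get(site) for site in range(n_sites)]


# Hypothetical read pairs, each covering two heterozygous sites.
fragments = [{0: 0, 1: 1}, {0: 1, 1: 0}, {1: 1, 2: 1}, {1: 0, 2: 0}, {2: 1, 3: 0}]
haplotype_a = phase_sites(fragments, 4)
haplotype_b = [None if allele is None else 1 - allele for allele in haplotype_a]
```

The second haplotype is simply the complement of the first at every phased site. Real data would also require handling sequencing errors and conflicting links, which this sketch ignores.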
Project description:When allocating resources, people often diversify across categories even when those categories are arbitrary, such that allocations differ when identical sets of options are partitioned differently ("partition dependence"). The first goal of the present work (Experiment 1) was to replicate an experiment by Fox and colleagues in which graduate students exhibited partition dependence when asked how university financial aid should be allocated across arbitrarily partitioned income brackets. Our sample consisted of community members at a liberal arts college where financial aid practices have been recent topics of debate. Because stronger intrinsic preferences can reduce partition dependence, these participants might display little partition dependence with financial aid allocations. Alternatively, a demonstration of strong partition dependence in this population would emphasize the robustness of the effect. The second goal was to extend a "high transparency" modification to the present task context (Experiment 2) in which participants were shown both possible income partitions and randomly assigned themselves to one, to determine whether partition dependence in this paradigm would be reduced by revealing the study design (and the arbitrariness of income categories). Participants demonstrated clear partition dependence in both experiments. Results demonstrate the robustness of partition dependence in this context.
Project description:The multiple sequence alignment (MSA) of a protein family provides a wealth of information in terms of the conservation pattern of amino acid residues not only at each alignment site but also between distant sites. In order to statistically model the MSA incorporating both short-range and long-range correlations as well as insertions, I have derived a lattice gas model of the MSA based on the principle of maximum entropy. The partition function, obtained by the transfer matrix method with a mean-field approximation, accounts for all possible alignments with all possible sequences. The model parameters for short-range and long-range interactions were determined by a self-consistent condition and by a Gaussian approximation, respectively. Using this model with and without long-range interactions, I analyzed the globin and V-set domains by increasing the "temperature" and by "mutating" a site. The correlations between residue conservation and various measures of the system's stability indicate that the long-range interactions make the conservation pattern more specific to the structure, and increasingly stabilize better conserved residues.
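The transfer matrix method named above can be shown on a far simpler system than the alignment model. The sketch below assumes a one-dimensional open chain of two-state sites (occupancy 0 or 1) with a nearest-neighbour coupling and a chemical potential. It is a generic textbook lattice gas, not the model derived in the abstract, and it includes neither the long-range interactions nor the mean-field approximation. The result is checked against brute-force enumeration.

```python
import itertools
import math


def partition_function_transfer(n_sites, beta, coupling, mu):
    """Partition function of an open 1D lattice gas, computed by sweeping
    a 2x2 transfer matrix along the chain."""
    site = [math.exp(beta * mu * n) for n in (0, 1)]
    bond = [[math.exp(beta * coupling * n * m) for m in (0, 1)] for n in (0, 1)]
    weights = site[:]  # partial sums, indexed by the state of the last site
    for _ in range(n_sites - 1):
        weights = [
            sum(weights[n] * bond[n][m] for n in (0, 1)) * site[m]
            for m in (0, 1)
        ]
    return sum(weights)


def partition_function_brute(n_sites, beta, coupling, mu):
    """Same quantity by explicit enumeration of all 2**n_sites configurations."""
    total = 0.0
    for config in itertools.product((0, 1), repeat=n_sites):
        pair_term = sum(a * b for a, b in zip(config, config[1:]))
        energy = -mu * sum(config) - coupling * pair_term
        total += math.exp(-beta * energy)
    return total


z_transfer = partition_function_transfer(8, beta=1.0, coupling=0.5, mu=-0.2)
z_brute = partition_function_brute(8, beta=1.0, coupling=0.5, mu=-0.2)
```

The transfer sweep costs time linear in the chain length, whereas enumeration grows exponentially, which is why the method is attractive for sums over all sequences and alignments.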
Project description:Minimal hardware implementations able to cope with the processing of large amounts of data in reasonable times are highly desired in our information-driven society. In this work we review the application of stochastic computing to probabilistic-based pattern-recognition analysis of huge database sets. The proposed technique consists in the hardware implementation of a parallel architecture implementing a similarity search of data with respect to different pre-stored categories. We design pulse-based stochastic-logic blocks to obtain an efficient pattern recognition system. The proposed architecture speeds up the screening process of huge databases by a factor of 7 when compared to a conventional digital implementation using the same hardware area.
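The basic principle of stochastic computing can be simulated in software. In this sketch a probability is encoded as a random bit stream whose fraction of ones equals that probability, and two independent streams are multiplied with a bitwise AND. It illustrates the general technique only, not the pulse-based blocks or the parallel architecture of the abstract. The stream length, seed and input values are arbitrary assumptions.

```python
import random


def encode(probability, length, rng):
    """Encode a value in [0, 1] as a bit stream with that fraction of ones (on average)."""
    return [1 if rng.random() < probability else 0 for _ in range(length)]


def decode(bits):
    """Recover the encoded value as the observed fraction of ones."""
    return sum(bits) / len(bits)


def stochastic_multiply(stream_a, stream_b):
    """Multiply two independently encoded values with one AND gate per bit."""
    return [a & b for a, b in zip(stream_a, stream_b)]


rng = random.Random(0)  # fixed seed so the simulation is repeatable
length = 20000
stream_a = encode(0.6, length, rng)
stream_b = encode(0.5, length, rng)
product_estimate = decode(stochastic_multiply(stream_a, stream_b))
```

The estimate approaches 0.6 × 0.5 = 0.3 as the streams get longer. The appeal for hardware is that the multiplier is a single logic gate, at the cost of precision that improves only with stream length.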
Project description:Nucleotide binding domains (NBDs) of the multidrug transporter of Candida albicans, CaCdr1p, possess unique divergent amino acids in their conserved motifs. For example, NBD1 (N-terminal-NBD) possesses a conserved signature motif, whereas the same motif is divergent in NBD2 (C-terminal-NBD). In this study, we have evaluated the contribution of these conserved and divergent signature motifs of CaCdr1p to ATP catalysis and drug transport. By employing site-directed mutagenesis, we made three categories of mutant variants. These included mutants in which all the signature motif residues were replaced with alanines, and mutants with exchanged equipositional residues to mimic the conservation and degeneracy of the opposite domain. In addition, we made a set of mutants in which the signature motifs were swapped, giving variants in which both signature motifs were either conserved or degenerate. We observed that mutants of the conserved and equipositional residues of NBD1 and NBD2, as well as the swapped signature motif mutants, showed high susceptibility to all the tested drugs with simultaneous abrogation of ATPase and R6G efflux activities. However, some of the mutants displayed a selective increase in susceptibility to the drugs. Notably, the mutant variants did not differ from WT-CaCdr1p in drug and nucleotide binding. Our mutational analyses show not only that certain conserved residues of the NBD1 signature sequence (S304, G306, and E307) are important in ATP hydrolysis and R6G efflux but also that a few divergent residues (N1002 and E1004) of the NBD2 signature motif have evolved to be functionally relevant and are not interchangeable. Taken together, our data suggest that the signature motifs of CaCdr1p, whether divergent or conserved, are nonexchangeable and are functionally critical for ATP hydrolysis.
Project description:Sea cucumbers are prolific producers of a wide range of bioactive compounds. This study aimed to purify and characterize one class of compound, the saponins, from the viscera of the Australian sea cucumber Holothuria lessoni. The saponins were obtained by ethanolic extraction of the viscera and enriched by a liquid-liquid partition process and adsorption column chromatography. A high performance centrifugal partition chromatography (HPCPC) was applied to the saponin-enriched mixture to obtain saponins with high purity. The resultant purified saponins were profiled using MALDI-MS/MS and ESI-MS/MS which revealed the structure of isomeric saponins to contain multiple aglycones and/or sugar residues. We have elucidated the structure of five novel saponins, Holothurins D/E and Holothurinosides X/Y/Z, along with seven reported triterpene glycosides, including sulfated and non-sulfated saponins containing a range of aglycones and sugar moieties, from the viscera of H. lessoni. The abundance of novel compounds from this species holds promise for biotechnological applications.
Project description:Quantitative interpretation and prediction of Hofmeister ion effects on protein processes, including folding and crystallization, have been elusive goals of a century of research. Here, a quantitative thermodynamic analysis, developed to treat noncoulombic interactions of solutes with biopolymer surface and recently extended to analyze the effects of Hofmeister salts on the surface tension of water, is applied to literature solubility data for small hydrocarbons and model peptides. This analysis allows us to obtain a minimum estimate of the hydration b1 (H2O Å^-2) of hydrocarbon surface and partition coefficients Kp, characterizing the distribution of salts and salt ions between this hydration water and bulk water. Assuming that Na+ and SO4^2- ions of Na2SO4 (the salt giving the largest reduction in hydrocarbon solubility as well as the largest increase in surface tension) are fully excluded from the hydration water at hydrocarbon surface, we obtain the same b1 as for air-water surface (approximately 0.18 H2O Å^-2). Rank orders of cation and anion partition coefficients for nonpolar surface follow the Hofmeister series for protein processes, but are strongly offset for cations in the direction of exclusion (preferential hydration). By applying a coarse-grained decomposition of water accessible surface area (ASA) into nonpolar, polar amide, and other polar surface and the same hydration b1 to interpret peptide solubility increments, we determine salt partition coefficients for amide surface. These partition coefficients are separated into single-ion contributions based on the observation that both Cl- and Na+ (also K+) occupy neutral positions in the middle of the anion and cation Hofmeister series for protein folding. Independent of this assignment, we find that all cations investigated are strongly accumulated at amide surface while most anions are excluded.
Cation and anion effects are independent and additive, allowing successful prediction of Hofmeister salt effects on micelle formation and other processes from structural information (ASA).
Project description:OBJECTIVE: We investigate surname affinities among areas of modern-day China by constructing a spatial network and performing community detection. The study reports a geographical genealogy of the Chinese population that results from population origins, historical migrations, and societal evolution. MATERIALS AND METHODS: We acquire data from the census records supplied by China's National Citizen Identity Information System, including the surname and regional information of 1.28 billion registered Chinese citizens. We propose a multilayer minimum spanning tree (MMST) to construct a spatial network based on the matrix of isonymic distances, which is often used to characterize the dissimilarity of surname structure among areas. We use the fast unfolding algorithm to detect network communities. RESULTS: We obtain a 10-layer MMST network of 362 prefecture nodes and 3,610 edges derived from the matrix of the Euclidean distances among these areas. These prefectures are divided into eight groups in the spatial network via community detection. We evaluate the partition by comparing the inter-distances and intra-distances of the communities and obtain a meaningful regional ethnicity classification. DISCUSSION: The visualization of the resulting communities on the map indicates that the prefectures in the same community are usually geographically adjacent. The formation of this partition is influenced by geographical factors, historic migrations, trade and economic factors, as well as isolation of culture and language. The MMST algorithm proves to be effective in geo-genealogy and ethnicity classification because it retains essential information about surname affinity and highlights the geographical consanguinity of the population.
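A multilayer minimum spanning tree can be sketched under one stated assumption: that each layer is a minimum spanning tree computed on the edges not used by earlier layers. The abstract does not spell out the construction, but this reading is consistent with the reported counts, since a spanning tree on 362 nodes has 361 edges and 10 such layers give 3,610. The sketch uses Kruskal's algorithm with a union-find structure on a small made-up distance matrix. The function names and data are hypothetical, and the isonymic distances themselves are taken as given.

```python
def kruskal_forest(n_nodes, edges):
    """Return a minimum spanning forest (list of (i, j) pairs) via Kruskal's algorithm."""
    parent = list(range(n_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    for _, i, j in sorted(edges):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            chosen.append((i, j))
    return chosen


def multilayer_mst(distance, n_layers):
    """Build n_layers edge-disjoint layers; each layer is a minimum spanning
    forest of the edges left over from previous layers (assumed construction)."""
    n_nodes = len(distance)
    remaining = {(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)}
    layers = []
    for _ in range(n_layers):
        edges = [(distance[i][j], i, j) for (i, j) in remaining]
        layer = kruskal_forest(n_nodes, edges)
        layers.append(layer)
        remaining -= set(layer)
    return layers


# Hypothetical areas placed on a line; distance is the absolute difference.
positions = [0, 1, 3, 7, 15]
distance = [[abs(a - b) for b in positions] for a in positions]
layers = multilayer_mst(distance, 2)
```

The union of all layers forms the network on which community detection would then be run. On small or sparse inputs a later layer may be a forest rather than a single tree, which the sketch tolerates.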