Project description:Low complexity regions are fragments of protein sequences composed of only a few types of amino acids. These regions frequently occur in proteins and can play an important role in their functions. However, scientists are mainly focused on regions characterized by high diversity of amino acid composition. Similarity between regions of protein sequences frequently reflect functional similarity between them. In this article, we discuss strengths and weaknesses of the similarity analysis of low complexity regions using BLAST, HHblits and CD-HIT. These methods are considered to be the gold standard in protein similarity analysis and were designed for comparison of high complexity regions. However, we lack specialized methods that could be used to compare the similarity of low complexity regions. Therefore, we investigated the existing methods in order to understand how they can be applied to compare such regions. Our results are supported by exploratory study, discussion of amino acid composition and biological roles of selected examples. We show that existing methods need improvements to efficiently search for similar low complexity regions. We suggest features that have to be re-designed specifically for comparing low complexity regions: scoring matrix, multiple sequence alignment, e-value, local alignment and clustering based on a set of representative sequences. Results of this analysis can either be used to improve existing methods or to create new methods for the similarity analysis of low complexity regions.
Project description:Liquid-liquid phase separation seems to play critical roles in the compartmentalization of cells through the formation of biomolecular condensates. Many proteins with low-complexity regions are found in these condensates, and they can undergo phase separation in vitro in response to changes in temperature, pH, and ion concentration. Low-complexity regions are thus likely important players in mediating compartmentalization in response to stress. However, how the phase behavior is encoded in their amino acid composition and patterning is only poorly understood. We discuss here that polymer physics provides a powerful framework for our understanding of the thermodynamics of mixing and demixing and for how the phase behavior is encoded in the primary sequence. We propose to classify low-complexity regions further into subcategories based on their sequence properties and phase behavior. Ongoing research promises to improve our ability to link the primary sequence of low-complexity regions to their phase behavior as well as the emerging miscibility and material properties of the resulting biomolecular condensates, providing mechanistic insight into this fundamental biological process across length scales.
Project description:Full-sequence data available for Plasmodium falciparum chromosomes 2 and 3 are exploited to perform a statistical analysis of the long tracts of biased amino acid composition that characterize the vast majority of P. falciparum proteins and to make a comparison with similarly defined tracts from other simple eukaryotes. When the relatively minor subset of prevalently hydrophobic segments is discarded from the set of low-complexity segments identified by current segmentation methods in P. falciparum proteins, a good correspondence is found between prevalently hydrophilic low-complexity segments and the species-specific, rapidly diverging insertions detected by multiple-alignment procedures when sequences of bona fide homologs are available. Amino acid preferences are fairly uniform in the set of hydrophilic low-complexity segments identified in the two P. falciparum chromosomes sequenced, as well as in sequenced genes from Plasmodium berghei, but differ from those observed in Saccharomyces cerevisiae and Dictyostelium discoideum. In the two plasmodial species, amino acid frequencies do not correlate with properties such as hydrophilicity, small volume, or flexibility, which might be expected to characterize residues involved in nonglobular domains but do correlate with A-richness in codons. An effect of phenotypic selection versus neutral drift, however, is suggested by the predominance of asparagine over lysine.
Project description:Low complexity regions (LCRs) are a common feature shared by many genomes, but their evolutionary and functional significance remains mostly unknown. At the core of the uncertainty is a poor understanding of the mechanisms that regulate their retention in genomes, whether driven by natural selection or neutral evolution. Applying a comparative approach of LCRs to multiple strains and species is a powerful approach to identify patterns of conservation in these regions. Using this method, we investigate the evolutionary history of LCRs in the genus Plasmodium based on orthologous protein coding genes shared by 11 species and strains from primate and rodent-infecting pathogens. We find multiple lines of evidence in support of natural selection as a major evolutionary force shaping the composition and conservation of LCRs through time and signatures that their evolutionary paths are species specific. Our findings add a comparative analysis perspective to the debate on the evolution of LCRs and harness the power of sequence comparisons to identify potential functionally important LCR candidates.
Project description:Residues spanning distinct regions of the low-complexity domain of the RNA-binding protein, Fused in Sarcoma (FUS-LC), form fibril structures with different core morphologies. Solid-state NMR experiments show that the 214-residue FUS-LC forms a fibril with an S-bend (core-1, residues 39-95), while the rest of the protein is disordered. In contrast, the fibrils of the C-terminal variant (FUS-LC-C; residues 111-214) have a U-bend topology (core-2, residues 112-150). Absence of the U-bend in FUS-LC implies that the two fibril cores do not coexist. Computer simulations show that these perplexing findings could be understood in terms of the population of sparsely populated fibril-like excited states in the monomer. The propensity to form core-1 is higher compared to core-2. We predict that core-2 forms only in truncated variants that do not contain the core-1 sequence. At the monomer level, sequence-dependent enthalpic effects determine the relative stabilities of the core-1 and core-2 topologies.
Project description:An amyloidogenic region (AR) in a protein sequence plays a significant role in protein aggregation and amyloid formation. We have investigated the sequence complexity of AR that is present in intrinsically disordered human proteins. More than 80% human proteins in the disordered protein databases (DisProt+IDEAL) contained one or more ARs. With decrease of protein disorder, AR content in the protein sequence was decreased. A probability density distribution analysis and discrete analysis of AR sequences showed that ∼8% residue in a protein sequence was in AR and the region was in average 8 residues long. The residues in the AR were high in sequence complexity and it seldom overlapped with low complexity regions (LCR), which was largely abundant in disorder proteins. The sequences in the AR showed mixed conformational adaptability towards α-helix, β-sheet/strand and coil conformations.
Project description:Low complexity regions (LCRs) play a role in a variety of important biological processes, yet we lack a unified view of their sequences, features, relationships, and functions. Here, we use dotplots and dimensionality reduction to systematically define LCR type/copy relationships and create a map of LCR sequence space capable of integrating LCR features and functions. By defining LCR relationships across the proteome, we provide insight into how LCR type and copy number contribute to higher order assemblies, such as the importance of K-rich LCR copy number for assembly of the nucleolar protein RPA43 in vivo and in vitro. With LCR maps, we reveal the underlying structure of LCR sequence space, and relate differential occupancy in this space to the conservation and emergence of higher order assemblies, including the metazoan extracellular matrix and plant cell wall. Together, LCR relationships and maps uncover and identify scaffold-client relationships among E-rich LCR-containing proteins in the nucleolus, and revealed previously undescribed regions of LCR sequence space with signatures of higher order assemblies, including a teleost-specific T/H-rich sequence space. Thus, this unified view of LCRs enables discovery of how LCRs encode higher order assemblies of organisms.
Project description:Low complexity sequences (LCRs) are well known within coding as well as non-coding sequences. A low complexity region within a protein must be encoded by the underlying DNA sequence. Here, we examine the relationship between the entropy of the protein sequence and that of the DNA sequence which encodes it. We show that they are poorly correlated whether starting with a low complexity region within the protein and comparing it to the corresponding sequence in the DNA or by finding a low complexity region within coding DNA and comparing it to the corresponding sequence in the protein. We show this is the case within the proteomes of five model organisms: Homo sapiens, Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, and Arabidopsis thaliana. We also report a significant bias against mononucleic codons in LCR encoding sequences. By comparison with simulated proteomes, we show that highly repetitive LCRs may be explained by neutral, slippage-based evolution, but compositionally biased LCRs with cryptic repeats are not. We demonstrate that other biological biases and forces must be acting to create and maintain these LCRs. Uncovering these forces will improve our understanding of protein LCR evolution.
Project description:BACKGROUND: Regions of protein sequences with biased amino acid composition (so-called Low-Complexity Regions (LCRs)) are abundant in the protein universe. A number of studies have revealed that i) these regions show significant divergence across protein families; ii) the genetic mechanisms from which they arise lends them remarkable degrees of compositional plasticity. They have therefore proved difficult to compare using conventional sequence analysis techniques, and functions remain to be elucidated for most of them. Here we undertake a systematic investigation of LCRs in order to explore their possible functional significance, placed in the particular context of Protein-Protein Interaction (PPI) networks and Gene Ontology (GO)-term analysis. RESULTS: In keeping with previous results, we found that LCR-containing proteins tend to have more binding partners across different PPI networks than proteins that have no LCRs. More specifically, our study suggests i) that LCRs are preferentially positioned towards the protein sequence extremities and, in contrast with centrally-located LCRs, such terminal LCRs show a correlation between their lengths and degrees of connectivity, and ii) that centrally-located LCRs are enriched with transcription-related GO terms, while terminal LCRs are enriched with translation and stress response-related terms. CONCLUSIONS: Our results suggest not only that LCRs may be involved in flexible binding associated with specific functions, but also that their positions within a sequence may be important in determining both their binding properties and their biological roles.
Project description:Previous studies have found that DNA-flanking low-complexity regions (LCRs) have an increased substitution rate. Here, the substitution rate was confirmed to increase in the vicinity of LCRs in several primate species, including humans. This effect was also found among human sequences from the 1000 Genomes Project. A strong correlation was found between average substitution rate per site and distance from the LCR, as well as the proportion of genes with gaps in the alignment at each site and distance from the LCR. Along with substitution rates, dN/dS ratios were also determined for each site, and the proportion of sites undergoing negative selection was found to have a negative relationship with distance from the LCR.