ABSTRACT: Protein backbones have characteristic secondary structures, including alpha-helices and beta-sheets. Which structure is adopted locally is strongly biased by the local amino acid sequence of the protein. Accurate (probabilistic) mappings from sequence to structure are valuable for both secondary-structure prediction and protein design. For the case of alpha-helix caps, we test whether the information content of the sequence-structure mapping can be self-consistently improved by using a relaxed definition of the structure. We derive helix-cap sequence motifs using database helix assignments for proteins of known structure. These motifs are refined using Gibbs sampling in competition with a null motif. Then Gibbs sampling is repeated, allowing for frameshifts of +/-1 amino acid residue, in order to find sequence motifs of higher total information content. All helix-cap motifs were found to have good generalization capability, as judged by training on a small set of non-redundant proteins and testing on a larger set. For overall prediction purposes, frameshift motifs using all training examples yielded the best results. Frameshift motifs using a fraction of all training examples performed best in terms of true positives among top predictions. However, motifs without frameshifts also performed well, despite a roughly one-third lower total information content.
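The "total information content" of a sequence motif mentioned in the abstract above can be illustrated with a short sketch. This is not the authors' code: it assumes a uniform background over the 20 standard amino acids, no pseudocounts, and hypothetical function names, and it simply sums the per-column relative entropy (in bits) over an ungapped alignment of motif instances.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_information(column, background=None):
    """Relative entropy (bits) of one alignment column versus the background."""
    if background is None:
        # Assumption: uniform background frequency of 1/20 per amino acid.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    counts = Counter(column)
    total = sum(counts.values())
    info = 0.0
    for aa, count in counts.items():
        p = count / total
        info += p * math.log2(p / background[aa])
    return info


def motif_information(aligned_sequences, background=None):
    """Total information content: the sum of column information over the motif."""
    lengths = {len(s) for s in aligned_sequences}
    if len(lengths) != 1:
        raise ValueError("all motif instances must have the same length")
    columns = zip(*aligned_sequences)
    return sum(column_information(col, background) for col in columns)


if __name__ == "__main__":
    # Two toy motif instances: columns 1 and 2 are conserved, column 3 is split.
    print(motif_information(["ACD", "ACE"]))
```

In a frameshift-tolerant refinement, one would compare this total for alignments in which individual instances are shifted by one residue in either direction and keep the shifts that raise it; the Gibbs sampling itself is not reproduced here.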
Project description: The alignment of DNA sequences to proteins, allowing for frameshifts, is a classic method in sequence analysis. It can help identify pseudogenes (which accumulate mutations), analyze raw DNA and RNA sequence data (which may have frameshift sequencing errors), investigate ribosomal frameshifts, etc. Often, however, only ad hoc approximations or simulations are available to provide the statistical significance of a frameshift alignment score. We describe a method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics. (BLAST presently does not permit its alignments to include frameshifts.) We also illustrate the continuing usefulness of frameshift alignment with two 'post-genomic' applications: (i) when finding pseudogenes within the human genome, frameshift alignments show that most anciently conserved non-coding human elements are recent pseudogenes with conserved ancestral genes; and (ii) when analyzing metagenomic DNA reads from polluted soil, frameshift alignments show that most alignable metagenomic reads contain frameshifts, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results.
Project description: Our goal was to identify evolutionarily conserved frame transitions in protein coding regions and to uncover an underlying functional role of these structural aberrations. We used the ab initio frameshift prediction program, GeneTack, to detect reading frame transitions in 206 991 genes (fs-genes) from 1106 complete prokaryotic genomes. We grouped 102 731 fs-genes into 19 430 clusters based on sequence similarity between protein products (fs-proteins) as well as conservation of the predicted position of the frameshift and its direction. We identified 4010 pseudogene clusters and 146 clusters of fs-genes apparently using recoding (local deviation from the standard genetic code) due to possessing specific sequence motifs near frameshift positions. Particularly interesting was the finding of a novel type of organization of the dnaX gene, where recoding is required for synthesis of the longer subunit, τ. We selected 20 clusters of predicted recoding candidates and designed a series of genetic constructs with a reporter gene or affinity tag whose expression would require a frameshift event. Expression of the constructs in Escherichia coli demonstrated enrichment of the set of candidates with sequences that trigger genuine programmed ribosomal frameshifting; we have experimentally confirmed four new families of programmed frameshifts.
Project description: Frameshifts in protein coding sequences are widely perceived as resulting in either nonfunctional or even deleterious protein products. Indeed, frameshifts typically lead to markedly altered protein sequences and premature stop codons. By analyzing complete proteomes from all three domains of life, we demonstrate that, in contrast, several key physicochemical properties of protein sequences exhibit significant robustness against +1 and -1 frameshifts. In particular, we show that hydrophobicity profiles of many protein sequences remain largely invariant upon frameshifting. For example, over 2,900 human proteins exhibit a Pearson's correlation coefficient R between the hydrophobicity profiles of the original and the +1-frameshifted variants greater than 0.7, despite an average sequence identity between the two of only 6.5% in this group. We observe a similar effect for protein sequence profiles of affinity for certain nucleobases as well as protein sequence profiles of intrinsic disorder. Finally, analysis of significance and optimality demonstrates that frameshift stability is embedded in the structure of the universal genetic code and may have contributed to shaping it. Our results suggest that frameshifting may be a powerful evolutionary mechanism for creating new proteins with vastly different sequences, yet similar physicochemical properties to the proteins from which they originate.
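The comparison of hydrophobicity profiles between an original and a frameshifted translation, as described in the abstract above, can be sketched as follows. Several choices here are assumptions rather than details taken from the source: the Kyte-Doolittle scale stands in for whichever hydrophobicity scale the authors used, a "+1 frameshift" is modelled as starting translation one nucleotide downstream, no window smoothing is applied, positions where either frame has a stop codon are skipped, and all function names are hypothetical.

```python
from itertools import product
from math import sqrt

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Kyte-Doolittle hydropathy values (assumed scale, see lead-in).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}


def translate(dna, offset=0):
    """Translate complete codons starting at `offset`; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(offset, len(dna) - 2, 3)
    )


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def frameshift_hydrophobicity_correlation(dna, shift=1):
    """Correlate per-residue hydropathy of frame 0 with the shifted frame."""
    ref = translate(dna, 0)
    alt = translate(dna, shift)
    # Skip aligned positions where either frame contains a stop codon.
    pairs = [(KD[a], KD[b]) for a, b in zip(ref, alt) if "*" not in (a, b)]
    x, y = zip(*pairs)
    return pearson(x, y)


if __name__ == "__main__":
    toy = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
    print(translate(toy), translate(toy, 1))
    print(frameshift_hydrophobicity_correlation(toy))
```

On a sequence this short the coefficient is not meaningful; the sketch only shows the mechanics of pairing residues across frames before correlating their hydropathy values.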
Project description: The parallel β-helix is a geometrically regular fold commonly found in the proteomes of bacteria, viruses, fungi, archaea, and some vertebrates. β-helix structure has been observed in monomeric units of some aggregated amyloid fibers. In contrast, soluble β-helices, both right- and left-handed, are usually "capped" on each end by one or more secondary structures. Here, an in-depth classification of the diverse range of β-helix cap structures reveals subtle commonalities in structural components and in interactions with the β-helix core. Based on these uncovered commonalities, a toolkit of automated predictors was developed for the two distinct types of cap structures. In vitro deletion of the toolkit-predicted C-terminal cap from the pertactin β-helix resulted in increased aggregation and the formation of soluble oligomeric species. These results suggest that β-helix cap motifs can prevent specific, β-sheet-mediated oligomeric interactions, similar to those observed in amyloid formation.
Project description: A new mathematical method for detecting potential reading frameshifts in protein-coding sequences (cds) was developed. The algorithm is adjusted to the triplet periodicity of each analysed sequence using dynamic programming and a genetic algorithm. This does not require any preliminary training. Using the developed method, cds from the Arabidopsis thaliana genome were analysed. In total, the algorithm found 9,930 sequences containing one or more potential reading frameshift(s). This is approximately 21% of all analysed sequences of the genome. The Type I and Type II error rates were estimated as 11% and 30%, respectively. Similar results were obtained for the genomes of Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Rattus norvegicus and Xenopus tropicalis. Also, the developed algorithm was tested on 17 bacterial genomes. We compared our results with previously obtained data on the search for potential reading frameshifts in these genomes. This study discusses the possibility that the reading frameshift is a relatively frequent mutation that could participate in the creation of new genes and proteins.
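The notion of triplet periodicity that the method above relies on can be illustrated with a small sketch. This is a simplified stand-in, not the published algorithm (which combines dynamic programming with a genetic algorithm): it tabulates base counts by codon position and measures how strongly base identity depends on that position, using mutual information in bits. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def triplet_periodicity_matrix(seq):
    """Count each base separately at codon positions 0, 1 and 2."""
    counts = {phase: Counter() for phase in range(3)}
    for i, base in enumerate(seq):
        counts[i % 3][base] += 1
    return counts


def periodicity_information(seq):
    """Mutual information (bits) between codon position and base identity."""
    n = len(seq)
    counts = triplet_periodicity_matrix(seq)
    base_totals = Counter(seq)
    mi = 0.0
    for phase in range(3):
        phase_total = sum(counts[phase].values())
        for base, c in counts[phase].items():
            # p(phase, base) * log2( p(phase, base) / (p(phase) * p(base)) )
            mi += (c / n) * log2(c * n / (phase_total * base_totals[base]))
    return mi


if __name__ == "__main__":
    print(periodicity_information("ACG" * 10))  # perfectly periodic toy input
    print(periodicity_information("A" * 30))    # no positional signal
```

A frameshift inside a coding sequence changes which bases fall at which codon position downstream of the shift, so comparing position-specific counts on either side of a candidate site is one simple way to look for a change of phase.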
Project description: DNA shape adds specificity to sequence motifs but has not been explored systematically outside this context. We hypothesized that DNA-binding proteins (DBPs) preferentially occupy DNA with specific structures ("shape motifs") regardless of whether or not these correspond to high information content sequence motifs. We present ShapeMF, a Gibbs sampling algorithm that identifies de novo shape motifs. Using binding data from hundreds of in vivo and in vitro experiments, we show that most DBPs have shape motifs and can occupy these in the absence of sequence motifs. This "shape-only binding" is common for many DBPs and in regions co-bound by multiple DBPs. When shape and sequence motifs co-occur, they can be overlapping, flanking, or separated by consistent spacing. Finally, DBPs within the same protein family have different shape motifs, explaining their distinct genome-wide occupancy despite having similar sequence motifs. These results suggest that shape motifs not only complement sequence motifs but also facilitate recognition of DNA beyond conventionally defined sequence motifs.
Project description: Haemophilus influenzae HxuA is a cell-surface protein with haem-haemopexin binding activity which is key to haem acquisition from haemopexin and thus is one of the potential sources of haem for this microorganism. HxuA is secreted by its specific transporter HxuB. HxuA/HxuB belongs to the so-called two-partner secretion systems (TPSs) that are characterized by a conserved N-terminal domain in the secreted protein which is essential for secretion. Here, the 1.5 Å resolution structure of the secretion domain of HxuA, HxuA301, is reported. The structure reveals that HxuA301 folds into a β-helix domain with two extra-helical motifs, a four-stranded β-sheet and an N-terminal cap. Comparisons with other structures of TpsA secretion domains are reported. They reveal that despite limited sequence identity, strong structural similarities are found between the β-helix motifs, consistent with the idea that the TPS domain plays a role not only in the interaction with the specific TpsB partners but also as the scaffold initiating progressive folding of the TpsA proteins at the bacterial surface.
Project description: The genome of Helicobacter pylori is remarkable for its large number of restriction-modification (R-M) systems, and strain-specific diversity in R-M systems has been suggested to limit natural transformation, the major driving force of genetic diversification in H. pylori. We have determined the comprehensive methylomes of two H. pylori strains at single base resolution, using Single Molecule Real-Time (SMRT®) sequencing. For strains 26695 and J99-R3, 17 and 22 methylated sequence motifs were identified, respectively. For most motifs, almost all sites occurring in the genome were detected as methylated. Twelve novel methylation patterns corresponding to nine recognition sequences were detected (26695, 3; J99-R3, 6). Functional inactivation, correction of frameshifts, and cloning and expression of candidate methyltransferases (MTases) not only permitted the functional characterization of multiple, as yet undescribed, MTases, but also revealed novel features of both Type I and Type II R-M systems, including frameshift-mediated changes of sequence specificity and the interaction of one MTase with two alternative specificity subunits resulting in different methylation patterns. The methylomes of these well-characterized H. pylori strains will provide a valuable resource for future studies investigating the role of H. pylori R-M systems in limiting transformation as well as in gene regulation and host interaction.
Project description: Synthesis of the Gag-Pol protein of the human immunodeficiency virus type 1 (HIV-1) requires a programmed -1 ribosomal frameshift when ribosomes translate the unspliced viral messenger RNA. This frameshift occurs at a slippery sequence followed by an RNA structure motif that stimulates frameshifting. This motif is commonly assumed to be a simple stem-loop for HIV-1. In this study, we show that the frameshift stimulatory signal is more complex than believed and consists of a two-stem helix. The upper stem-loop corresponds to the classic stem-loop, and the lower stem is formed by pairing the spacer region following the slippery sequence and preceding this classic stem-loop with a segment downstream of this stem-loop. A three-purine bulge interrupts the two stems. This structure was suggested by enzymatic probing with nuclease V1 of an RNA fragment corresponding to the gag/pol frameshift region of HIV-1. The involvement of the novel lower stem in frameshifting was supported by site-directed mutagenesis. A fragment encompassing the gag/pol frameshift region of HIV-1 was inserted at the beginning of the coding sequence of a reporter gene coding for the firefly luciferase, such that expression of luciferase requires a -1 frameshift. When the reporter was expressed in COS cells, mutations that disrupt the capacity to form the lower stem reduced frameshifting, whereas compensatory changes that allow re-formation of this stem restored the frameshift efficiency to near wild-type level. The two-stem structure that we propose for the frameshift stimulatory signal of HIV-1 differs from the RNA triple helix structure recently proposed.
Project description: It is generally accepted that longer microsatellites mutate more frequently in defective DNA mismatch repair (MMR) than shorter microsatellites. Indeed, we have previously observed that the A10 microsatellite of transforming growth factor beta type II receptor (TGFBR2) frameshifts -1 bp at a faster rate than the A8 microsatellite of activin type II receptor (ACVR2), although both genes become frameshift-mutated in >80% of MMR-defective colorectal cancers. To experimentally determine the effect of microsatellite length upon frameshift mutation in gene-specific sequence contexts, we altered the microsatellite length within TGFBR2 exon 3 and ACVR2 exon 10, generating A7, A10 and A13 constructs. These constructs were cloned 1 bp out of frame of EGFP, allowing a -1 bp frameshift to drive EGFP expression, and stably transfected into MMR-deficient cells. Subsequent non-fluorescent cells were sorted, cultured for 7-35 days and harvested for EGFP analysis and DNA sequencing. Longer microsatellites within TGFBR2 and ACVR2 showed significantly higher mutation rates than shorter ones, with TGFBR2 A13, A10 and A7 frameshifts measured at 22.38 × 10^-4, 2.17 × 10^-4 and 0.13 × 10^-4, respectively. Surprisingly, shorter ACVR2 constructs showed three times higher mutation rates at A7 and A10 lengths than identical length TGFBR2 constructs but comparably lower at the A13 length, suggesting influences from both microsatellite length and sequence context. Furthermore, the TGFBR2 A13 construct mutated into 33% A11 sequences (-2 bp) in addition to expected A12 (-1 bp), indicating that this construct undergoes continual subsequent frameshift mutation. These data demonstrate experimentally that both the length of a mononucleotide microsatellite and its sequence context influence mutation rate in defective DNA MMR.