ABSTRACT: The de facto commoditization of biomolecular crystallography as a result of almost disruptive instrumentation automation and continuing improvement of software allows any sensibly trained structural biologist to conduct crystallographic studies of biomolecules with reasonably valid outcomes: that is, models based on properly interpreted electron density. Robust validation has led to major mistakes in the protein part of structure models becoming rare, but some depositions of protein-peptide complex structure models, which generally carry significant interest to the scientific community, still contain erroneous models of the bound peptide ligand. Here, the protein small-molecule ligand validation tool Twilight is updated to include peptide ligands. (i) The primary technical reasons and potential human factors leading to problems in ligand structure models are presented; (ii) a new method used to score peptide-ligand models is presented; (iii) a few instructive and specific examples, including an electron-density-based analysis of peptide-ligand structures that do not contain any ligands, are discussed in detail; (iv) means to avoid such mistakes and the implications for database integrity are discussed and (v) some suggestions as to how journal editors could help to expunge errors from the Protein Data Bank are provided.
Project description:Three-dimensional models of protein structures determined by X-ray crystallography are based on the interpretation of experimentally derived electron-density maps. The real-space correlation coefficient (RSCC) provides an easily comprehensible, objective measure of the residue-based fit of atom coordinates to electron density. Among protein structure models, protein-ligand complexes are of special interest, given their contribution to understanding the molecular underpinnings of biological activity and to drug design. For consumers of such models, it is not trivial to determine the degree to which ligand-structure modelling is biased by subjective electron-density interpretation. A standalone script, Twilight, is presented for the analysis, visualization and annotation of a pre-filtered set of 2815 protein-ligand complexes deposited with the PDB as of 15 January 2012 with ligand RSCC values that are below a threshold of 0.6. It also provides simplified access to the visualization of any protein-ligand complex available from the PDB and annotated by the Uppsala Electron Density Server. The script runs on various platforms and is available for download at http://www.ruppweb.org/twilight/.
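As an illustration of the RSCC filter described above, the following minimal sketch computes a correlation coefficient between observed and calculated density samples and flags ligands below the 0.6 threshold named in the abstract. It assumes the density values around each residue have already been sampled onto matching grid points; the function names `rscc` and `flag_poor_fit` are hypothetical and are not part of Twilight or the Uppsala Electron Density Server.

```python
import math

RSCC_THRESHOLD = 0.6  # cutoff quoted in the abstract


def rscc(rho_obs, rho_calc):
    """Pearson correlation between observed and calculated density samples.

    Assumption: both sequences hold density values at the same grid points
    around one residue or ligand.
    """
    if len(rho_obs) != len(rho_calc) or len(rho_obs) < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    n = len(rho_obs)
    mean_obs = sum(rho_obs) / n
    mean_calc = sum(rho_calc) / n
    cov = sum((o - mean_obs) * (c - mean_calc) for o, c in zip(rho_obs, rho_calc))
    var_obs = sum((o - mean_obs) ** 2 for o in rho_obs)
    var_calc = sum((c - mean_calc) ** 2 for c in rho_calc)
    if var_obs == 0 or var_calc == 0:
        raise ValueError("correlation is undefined for constant density")
    return cov / math.sqrt(var_obs * var_calc)


def flag_poor_fit(ligand_rscc, threshold=RSCC_THRESHOLD):
    """Return the names of ligands whose RSCC falls below the threshold."""
    return [name for name, value in ligand_rscc.items() if value < threshold]
```

A call such as `flag_poor_fit({"LIG_A": 0.91, "LIG_B": 0.45})` would return only the second entry, which is the kind of pre-filtering the abstract describes.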
Project description:Strong DNA conservation among divergent species is an indicator of enduring functionality. With weaker sequence conservation we enter a vast 'twilight zone' in which sequence subject to transient or lower constraint cannot be distinguished easily from neutrally evolving, non-functional sequence. Twilight zone functional sequence is illuminated instead by principles of selective constraint and positive selection using genomic data acquired from within a species' population. Application of these principles reveals that despite being biochemically active, most twilight zone sequence is not functional.
Project description:ABSTRACT: Studying individual flight behaviour throughout the year is indispensable to understand the ecology of a bird species. Recent developments in technology now allow the flight behaviour of small long-distance bird migrants to be tracked throughout their annual cycle. The specific flight behaviour of twilight ascents in birds has been documented in a few studies, but only during a short period of the year, and never quantified on the individual level. It has been suggested that twilight ascents might play a role in orientation and navigation. Previous studies had reported the behaviour only near the breeding site and during migration. We investigated year-round flight behaviour of 34 individual Alpine swifts (Apus melba) of four different populations in relation to twilight ascents. We recorded twilight ascents all around the year and found a twofold higher frequency of ascents during the non-breeding residence phase in Africa compared to all other phases of the year. Dawn ascents were twice as common as dusk ascents and occurred mainly when atmospheric conditions remained stable over a 24-h period. We found no conclusive support that twilight ascents are essential for recalibration of compass cues and landmarks. Data on the wing flapping intensity revealed that high activity at twilight occurred more regularly than the ascents. We therefore conclude that Alpine swifts generally increase flight activity (including horizontal flight) during the twilight period, and we suppose that this increased flight activity, including ascents, might be part of social interactions between individuals. SIGNIFICANCE STATEMENT: Year-round flight altitude tracking with a light-weight multi-sensor tag reveals that Alpine swifts regularly ascend several hundred meters at twilight. The reason for this behaviour remains unclear and the low-light conditions at this time of the day preclude foraging as a possibility.
The frequency and altitude of twilight ascents were highest during the non-breeding period, intermediate during migration and low for active breeders during the breeding phase. We discuss our findings in the context of existing hypotheses on twilight ascent and we propose an additional hypothesis which links twilight ascent with social interaction between flock members. Our study highlights how flight behaviour of individuals of a migratory bird species can be studied even during the sparsely documented non-breeding period.
Project description:Sequence alignment of proteins and nucleic acids is a routine task in bioinformatics. Although the comparison of complete peptides, genes or genomes can be undertaken with a great variety of tools, the alignment of short DNA sequences and motifs entails pitfalls that have not been fully addressed yet. Here we confront the structural superposition of transcription factors with the sequence alignment of their recognized cis elements. Our goals are (i) to test TFcompare (http://floresta.eead.csic.es/tfcompare), a structural alignment method for protein-DNA complexes; (ii) to benchmark the pairwise alignment of regulatory elements; (iii) to define the confidence limits and the twilight zone of such alignments and (iv) to evaluate the relevance of these thresholds with elements obtained experimentally. We find that the structure of cis elements and protein-DNA interfaces is significantly more conserved than their sequence and measure how this correlates with alignment errors when only sequence information is considered. Our results confirm that DNA motifs in the form of matrices produce better alignments than individual sequences. Finally, we report that empirical and theoretically derived twilight thresholds are useful for estimating the natural plasticity of regulatory sequences, and hence for filtering out unreliable alignments.
Project description:BACKGROUND: A widely used method to find conserved secondary structure in RNA is to first construct a multiple sequence alignment, and then fold the alignment, optimizing a score based on thermodynamics and covariance. This method works best around 75% sequence similarity. However, in a "twilight zone" below 55% similarity, the sequence alignment tends to obscure the covariance signal used in the second phase. Therefore, while the overall shape of the consensus structure may still be found, the degree of conservation cannot be estimated reliably. RESULTS: Based on a combination of available methods, we present a method named planACstar for improving structure conservation in structural alignments in the twilight zone. After constructing a consensus structure by alignment folding, planACstar abandons the original sequence alignment, refolds the sequences individually, but consistent with the consensus, aligns the structures, irrespective of sequence, by a pure structure alignment method, and derives an improved sequence alignment from the alignment of structures, to be re-submitted to alignment folding, and so on. This cycle may be iterated as long as structural conservation improves, but normally, one step suffices. CONCLUSIONS: Employing the tools ClustalW, RNAalifold, and RNAforester, we find that for sequences with 30-55% sequence identity, structural conservation can be improved by 10% on average, with a large variation, measured in terms of RNAalifold's own criterion, the structure conservation index.
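The control flow of the refinement cycle in this abstract, iterating only while structural conservation improves, can be sketched generically. This is a minimal sketch under stated assumptions: `score` stands in for a conservation measure such as the structure conservation index, and `refine` stands in for the refold, structure-align and re-derive step. Both are hypothetical callables supplied by the caller, not the planACstar, ClustalW, RNAalifold or RNAforester interfaces.

```python
def refine_until_no_gain(alignment, score, refine, max_rounds=10):
    """Repeat a refinement step while the score strictly improves.

    alignment  : any object representing the current alignment (assumption)
    score      : callable(alignment) -> float, higher is better (assumption)
    refine     : callable(alignment) -> new alignment (assumption)
    max_rounds : safety cap on the number of refinement attempts
    Returns (best_alignment, best_score, accepted_rounds).
    """
    best = alignment
    best_score = score(best)
    accepted = 0
    for _ in range(max_rounds):
        candidate = refine(best)
        candidate_score = score(candidate)
        if candidate_score <= best_score:
            # No further improvement: keep the previous alignment and stop.
            break
        best, best_score = candidate, candidate_score
        accepted += 1
    return best, best_score, accepted
```

With toy stand-ins such as `score=lambda x: -(x - 3) ** 2` and `refine=lambda x: x + 1`, starting from `0`, the loop accepts three steps and then stops at the first candidate that fails to improve the score.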
Project description:Phylogenetic tree reconstruction is traditionally based on multiple sequence alignments (MSAs) and heavily depends on the validity of this information bottleneck. With increasing sequence divergence, the quality of MSAs decays quickly. Alignment-free methods, on the other hand, are based on abstract string comparisons and avoid potential alignment problems. However, in general they are not biologically motivated and ignore our knowledge about the evolution of sequences. Thus, it is still a major open question how to define an evolutionary distance metric between divergent sequences that makes use of indel information and known substitution models without the need for a multiple alignment. Here we propose a new evolutionary distance metric to close this gap. It uses finite-state transducers to create a biologically motivated similarity score which models substitutions and indels, and does not depend on a multiple sequence alignment. The sequence similarity score is defined in analogy to pairwise alignments and additionally has the positive semi-definite property. We describe its derivation and show in simulation studies and real-world examples that it is more accurate in reconstructing phylogenies than competing methods. The result is a new and accurate way of determining evolutionary distances in and beyond the twilight zone of sequence alignments that is suitable for large datasets.
Project description:We characterized the microbial communities and proteomes of POC collected from the twilight zone at three contrasting sites in the northwest Pacific Ocean using a metaproteomic approach. Particle-attached bacteria (Alteromonadales, Rhodobacterales and Enterobacteriales) were the major remineralizers of POC in the twilight zone.
Project description:BACKGROUND: Protein structure prediction methods provide accurate results when a homologous protein is predicted, while poorer predictions are obtained in the absence of homologous templates. However, some protein chains that share twilight-zone pairwise identity can form similar folds and thus determining structural similarity without the sequence similarity would be desirable for the structure prediction. The folding type of a protein or its domain is defined as the structural class. Current structural class prediction methods that predict the four structural classes defined in SCOP provide up to 63% accuracy for the datasets in which sequence identity of any pair of sequences belongs to the twilight-zone. We propose the SCPRED method, which improves prediction accuracy for sequences that share twilight-zone pairwise similarity with sequences used for the prediction. RESULTS: SCPRED uses a support vector machine classifier that takes several custom-designed features as its input to predict the structural classes. Based on extensive design that considers over 2300 index-, composition- and physicochemical properties-based features along with features based on the predicted secondary structure and content, the classifier's input includes 8 features based on information extracted from the secondary structure predicted with PSI-PRED and one feature computed from the sequence. Tests performed with datasets of 1673 protein chains, in which any pair of sequences shares twilight-zone similarity, show that SCPRED obtains 80.3% accuracy when predicting the four SCOP-defined structural classes, which is superior when compared with over a dozen recent competing methods that are based on support vector machine, logistic regression, and ensemble-of-classifiers predictors. CONCLUSION: SCPRED can accurately find similar structures for sequences that share low identity with the sequences used for the prediction.
The high predictive accuracy achieved by SCPRED is attributed to the design of the features, which are capable of separating the structural classes in spite of their low dimensionality. We also demonstrate that SCPRED's predictions can be successfully used as a post-processing filter to improve performance of modern fold classification methods.
Project description:BACKGROUND: Knowledge of structural class is used by numerous methods for identification of structural/functional characteristics of proteins and could be used for the detection of remote homologues, particularly for chains that share twilight-zone similarity. In contrast to existing sequence-based structural class predictors, which target four major classes and which are designed for high identity sequences, we predict seven classes from sequences that share twilight-zone identity with the training sequences. RESULTS: The proposed MODular Approach to Structural class prediction (MODAS) method is unique as it allows for selection of any subset of the classes. MODAS is also the first to utilize a novel, custom-built feature-based sequence representation that combines evolutionary profiles and predicted secondary structure. The features quantify information relevant to the definition of the classes including conservation of residues and arrangement and number of helix/strand segments. Our comprehensive design considers 8 feature selection methods and 4 classifiers to develop Support Vector Machine-based classifiers that are tailored for each of the seven classes. Tests on 5 twilight-zone and 1 high-similarity benchmark datasets and comparison with over two dozen modern competing predictors show that MODAS provides the best overall accuracy that ranges between 80% and 96.7% (83.5% for the twilight-zone datasets), depending on the dataset. This translates into 19% and 8% error rate reduction when compared against the best performing competing method on the two largest datasets. The proposed predictor provides predictions at 58% accuracy for the membrane protein class, which is not considered by the majority of existing methods, even though this class accounts for only 2% of the data. Our predictive model is analyzed to demonstrate how and why the input features are associated with the corresponding classes.
CONCLUSIONS: The improved predictions stem from the novel features that express collocation of the secondary structure segments in the protein sequence and that combine evolutionary and secondary structure information. Our work demonstrates that conservation and arrangement of the secondary structure segments predicted along the protein chain can successfully predict structural classes which are defined based on the spatial arrangement of the secondary structures. A web server is available at http://biomine.ece.ualberta.ca/MODAS/.
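The abstract reports improvements as error rate reductions rather than accuracy differences. The sketch below shows one common way such a figure is derived from two accuracies, assuming the reduction is relative to the baseline error rate; the abstract does not state the exact formula used. The function name and the example accuracies are hypothetical and are not taken from the MODAS evaluation.

```python
def relative_error_reduction(acc_baseline, acc_new):
    """Relative reduction in error rate when moving from baseline to new.

    Both arguments are accuracies expressed as fractions in [0, 1].
    Assumed definition: (err_baseline - err_new) / err_baseline.
    """
    err_baseline = 1.0 - acc_baseline
    err_new = 1.0 - acc_new
    if err_baseline <= 0:
        raise ValueError("baseline has no errors to reduce")
    return (err_baseline - err_new) / err_baseline


# Illustrative numbers only: a baseline at 80% accuracy and a new method
# at 84% accuracy correspond to error rates of 20% and 16%.
reduction = relative_error_reduction(0.80, 0.84)
print(f"relative error reduction: {reduction:.1%}")
```

Under this definition a gain of four accuracy points on a strong baseline is a larger relative reduction than the same gain on a weak one, which is presumably why the comparison is expressed this way.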
Project description:MOTIVATION: The computational modeling of peptide display by class I major histocompatibility complexes (MHCs) is essential for peptide-based therapeutics design. Existing computational methods for peptide display focus on modeling the peptide-MHC binding affinity. However, such models are not able to characterize the sequence features for the other cellular processes in the peptide display pathway that determine MHC ligand selection. RESULTS: We introduce a semi-supervised model, DeepLigand, that outperforms the state-of-the-art models in MHC class I ligand prediction. DeepLigand combines a peptide language model and peptide binding affinity prediction to score MHC class I peptide presentation. The peptide language model characterizes sequence features that correspond to secondary factors in MHC ligand selection other than binding affinity. The peptide embedding is learned by pre-training on natural ligands, and can discriminate between ligands and non-ligands in the absence of binding affinity prediction. Although conventional affinity-based models fail to classify peptides with moderate affinities, DeepLigand discriminates ligands from non-ligands with consistently high accuracy. AVAILABILITY AND IMPLEMENTATION: We make DeepLigand available at https://github.com/gifford-lab/DeepLigand. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.