ABSTRACT: In comparative genomic studies, syntenic groups of homologous sequence in the same order have been used as supplementary information that can be used in helping to determine the orthology of the compared sequences. The assumption is that orthologous gene copies are more likely to share the same genome positions and share the same gene neighbors. In this study we have defined positional homologs as those that also have homologous neighboring genes and we investigated the usefulness of this distinction for bacterial comparative genomics. We considered the identification of positionaly homologous gene pairs in bacterial genomes using protein and DNA sequence level alignments and found that the positional homologs had on average relatively lower rates of substitution at the DNA level (synonymous substitutions) than duplicate homologs in different genomic locations, regardless of the level of protein sequence divergence (measured with non-synonymous substitution rate). Since gene order conservation can indicate accuracy of orthology assignments, we also considered the effect of imposing certain alignment quality requirements on the sensitivity and specificity of identification of protein pairs by BLAST and FASTA when neighboring information is not available and in comparisons where gene order is not conserved. We found that the addition of a stringency filter based on the second best hits was an efficient way to remove dubious ortholog identifications in BLAST and FASTA analyses. Gene order conservation and DNA sequence homology are useful to consider in comparative genomic studies as they may indicate different orthology assignments than protein sequence homology alone.
Project description:BACKGROUND: Orthology is a central tenet of comparative genomics and ortholog identification is instrumental to protein function prediction. Major advances have been made to determine orthology relations among a set of homologous proteins. However, they depend on the comparison of individual sequences and do not take into account divergent orthologs. RESULTS: We have developed an iterative orthology prediction method, Ortho-Profile, that uses reciprocal best hits at the level of sequence profiles to infer orthology. It increases ortholog detection by 20% compared to sequence-to-sequence comparisons. Ortho-Profile predicts 598 human orthologs of mitochondrial proteins from Saccharomyces cerevisiae and Schizosaccharomyces pombe with 94% accuracy. Of these, 181 were not known to localize to mitochondria in mammals. Among the predictions of the Ortho-Profile method are 11 human cytochrome c oxidase (COX) assembly proteins that are implicated in mitochondrial function and disease. Their co-expression patterns, experimentally verified subcellular localization, and co-purification with human COX-associated proteins support these predictions. For the human gene C12orf62, the ortholog of S. cerevisiae COX14, we specifically confirm its role in negative regulation of the translation of cytochrome c oxidase. CONCLUSIONS: Divergent homologs can often only be detected by comparing sequence profiles and profile-based hidden Markov models. The Ortho-Profile method takes advantage of these techniques in the quest for orthologs.
Project description:BACKGROUND: Identification of homologous regions or conserved syntenies across genomes is one crucial step in comparative genomics. This task is usually performed by genome alignment softwares like WABA or blastz. In case of conserved syntenies, such regions are defined as conserved gene orders. On the gene order level, homologous regions can even be found between distantly related genomes, which do not align on the nucleotide sequence level. RESULTS: We present a novel approach to identify regions of conserved synteny across multiple genomes. Syntenator represents genomes and alignments thereof as partial order graphs (POGs). These POGs are aligned by a dynamic programming approach employing a gene-specific scoring function. The scoring function reflects the level of protein sequence similarity for each possible gene pair. Our method consistently defines larger homologous regions in pairwise gene order alignments than nucleotide-level comparisons. Our method is superior to methods that work on predefined homology gene sets (as implemented in Blockfinder). Syntenator successfully reproduces 80% of the EnsEMBL man-mouse conserved syntenic blocks. The full potential of our method becomes visible by comparing remotely related genomes and multiple genomes. Gene order alignments potentially resolve up to 75% of the EnsEMBL 1:many orthology relations and 27% of the many:many orthology relations. CONCLUSION: We propose Syntenator as a software solution to reliably infer conserved syntenies among distantly related genomes. The software is available from http://www2.tuebingen.mpg.de/abt4/plone.
Project description:The elucidation of orthology relationships is an important step both in gene function prediction as well as towards understanding patterns of sequence evolution. Orthology assignments are usually derived directly from sequence similarities for large data because more exact approaches exhibit too high computational costs. Here we present PoFF, an extension for the standalone tool Proteinortho, which enhances orthology detection by combining clustering, sequence similarity, and synteny. In the course of this work, FFAdj-MCS, a heuristic that assesses pairwise gene order using adjacencies (a similarity measure related to the breakpoint distance) was adapted to support multiple linear chromosomes and extended to detect duplicated regions. PoFF largely reduces the number of false positives and enables more fine-grained predictions than purely similarity-based approaches. The extension maintains the low memory requirements and the efficient concurrency options of its basis Proteinortho, making the software applicable to very large datasets.
Project description:Multiple sequence alignments are essential in homology inference, structure modeling, functional prediction and phylogenetic analysis. We developed a web server that constructs multiple protein sequence alignments using PROMALS, a progressive method that improves alignment quality by using additional homologs from PSI-BLAST searches and secondary structure predictions from PSIPRED. PROMALS shows higher alignment accuracy than other advanced methods, such as MUMMALS, ProbCons, MAFFT and SPEM. The PROMALS web server takes FASTA format protein sequences as input. The output includes a colored alignment augmented with information about sequence grouping, predicted secondary structures and positional conservation. The PROMALS web server is available at: http://prodata.swmed.edu/promals/
Project description:<h4>Background</h4>Large amino acid transporter gene families were identified from the genome sequences of three parasitic protists, Trypanosoma brucei, Trypanosoma cruzi and Leishmania major. These genes encode molecular sensors of the external host environment for trypanosomatid cells and are crucial to modulation of gene expression as the parasite passes through different life stages. This study provides a comprehensive phylogenetic account of the origins of these genes, redefining each locus according to a positional criterion, through the integration of phyletic identity with comparative gene order information.<h4>Results</h4>Each locus was individually specified by its surrounding gene order and associated with homologs showing the same position ('homoeologs') in other species, where available. Bayesian and maximum likelihood phylogenies were in general agreement on systematic relationships and confirmed several 'orthology sets' of genes retained since divergence from the common ancestor. Reconciliation analysis quantified the scale of duplication and gene loss, as well as identifying further apparent orthology sets, which lacked conservation of genomic position. These instances suggested substantial genomic restructuring or transposition. Other analyses identified clear instances of evolutionary rate changes post-duplication, the effects of concerted evolution within tandem gene arrays and gene conversion events between syntenic loci.<h4>Conclusion</h4>Despite their importance to cell function and parasite development, the repertoires of AAT loci in trypanosomatid parasites are relatively fluid in both complement and gene dosage. Some loci are ubiquitous and, after an ancient origin through transposition, originated through descent from the ancestral trypanosomatid. However, reconciliation analysis demonstrated that unilateral expansions of gene number through tandem gene duplication, transposition of gene duplicates to otherwise well conserved genomic positions, and differential patterns of gene loss have produced largely customised and idiosyncratic AAT repertoires in all three species. Not least in T. brucei, which seems to have retained fewer ancestral loci and has acquired novel loci through a complex mix of tandem and transpositive duplication.
Project description:BACKGROUND: Accurate determination of orthology is central to comparative genomics. For vertebrates in particular, very large gene families, high rates of gene duplication and loss, multiple mechanisms of gene duplication, and high rates of retrotransposition all combine to make inference of orthology between genes difficult. Many methods have been developed to identify orthologous genes, mostly based upon analysis of the inferred protein sequence of the genes. More recently, methods have been proposed that use genomic context in addition to protein sequence to improve orthology assignment in vertebrates. Such methods have been most successfully implemented in fungal genomes and have long been used in prokaryotic genomes, where gene order is far less variable than in vertebrates. However, to our knowledge, no explicit comparison of synteny and sequence based definitions of orthology has been reported in vertebrates, or, more specifically, in mammals. RESULTS: We test a simple method for the measurement and utilization of gene order (local synteny) in the identification of mammalian orthologs by investigating the agreement between coding sequence based orthology (Inparanoid) and local synteny based orthology. In the 5 mammalian genomes studied, 93% of the sampled inter-species pairs were found to be concordant between the two orthology methods, illustrating that local synteny is a robust substitute to coding sequence for identifying orthologs. However, 7% of pairs were found to be discordant between local synteny and Inparanoid. These cases of discordance result from evolutionary events including retrotransposition and genome rearrangements. CONCLUSIONS: By analyzing cases of discordance between local synteny and Inparanoid we show that local synteny can distinguish between true orthologs and recent retrogenes, can resolve ambiguous many-to-many orthology relationships into one-to-one ortholog pairs, and might be used to identify cases of non-orthologous gene displacement by retroduplicated paralogs.
Project description:Over the last decades, vast numbers of sequences were deposited in public databases. Bioinformatics tools allow homology and consequently functional inference for these sequences. New profile-based homology search tools have been introduced, allowing reliable detection of remote homologs, but have not been systematically benchmarked. To provide such a comparison, which can guide bioinformatics workflows, we extend and apply our previously developed benchmark approach to evaluate the 'next generation' of profile-based approaches, including CS-BLAST, HHSEARCH and PHMMER, in comparison with the non-profile based search tools NCBI-BLAST, USEARCH, UBLAST and FASTA.We generated challenging benchmark datasets based on protein domain architectures within either the PFAM?+?Clan, SCOP/Superfamily or CATH/Gene3D domain definition schemes. From each dataset, homologous and non-homologous protein pairs were aligned using each tool, and standard performance metrics calculated. We further measured congruence of domain architecture assignments in the three domain databases.CSBLAST and PHMMER had overall highest accuracy. FASTA, UBLAST and USEARCH showed large trade-offs of accuracy for speed optimization.Profile methods are superior at inferring remote homologs but the difference in accuracy between methods is relatively small. PHMMER and CSBLAST stand out with the highest accuracy, yet still at a reasonable computational cost. Additionally, we show that less than 0.1% of Swiss-Prot protein pairs considered homologous by one database are considered non-homologous by another, implying that these classifications represent equivalent underlying biological phenomena, differing mostly in coverage and granularity.Benchmark datasets and all scripts are placed at (http://sonnhammer.org/download/Homology_benchmark).email@example.comSupplementary data are available at Bioinformatics online.
Project description:BACKGROUND:The gene family-free framework for comparative genomics aims at providing methods for gene order analysis that do not require prior gene family assignment, but work directly on a sequence similarity graph. We study two problems related to the breakpoint median of three genomes, which asks for the construction of a fourth genome that minimizes the sum of breakpoint distances to the input genomes. METHODS:We present a model for constructing a median of three genomes in this family-free setting, based on maximizing an objective function that generalizes the classical breakpoint distance by integrating sequence similarity in the score of a gene adjacency. We study its computational complexity and we describe an integer linear program (ILP) for its exact solution. We further discuss a related problem called family-free adjacencies for k genomes for the special case of [Formula: see text] and present an ILP for its solution. However, for this problem, the computation of exact solutions remains intractable for sufficiently large instances. We then proceed to describe a heuristic method, FFAdj-AM, which performs well in practice. RESULTS:The developed methods compute accurate positional orthologs for genomes comparable in size of bacterial genomes on simulated data and genomic data acquired from the OMA orthology database. In particular, FFAdj-AM performs equally or better when compared to the well-established gene family prediction tool MultiMSOAR. CONCLUSIONS:We study the computational complexity of a new family-free model and present algorithms for its solution. With FFAdj-AM, we propose an appealing alternative to established tools for identifying higher confidence positional orthologs.
Project description:The complete genome sequence of the nematode Caenorhabditis elegans provides an excellent basis for studying the distribution and evolution of protein families in higher eukaryotes. Three fundamental questions are as follows: How many paralog clusters exist in one species, how many of these are shared with other species, and how many proteins can be assigned a functional counterpart in other species? We have addressed these questions in a detailed study of predicted membrane proteins in C. elegans and their mammalian homologs. All worm proteins predicted to contain at least two transmembrane segments were clustered on the basis of sequence similarity. This resulted in 189 groups with two or more sequences, containing, in total, 2647 worm proteins. Hidden Markov models (HMMs) were created for each family, and were used to retrieve mammalian homologs from the SWISSPROT, TREMBL, and VTS databases. About one-half of these clusters had mammalian homologs. Putative worm-mammalian orthologs were extracted by use of nine different phylogenetic methods and BLAST. Eight clusters initially thought to be worm-specific were assigned mammalian homologs after searching EST and genomic sequences. A compilation of 174 orthology assignments made with high confidence is presented.
Project description:Sequence similarity searches performed with BLAST, SSEARCH and FASTA achieve high sensitivity by using scoring matrices (e.g. BLOSUM62) that target low identity (<33%) alignments. Although such scoring matrices can effectively identify distant homologs, they can also produce local alignments that extend beyond the homologous regions.We measured local alignment start/stop boundary accuracy using a set of queries where the correct alignment boundaries were known, and found that 7% of BLASTP and 8% of SSEARCH alignment boundaries were overextended. Overextended alignments include non-homologous sequences; they occur most frequently between sequences that are more closely related (>33% identity). Adjusting the scoring matrix to reflect the identity of the homologous sequence can correct higher identity overextended alignment boundaries. In addition, the scoring matrix that produced a correct alignment could be reliably predicted based on the sequence identity seen in the original BLOSUM62 alignment. Realigning with the predicted scoring matrix corrected 37% of all overextended alignments, resulting in more correct alignments than using BLOSUM62 alone.