Project description:Background: With the recent growth of information on sequence variations in the human genome, predictions regarding the functional effects and relevance to disease phenotypes of coding sequence variations are becoming increasingly important. The aims of this study were to catalog protein-coding sequence variations (CVs) occurring in genetic variation databases and to use bioinformatic programs to analyze CVs. In addition, we aimed to provide insight into the functionality of the reference databases. Methodology and findings: To catalog CVs on a genome-wide scale with regard to protein function and disease, we investigated three representative databases: the Human Gene Mutation Database (HGMD), the Single Nucleotide Polymorphisms database (dbSNP), and the Haplotype Map (HapMap). Using these three databases, we analyzed CVs at the protein function level with bioinformatic programs. We proposed a combinatorial approach using the Support Vector Machine (SVM) to increase the performance of the prediction programs. By cataloging the coding sequence variations using these databases, we found that 4.36% of CVs from HGMD are concurrently registered in dbSNP (8.11% of CVs from dbSNP are concurrent in HGMD). The pattern of substitutions and the functional consequences predicted by three bioinformatic programs differed significantly among concurrent CVs and CVs occurring solely in HGMD or in dbSNP. The experimental results showed that the proposed SVM combination noticeably outperformed the individual prediction programs. Conclusions: This is the first study to compare human sequence variations in HGMD, dbSNP and HapMap at the genome-wide level. We found that a significant proportion of CVs in HGMD and dbSNP overlap, and we emphasize the need to use caution when interpreting the phenotypic relevance of these concurrent CVs. 
Combining bioinformatic programs can be helpful in predicting the functional consequences of CVs because it improved the performance of functional predictions.
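The two overlap figures above (4.36% of HGMD CVs found in dbSNP, 8.11% of dbSNP CVs found in HGMD) differ because the same shared set is divided by two different database sizes. The following is a minimal sketch of that calculation on made-up toy sets; the identifiers, set sizes, and the function name `overlap_percentages` are hypothetical and are not taken from the study.

```python
def overlap_percentages(db_a, db_b):
    """Return the shared entries as a percentage of each database."""
    shared = db_a & db_b
    return 100 * len(shared) / len(db_a), 100 * len(shared) / len(db_b)


# Hypothetical toy data: 200 entries in one set, 100 in the other, 10 shared.
toy_hgmd = {f"cv{i}" for i in range(0, 200)}
toy_dbsnp = {f"cv{i}" for i in range(190, 290)}

pct_of_hgmd, pct_of_dbsnp = overlap_percentages(toy_hgmd, toy_dbsnp)
print(pct_of_hgmd, pct_of_dbsnp)
```

If both reported percentages count the same shared set, their ratio (8.11 / 4.36, about 1.86) would correspond to the ratio of the two CV catalog sizes.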
Project description:Enterotoxigenic Escherichia coli (ETEC) is a leading cause of diarrheal disease in developing nations, where it accounts for a significant disease burden in children aged 0 to 59 months. It is also the number one bacterial causative agent of traveler's diarrhea. ETEC infects hosts through the fecal-oral route and utilizes colonization factors (CFs) to adhere within the small intestine. Over 25 CFs have been identified; 7 are considered major CFs, and a vaccine targeting these is predicted to provide protection against up to 66% of ETEC-associated disease. Coli Surface Antigen 6 (CS6) is a major CF and is associated with disease-causing ETEC isolates. Analysis of the CS6 operon sequence led to the identification of two regions of variability among clinical isolates, which we predicted would affect CS6 transcript and protein expression. A total of 7 recombinant E. coli strains were engineered to encode the CS6 operon in wild-type, hybrid, and mutant configurations. Western blot analysis and RT-qPCR provided evidence to support the importance of an intergenic hairpin structure for CS6 expression. Our results reveal the significance of CS6 sequence selection for ETEC vaccine development and present novel information regarding CS6 sequence variation in WT ETEC strains.
Project description:Apolipoprotein A1 (APOA1) is a potential biomarker because of its variable concentration in different types of cancers. The current study is the first of its kind to evaluate the APOA1 -75 G/A and +83 C/T genotypes in tandem with APOA1 protein expression in urine samples, in order to assess disease risk and the potential relationship between differentially expressed urinary protein and APOA1 genotypes. The study included 108 cases of bladder tumors and 150 healthy controls that were frequency matched to cases with respect to age, sex, and smoking status. Genotyping was performed using PCR-RFLP, and urinary expression of the APOA1 protein was measured using ELISA. Bladder tumor cases were significantly associated with the APOA1 -75 AA genotype (p < 0.05), while the APOA1 +83 C/T heterozygotes showed an association with cases (p < 0.05). The overall distribution of the different haplotypes showed a marked difference between the cases and controls in GT when compared with the wild-type GC (p < 0.03). Bladder tumor cases carrying the variant APOA1 -75 AA genotype more often (70.0%) showed higher expression (≥20 ng/mL) of the urinary APOA1 protein, differing significantly from wild-type GG (p = 0.03). Similarly, low-grade bladder tumors significantly more often showed higher urinary APOA1 expression (≥20 ng/mL; 52.4% vs. 15.4% of high-grade tumors), while high-grade tumor cases more often showed lower APOA1 expression (<20 ng/mL; 84.6% vs. 47.5% of low-grade tumors) (OR = 6.08, p = 0.002). A strong association was observed between APOA1 -75 G/A and the risk of bladder tumor and its relation to urinary protein expression, which substantiates its possible role as a marker for risk assessment of the disease and as a promising diagnostic marker for different grades of malignant bladder tumors.
Project description:It is commonly believed that similarities between the sequences of two proteins imply similarities between their structures. Sequence alignments reliably recognize pairs of proteins with similar structures provided that the percentage sequence identity between their two sequences is sufficiently high. This distinction, however, is statistically less reliable when the percentage sequence identity is lower than 30%, and little is then known about the detailed relationship between the two measures of similarity. Here, we investigate the inverse correlation between structural similarity and sequence similarity on 12 protein structure families. We define the structure similarity between two proteins as the cRMS distance between their structures. The sequence similarity for a pair of proteins is measured as the mean distance between the sequences in the subsets of sequence space compatible with their structures. We obtain an approximation of the sequence space compatible with a protein by designing a collection of protein sequences that are both stable and specific to the structure of that protein. Using these measures of sequence and structure similarities, we find that structural changes within a protein family are linearly related to changes in sequence similarity.
Project description:Background: Frameshift translation is an important phenomenon that contributes to the appearance of novel coding DNA sequences (CDS) and functions in gene evolution, by allowing alternative amino acid translations of gene coding regions. Frameshift translations can be identified by aligning two CDS, from the same gene or from homologous genes, while accounting for their codon structure. Two main classes of algorithms have been proposed to solve the problem of aligning CDS, either by amino acid sequence alignment back-translation, or by simultaneously accounting for the nucleotide and amino acid levels. The former cannot account for frameshift translations and, up to now, the latter has accounted only for frameshift translation initiation, without considering the length of the translation disruption caused by a frameshift. Results: We introduce a new scoring scheme with an algorithm for the pairwise alignment of CDS accounting for frameshift translation initiation and length, while simultaneously considering nucleotide and amino acid sequences. The main distinguishing feature of the scoring scheme is the introduction of a penalty cost accounting for frameshift extension length to compute an adequate similarity score for a CDS alignment. The second distinguishing feature of the model is that the search space of the problem solved is the set of all feasible alignments between two CDS. Previous approaches have considered a restricted search space or additional constraints on the decomposition of an alignment into length-3 sub-alignments. The algorithm described in this paper has the same asymptotic time complexity as the classical Needleman-Wunsch algorithm. Conclusions: We compare the method to other CDS alignment methods based on an application to the comparison of pairs of CDS from homologous human, mouse and cow genes of ten mammalian gene families from the Ensembl-Compara database. 
The results show that our method is particularly robust to parameter changes as compared to existing methods. It also appears to be a good compromise, performing well both in the presence and absence of frameshift translations. An implementation of the method is available at https://github.com/UdeS-CoBIUS/FsePSA.
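For reference, the classical Needleman-Wunsch recurrence that the abstract uses as its complexity baseline can be sketched as follows. This is only the textbook nucleotide-level global alignment score, not the FsePSA scoring scheme: it has no codon awareness and no frameshift initiation or extension penalties. The function name and the scoring parameters (`match`, `mismatch`, `gap`) are illustrative assumptions.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of sequences a and b with a linear gap penalty.

    Runs in O(len(a) * len(b)) time and keeps only two rows of the table.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # Column 0: prefix of a against the empty prefix of b.
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ATGGCT", "ATGCT"))
```

The quadratic table fill is what gives the classical algorithm its O(nm) running time, the bound the abstract says the frameshift-aware algorithm matches.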
Project description:The vast amount of protein sequence data now available, together with accumulating experimental knowledge of protein function, enables modeling of protein sequence and function evolution. The PANTHER database was designed to model evolutionary sequence-function relationships on a large scale. There are a number of applications for these data, and we have implemented web services that address three of them. The first is a protein classification service. Proteins can be classified, using only their amino acid sequences, to evolutionary groups at both the family and subfamily levels. Specific subfamilies, and often families, are further classified when possible according to their functions, including molecular function and the biological processes and pathways they participate in. The second application, then, is an expression data analysis service, where functional classification information can help find biological patterns in the data obtained from genome-wide experiments. The third application is a coding single-nucleotide polymorphism scoring service. In this case, information about evolutionarily related proteins is used to assess the likelihood of a deleterious effect on protein function arising from a single substitution at a specific amino acid position in the protein. All three web services are available at http://www.pantherdb.org/tools.
Project description:To investigate the genetic basis of the Rh polypeptide gene, we attempted the isolation of cDNA clones for Rh polypeptide from a family with the RhD-positive and RhD-negative phenotypes, using reverse transcription (RT)-PCR on each individual's reticulocyte RNA followed by subcloning. The isolated cDNAs showed the existence of another Rh-related clone (RhPII-1 cDNA, tentative designation) besides the RhPI and RhPII cDNA clones reported previously by us. The RhPII-1 cDNA had a single nucleotide substitution with one amino acid substitution compared with the RhPII cDNA: a C→T substitution at nucleotide 380, changing codon 127 from GCG to GTG (Ala→Val). The RhPI, RhPII, and RhPII-1 cDNA clones were detected in all individuals by the PCR experiment. This suggests that the Rh polypeptide genes have been inherited from parents and might be highly polymorphic. The PCR amplification of an RhPII-specific region from reticulocyte RNA and genomic DNA in all family members proved that the RhPII gene exists in both RhD-positive and RhD-negative individuals. By Southern-blot analysis of the DNAs from the family, two independent polymorphisms concerning the RhC/c and RhD/d phenotypes were observed. These results demonstrate that the RhPI and RhPII genes are also present in the RhD-negative donors, and the RhPII-related cDNAs encode not the RhD, but the RhC/c and/or E/e, polypeptides.
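The reported substitution can be checked arithmetically with the standard genetic code. The sketch below assumes that nucleotide numbering starts at the first base of the start codon, which the abstract does not state explicitly; the helper names `codon_of_nucleotide` and `substitute` are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG-ordered string.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_of_nucleotide(position):
    """Map a 1-based coding-sequence position to (codon number, position in codon)."""
    return (position - 1) // 3 + 1, (position - 1) % 3 + 1


def substitute(codon, position_in_codon, base):
    """Return the codon with the base at the 1-based position replaced."""
    return codon[: position_in_codon - 1] + base + codon[position_in_codon:]


codon_number, position_in_codon = codon_of_nucleotide(380)
mutant_codon = substitute("GCG", position_in_codon, "T")
print(codon_number, position_in_codon, mutant_codon,
      CODON_TABLE["GCG"], CODON_TABLE[mutant_codon])
```

Under that numbering assumption, nucleotide 380 is the second base of codon 127, and replacing the second base of GCG with T gives GTG, consistent with the Ala→Val change described above.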
Project description:D antigen is the most important and immunogenic antigen of the Rh blood group. The RhD-negative phenotype has different genetic backgrounds with variable distribution in different populations. Hybrid Rhesus box, resulting from RHD gene deletion, is used in genotyping studies of the Rh blood group as a marker to identify the RHD gene deletion. This study identified, for the first time, genetic mechanisms for the occurrence of the RhD-negative phenotype among the Iranian population. A total of 200 RhD-negative blood donors were randomly selected from Tehran Blood Transfusion Center. The phenotype of D, C, E, e and c antigens was serologically identified, and DNA was extracted from buffy coat. The molecular analysis of hybrid Rhesus box was performed by PCR-SSP and PCR-RFLP. Moreover, the presence of different exons of the RHD gene was investigated by real-time PCR on extracted DNA. Hybrid Rhesus box was detected in all samples, and PCR-RFLP confirmed that 198 (99%) were homozygous for an RHD gene deletion and 2 were heterozygous for hybrid Rhesus box, of which one (0.5%) had a weak D type 11 and the other (0.5%) had an RHD-CE (2-9)-D 2 hybrid allele. Similar to Caucasians, the frequency of RHD gene deletion was high among the Iranian population studied in this investigation, so hybrid Rhesus box can be used as an efficient marker to detect RHD gene deletion in our population.
Project description:A frameshift (fs) mutation in the natriuretic peptide precursor A (NPPA) gene, encoding a mutant atrial natriuretic peptide (Mut-ANP), has been linked with familial atrial fibrillation (AF) but the underlying mechanisms by which the mutation causes AF remain unclear. We engineered 2 transgenic (TG) mouse lines expressing the wild-type (WT)-NPPA gene (H-WT-NPPA) and the human fs-Mut-NPPA gene (H-fsMut-NPPA) to test the hypothesis that mice overexpressing the human NPPA mutation are more susceptible to AF and elucidate the underlying electrophysiologic and molecular mechanisms. Transthoracic echocardiography and surface electrocardiography (ECG) were performed in H-fsMut-NPPA, H-WT-NPPA, and Non-TG mice. Invasive electrophysiology, immunohistochemistry, Western blotting and patch clamping of membrane potentials were performed. To examine the role of the Mut-ANP in ion channel remodeling, we measured plasma cyclic guanosine monophosphate (cGMP) and cyclic adenosine monophosphate (cAMP) levels and protein kinase A (PKA) activity in the 3 groups of mice. In H-fsMut-NPPA mice mean arterial pressure (MAP) was reduced when compared to H-WT-NPPA and Non-TG mice. Furthermore, injection of synthetic fs-Mut-ANP lowered the MAP in H-WT-NPPA and Non-TG mice while synthetic WT-ANP had no effect on MAP in the 3 groups of mice. ECG characterization revealed significantly prolonged QRS duration in H-fsMut-NPPA mice when compared to the other two groups. Trans-Esophageal (TE) atrial pacing of H-fsMut-NPPA mice showed increased AF burden and AF episodes when compared with H-WT-NPPA or Non-TG mice. The cardiac Na+ (NaV1.5) and Ca2+ (CaV1.2/CaV1.3) channel expression and currents (INa, ICaL) and action potential durations (APD90/APD50/APD20) were significantly reduced in H-fsMut-NPPA mice while the rectifier K+ channel current (IKs) was markedly increased when compared to the other 2 groups of mice. 
In addition, plasma cGMP levels were only increased in H-fsMut-NPPA mice with a corresponding reduction in plasma cAMP levels and PKA activity. In summary, we showed that mice overexpressing an AF-linked NPPA mutation are more prone to develop AF and this risk is mediated in part by remodeling of the cardiac Na+, Ca2+ and K+ channels creating an electrophysiologic substrate for reentrant AF.
Project description:Background: RNA viruses possess remarkable evolutionary versatility driven by the high mutability of their genomes. Frameshifting nucleotide insertions or deletions (indels), which cause the premature termination of proteins, are frequently observed in the coding sequences of various viral genomes. When a secondary indel occurs near the primary indel site, the open reading frame can be restored to produce functional proteins, a phenomenon known as the compensatory frameshift. Results: In this study, we systematically analyzed publicly available viral genome sequences and identified compensatory frameshift events in hundreds of viral protein-coding sequences. Compensatory frameshift events resulted in large-scale amino acid differences between the compensatory frameshift form and the wild type even though their nucleotide sequences were almost identical. Phylogenetic analyses revealed that the evolutionary distances between proteins with and without a compensatory frameshift were significantly overestimated because amino acid mismatches caused by compensatory frameshifts were counted as substitutions. Further, this could cause compensatory frameshift forms to branch in different locations in the protein and nucleotide trees, which may obscure the correct interpretation of phylogenetic relationships between variant viruses. Conclusions: Our results imply that the compensatory frameshift is one of the mechanisms driving the rapid protein evolution of RNA viruses and potentially assisting their host-range expansion and adaptation.
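The compensatory frameshift described above can be illustrated on a toy coding sequence. The sketch below is not the study's pipeline: the 30-nucleotide sequence, the indel positions, and the helper name `translate` are invented for illustration, and translation uses the standard genetic code.

```python
# Standard genetic code, built from the conventional TCAG-ordered string.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate every complete codon; stop codons are written as '*'."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


# Hypothetical wild-type sequence (10 codons, ending in a stop codon).
wild_type = "ATGGCTGAACCGAAAGGTTCTCACTGGTAA"

# Primary indel: delete one base, which shifts the downstream reading frame.
frameshifted = wild_type[:6] + wild_type[7:]

# Secondary indel nearby: insert one base, which restores the original frame.
compensated = frameshifted[:15] + "C" + frameshifted[15:]

wt_protein = translate(wild_type)
fs_protein = translate(frameshifted)
comp_protein = translate(compensated)

# Position-wise comparison counts every residue in the shifted window as a substitution.
aa_mismatches = sum(x != y for x, y in zip(wt_protein, comp_protein))
print(wt_protein, fs_protein, comp_protein, aa_mismatches)
```

In this toy case two single-base indels change several residues between the indel sites while the rest of the protein is unchanged, which is the kind of mismatch a substitution-only distance would count as multiple independent amino acid replacements.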