Plant Proteins Are Smaller Because They Are Encoded by Fewer Exons than Animal Proteins.
ABSTRACT: Protein size is an important biochemical feature since longer proteins can harbor more domains and therefore can display more biological functionalities than shorter proteins. We found remarkable differences in protein length, exon structure, and domain count among different phylogenetic lineages. While eukaryotic proteins have an average size of 472 amino acid residues (aa), average protein sizes in plant genomes are smaller than those of animals and fungi. Proteins unique to plants are ~81 aa shorter than plant proteins conserved among other eukaryotic lineages. The smaller average size of plant proteins could be explained neither by endosymbiosis, subcellular compartmentation, nor exon size, but rather by exon number. Metazoan proteins are encoded on average by ~10 exons of small size [~176 nucleotides (nt)]. Streptophyta have on average only ~5.7 exons of medium size (~230 nt). Multicellular species code for large proteins by increasing the exon number, while most unicellular organisms employ rather larger exons (>400 nt). Among subcellular compartments, membrane proteins are the largest (~520 aa), whereas the smallest proteins correspond to the gene ontology group of ribosome (~240 aa). Plant genes are encoded by half the number of exons and also contain fewer domains than animal proteins on average. Interestingly, endosymbiotic proteins that migrated to the plant nucleus became larger than their cyanobacterial orthologs. We thus conclude that plants have proteins larger than bacteria but smaller than animals or fungi. Compared to the average of eukaryotic species, plants have ~34% more but ~20% smaller proteins. This suggests that photosynthetic organisms are unique and therefore deserve special attention with regard to the evolutionary forces acting on their genomes and proteomes.
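The exon-number argument above can be checked with a rough calculation: multiplying the reported mean exon count by the reported mean exon size and dividing by three gives an implied protein length. The sketch below is only a back-of-envelope illustration, not the authors' analysis. The function name `implied_protein_length` is hypothetical, and the calculation assumes that every exon nucleotide is coding and that the product of two averages approximates the average of the products, neither of which holds exactly.

```python
# Back-of-envelope check using only the mean values quoted in the abstract.
# Assumption: all exon nucleotides are coding (untranslated regions ignored).

def implied_protein_length(mean_exons: float, mean_exon_nt: float) -> float:
    """Approximate protein length in aa from mean exon count and mean exon size."""
    return mean_exons * mean_exon_nt / 3.0

# (mean exon count, mean exon size in nt) as reported in the abstract
lineages = {
    "Metazoa": (10.0, 176.0),
    "Streptophyta": (5.7, 230.0),
}

for name, (n_exons, exon_nt) in lineages.items():
    print(f"{name}: ~{implied_protein_length(n_exons, exon_nt):.0f} aa")
```

Under these assumptions the implied lengths are about 587 aa for Metazoa and about 437 aa for Streptophyta. The larger plant exons therefore only partly offset the lower exon count, which is consistent with the abstract attributing the size difference to exon number rather than exon size.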
Project description:To investigate the distribution of intron-exon structures of eukaryotic genes, we have constructed a general exon database comprising all available intron-containing genes and exon databases from 10 eukaryotic model organisms: Homo sapiens, Mus musculus, Gallus gallus, Rattus norvegicus, Arabidopsis thaliana, Zea mays, Schizosaccharomyces pombe, Aspergillus, Caenorhabditis elegans and Drosophila. We purged redundant genes to avoid the possible bias brought about by redundancy in the databases. After discarding those questionable introns that do not contain correct splice sites, the final database contained 17 102 introns, 21 019 exons and 2903 independent or quasi-independent genes. On average, a eukaryotic gene contains 3.7 introns per kb protein coding region. The exon distribution peaks around 30-40 residues and most introns are 40-125 nt long. The variable intron-exon structures of the 10 model organisms reveal two interesting statistical phenomena, which cast light on some previous speculations. (i) Genome size seems to be correlated with total intron length per gene. For example, invertebrate introns are smaller than those of human genes, while yeast introns are shorter than invertebrate introns. However, this correlation is weak, suggesting that other factors besides genome size may also affect intron size. (ii) Introns smaller than 50 nt are significantly less frequent than longer introns, possibly resulting from a minimum intron size requirement for intron splicing.
Project description:BACKGROUND: The origin and importance of exon-intron architecture is one of the remaining mysteries of gene evolution. Several studies have investigated the variations of intron length, GC content, ordinal position in a gene and divergence. However, few studies have examined the structural variation of exons and introns together. RESULTS: We investigated the length, GC content, ordinal position and divergence in both exons and introns of 13 eukaryotic genomes, representing plants and animals. Our analyses revealed that three basic patterns of exon-intron variation were present in nearly all analyzed genomes (P < 0.001 in most cases): an ordinal reduction of length and divergence in both exons and introns, a co-variation between an exon and its flanking introns in their length, GC content and divergence, and a decrease of average exon (or intron) length, GC content and divergence as the total number of exons in a gene increased. In addition, we observed that the shorter introns had either low or high GC content, and the GC content of long introns was intermediate. CONCLUSION: Although the factors contributing to these patterns have not been identified, our results provide three important clues: common factor(s) exist and may shape both exons and introns; the ordinal reduction patterns may reflect a time-orderly evolution; and the larger first and last exons may be required for splicing. These clues provide a framework for elucidating mechanisms involved in the organization of eukaryotic genomes and particularly in building exon-intron structures.
Project description:A full-length inducible nitric oxide synthase (iNOS) gene has been sequenced for the first time outside the mammals, and the gene organization compared with that already determined for human iNOS. While there are some differences from the human gene, overall the exons show remarkable conservation in sequence and organization. As in human, the trout iNOS gene has 27 exons, with 18 of the trout exons being identical in size with the equivalent human exons. The cofactor-binding domains are found in the same exons and in some cases are absolutely conserved. Differences include the start of the ORF in exon 3 instead of exon 2, resulting in a deletion at the 5' end of the trout iNOS protein. Exon 27 also shows a large difference in size and although the trout exon is larger this is due to the length of the 3'-UTR. Several non-mammalian features are notable, and include a conserved potential glycosylation site in chicken and fish, and an insertion at the boundary of exons 20 and 21 in fish. The intron sizes in trout were generally much smaller than in human iNOS, making the trout iNOS gene approximately half the size of the human gene. Analysis of RNA secondary structure revealed two regions with complementarity, which could interfere with reverse transcription. Using a trout fibroblast cell line (RTG-2 cells), it was shown by reverse transcriptase (RT)-PCR that virus infection was a good inducer of iNOS expression. However, when using a combination of Superscript™ II for reverse transcription and primers at the 5' end of the gene only very weak products were amplified, in contrast with the situation when primers at the 3' end of the gene were used, or ThermoScript™-derived cDNA was used. The impact of such results on RT-PCR analysis of iNOS expression in trout is discussed.
Project description:BACKGROUND AND AIMS:Is there selection minimizing the costs of ovule production? Such selection should lead to a smaller ovule size in relation to seed size and, at the same time, smaller variation in ovule size within plants, the latter because the minimum structures and resources for functioning of ovules should be the same among ovules. Additionally, within species, ovule size should not depend on the plant's resource status. METHODS:To confirm these predictions, we examined ovule and seed production for a variety of species. KEY RESULTS:Among the 27 species studied, we found a significant negative dependence of the species mean of the coefficient of variation for plant ovule size on the ratio of the mean species seed size/mean species ovule size. Thus, the smaller the ovule size as compared with seed size, the smaller the degree of variation in ovule size. Among the 49 species studied, only two species showed significant positive dependence of mean ovule size on plant size. Although larger plants should have greater resources for ovule production, selection has not enhanced the production of large ovules in most species. CONCLUSIONS:These results suggest that there is selection minimizing the costs of ovule production.
Project description:The ribosome, as a catalyst for protein synthesis, is universal and essential for all organisms. Here we describe the structure of the genes encoding human ribosomal proteins (RPs) and compare this class of genes among several eukaryotes. Using genomic and full-length cDNA sequences, we characterized 73 RP genes and found that (1) transcription starts at a C residue within a characteristic oligopyrimidine tract; (2) the promoter region is GC rich, but often has a TATA box or similar sequence element; (3) the genes are small (4.4 kb), but have as many as 5.6 exons on average; (4) the initiator ATG is in the first or second exon and is within ±5 bp of the first intron boundaries in about half of cases; and (5) 5'- and 3'-UTRs are significantly smaller (42 bp and 56 bp, respectively) than the genome average. Comparison of RP genes from humans, Drosophila melanogaster, Caenorhabditis elegans, and Saccharomyces cerevisiae revealed the coding sequences to be highly conserved (63% homology on average), although gene size and the number of exons vary. The positions of the introns are also conserved among these species as follows: 44% of human introns are present at the same position in either D. melanogaster or C. elegans, suggesting RP genes are highly suitable for studying the evolution of introns.
Project description:Expansins are cell-wall-loosening proteins that induce stress relaxation and extension of plant cell walls. To evaluate their hypothesized role in cell growth, we genetically manipulated expansin gene expression in Arabidopsis thaliana and assessed the consequent changes in growth and cell-wall properties. Various combinations of promoters were used to drive antisense and sense sequences of AtEXP10, which is maximally expressed in the growing leaf and at the base of the pedicel. Compared with controls, antisense lines had smaller rosettes because of shorter petioles and leaf blades and often acquired a twisted leaf morphology. Petiole cells from antisense plants were smaller than controls and their cell walls were significantly less extensible in vitro. Sense plants had slightly longer petioles, larger leaf blades, and larger cells than controls. Abscission at the base of the pedicel, where AtEXP10 is endogenously expressed, was enhanced in sense plants but reduced in antisense lines. These results support the concept that expansins function endogenously as cell-wall-loosening agents and indicate that expansins have versatile developmental roles that include control of organ size, morphology, and abscission.
Project description:BACKGROUND: The sizes of proteins are relevant to their biochemical structure and for their biological function. The statistical distribution of protein lengths across a diverse set of taxa can provide hints about the evolution of proteomes. RESULTS: Using the full genomic sequences of over 1,302 prokaryotic and 140 eukaryotic species two datasets containing 1.2 and 6.1 million proteins were generated and analyzed statistically. The lengthwise distribution of proteins can be roughly described with a gamma type or log-normal model, depending on the species. However, the shape parameter of the gamma model does not have a fixed value of 2, as previously suggested, but varies between 1.5 and 3 in different species. A gamma model with unrestricted shape parameter described best the distributions in ~48% of the species, whereas the log-normal distribution described better the observed protein sizes in 42% of the species. The gamma restricted function and the sum of exponentials distribution had a better fitting in only ~5% of the species. Eukaryotic proteins have an average size of 472 aa, whereas bacterial (320 aa) and archaeal (283 aa) proteins are significantly smaller (33-40% on average). Average protein sizes in different phylogenetic groups were: Alveolata (628 aa), Amoebozoa (533 aa), Fornicata (543 aa), Placozoa (453 aa), Eumetazoa (486 aa), Fungi (487 aa), Stramenopila (486 aa), Viridiplantae (392 aa). Amino acid composition is biased according to protein size. Protein length correlated negatively with %C, %M, %K, %F, %R, %W, %Y and positively with %D, %E, %Q, %S and %T. Prokaryotic proteins had a different protein size bias for %E, %G, %K and %M as compared to eukaryotes. CONCLUSIONS: Mathematical modeling of protein length empirical distributions can be used to assess the quality of small ORFs annotation in genomic releases (detection of too many false positive small ORFs).
There is a negative correlation between average protein size and total number of proteins among eukaryotes but not in prokaryotes. The %GC content is positively correlated to total protein number and protein size in prokaryotes but not in eukaryotes. Small proteins have a different amino acid bias than larger proteins. Compared to prokaryotic species, the evolution of eukaryotic proteomes was characterized by increased protein number (massive gene duplication) and substantial changes of protein size (domain addition/subtraction).
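The comparison of gamma and log-normal models for protein lengths described in this abstract can be illustrated with a small, self-contained sketch. It is not the authors' procedure. It assumes a method-of-moments fit for the gamma model (chosen here only because it needs no numerical optimizer), a closed-form maximum-likelihood fit for the log-normal model, and synthetic lengths drawn from a gamma distribution rather than a real proteome. All function names are hypothetical.

```python
import math
import random
from statistics import fmean, pvariance

def fit_gamma_moments(lengths):
    """Method-of-moments gamma fit: shape = mean^2 / var, scale = var / mean."""
    m = fmean(lengths)
    v = pvariance(lengths, m)
    return m * m / v, v / m

def gamma_loglik(lengths, shape, scale):
    """Total log-likelihood of the lengths under a gamma(shape, scale) model."""
    const = math.lgamma(shape) + shape * math.log(scale)
    return sum((shape - 1.0) * math.log(x) - x / scale - const for x in lengths)

def fit_lognormal(lengths):
    """Closed-form maximum-likelihood log-normal fit on the log-lengths."""
    logs = [math.log(x) for x in lengths]
    mu = fmean(logs)
    return mu, math.sqrt(pvariance(logs, mu))

def lognormal_loglik(lengths, mu, sigma):
    """Total log-likelihood of the lengths under a log-normal(mu, sigma) model."""
    const = math.log(sigma) + 0.5 * math.log(2.0 * math.pi)
    return sum(
        -math.log(x) - const - (math.log(x) - mu) ** 2 / (2.0 * sigma ** 2)
        for x in lengths
    )

# Synthetic stand-in for one proteome (assumption: gamma, shape 2, mean 400 aa).
random.seed(0)
synthetic = [random.gammavariate(2.0, 200.0) for _ in range(5000)]

shape, scale = fit_gamma_moments(synthetic)
mu, sigma = fit_lognormal(synthetic)
ll_gamma = gamma_loglik(synthetic, shape, scale)
ll_lognormal = lognormal_loglik(synthetic, mu, sigma)

better = "gamma" if ll_gamma > ll_lognormal else "log-normal"
print(f"estimated gamma shape: {shape:.2f}; better-fitting model: {better}")
```

Both models have two free parameters, so comparing total log-likelihoods directly is a reasonable first pass. Applied per species to real protein lengths, the same comparison would show which model describes each proteome better and how far the estimated shape parameter departs from 2.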
Project description:BACKGROUND: A positive relationship between genome size and intron length is observed across eukaryotes including angiosperm plants, indicating a co-evolution of genome size and gene structure. Conifers have very large genomes and longer introns on average than most plants, but the impacts of their large genomes and longer introns on gene structure have not been described. RESULTS: Gene structure was analyzed for 35 genes of Picea glauca obtained from BAC sequencing and genome assembly, including comparisons with A. thaliana, P. trichocarpa and Z. mays. We aimed to develop an understanding of the impact of long introns on the structure of individual genes. The number and length of exons were well conserved among the species compared but on average, P. glauca introns were longer and genes had four times more intronic sequence than Arabidopsis, and two times more than poplar and maize. However, pairwise comparisons of individual genes gave variable results and not all contrasts were statistically significant. Genes generally accumulated one or a few longer introns in species with larger genomes but the position of long introns was variable between plant lineages. In P. glauca, highly expressed genes generally had more intronic sequence than tissue preferential genes. Comparisons with the Pinus taeda BACs and genome scaffolds showed a high conservation for position of long introns and for sequence of short introns. A survey of 1836 P. glauca genes obtained by sequence capture mostly containing introns <1 Kbp showed that repeated sequences were 10× more abundant in introns than in exons. CONCLUSION: Conifers have large amounts of intronic sequence per gene for seed plants due to the presence of a few long introns, and repetitive element sequences are ubiquitous in their introns. Results indicate a complex landscape of intron sizes and distribution across taxa and between genes with different expression profiles.
Project description:Very short exons, also known as micro-exons, occur in large numbers in some eukaryotic genomes. Existing annotation tools have a limited ability to recognize these short sequences, which range in length up to 25 bp. Here, we describe a computational method for the identification of micro-exons using near-perfect alignments between cDNA and genomic DNA sequences. Using this method, we detected 319 micro-exons in 4 complete genomes, of which 224 were previously unknown: human (170), the nematode Caenorhabditis elegans (4), the fruit fly Drosophila melanogaster (14), and the mustard plant Arabidopsis thaliana (36). Comparison of our computational method with popular cDNA alignment programs shows that the new algorithm is both efficient and accurate. The algorithm also aids in the discovery of micro-exon-skipping events and cross-species micro-exon conservation.
Project description:Non-coding mutations can create splice sites; however, the true extent to which such somatic non-coding mutations affect RNA splicing is largely unexplored. Here we use the MiSplice pipeline to analyze 783 cancer cases with WGS data and 9494 cases with WES data, discovering 562 non-coding mutations that lead to splicing alterations. Notably, most of these mutations create new exons. Introns associated with new exon creation are significantly larger than the genome-wide average intron size. We find that some mutation-induced splicing alterations are located in genes important in tumorigenesis (ATRX, BCOR, CDKN2B, MAP3K1, MAP3K4, MDM2, SMAD4, STK11, TP53 etc.), often leading to truncated proteins and affecting gene expression. The pattern emerging from these exon-creating mutations suggests that splice sites created by non-coding mutations interact with pre-existing potential splice sites that originally lacked a suitable splicing pair to induce new exon formation. Our study suggests the importance of investigating biological and clinical consequences of noncoding splice-inducing mutations that were previously neglected by conventional annotation pipelines. MiSplice will be useful for automatically annotating the splicing impact of coding and non-coding mutations in future large-scale analyses.