Accurate annotation of human protein-coding small open reading frames.
ABSTRACT: Functional protein-coding small open reading frames (smORFs) are emerging as an important class of genes. However, the number of translated smORFs in the human genome is unclear because proteogenomic methods are not sensitive enough, and, as we show, Ribo-seq strategies require additional measures to ensure comprehensive and accurate smORF annotation. Here, we integrate de novo transcriptome assembly and Ribo-seq into an improved workflow that overcomes obstacles with previous methods, to more confidently annotate thousands of smORFs. Evolutionary conservation analyses suggest that hundreds of smORF-encoded microproteins are likely functional. Additionally, many smORFs are regulated during fundamental biological processes, such as cell stress. Peptides derived from smORFs are also detectable on human leukocyte antigen complexes, revealing smORFs as a source of antigens. Thus, by including additional validation into our smORF annotation workflow, we accurately identify thousands of unannotated translated smORFs that will provide a rich pool of unexplored, functional human genes.
Project description:Protein-coding small open reading frames (smORFs) are emerging as an important class of genes; however, the coding capacity of smORFs in the human genome is unclear. By integrating de novo transcriptome assembly and Ribo-Seq, we confidently annotate thousands of novel translated smORFs in three human cell lines. We find that smORF translation prediction is noisier than for annotated coding sequences, underscoring the importance of analyzing multiple experiments and footprinting conditions. These smORFs are located within non-coding and antisense transcripts, the UTRs of mRNAs, and unannotated transcripts. Analysis of RNA levels and translation efficiency during cellular stress identifies regulated smORFs and provides an approach for identifying smORFs for further investigation. Sequence conservation and signatures of positive selection indicate that encoded microproteins are likely functional. Additionally, proteomics data from enriched human leukocyte antigen complexes validate the translation of hundreds of smORFs and position them as a source of novel antigens. Thus, smORFs represent a significant number of important, yet unexplored human genes. Overall design: Annotation of protein-coding smORFs in 3 human cell lines using de novo transcriptome assembly and Ribo-Seq.
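The stress analysis mentioned above rests on translation efficiency, which is commonly computed as the ratio of ribosome footprint density to RNA abundance for the same ORF. The following is a minimal sketch of that ratio and its log2 change between two conditions. The function names, the pseudocount, and the toy values are assumptions made for illustration; they are not taken from the study.

```python
from math import log2


def translation_efficiency(ribo_density, rna_abundance, pseudocount=1.0):
    """Ratio of footprint density to RNA abundance (assumed definition).

    The pseudocount is an illustrative choice that avoids division by zero
    for ORFs with no RNA-seq signal.
    """
    return (ribo_density + pseudocount) / (rna_abundance + pseudocount)


def log2_te_change(control, stress, pseudocount=1.0):
    """log2 fold change in translation efficiency, stress versus control.

    Each argument is a (ribo_density, rna_abundance) pair for one ORF.
    """
    te_control = translation_efficiency(control[0], control[1], pseudocount)
    te_stress = translation_efficiency(stress[0], stress[1], pseudocount)
    return log2(te_stress / te_control)


# Hypothetical smORF: footprints rise while RNA falls under stress.
change = log2_te_change(control=(7.0, 3.0), stress=(15.0, 1.0))
print(change)
```

A positive value indicates that the ORF is translated more efficiently under stress than RNA levels alone would predict; a real analysis would also need replicates and a statistical test.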
Project description:Background: A small open reading frame (smORF) is an open reading frame with a length of less than 100 codons. Microproteins, translated from smORFs, have been found to participate in a variety of biological processes such as muscle formation and contraction, cell proliferation, and immune activation. Although previous studies have collected and annotated a large number of smORFs, the functions of the vast majority of smORFs are still unknown. It is thus increasingly important to develop computational methods to annotate the functions of these smORFs. Results: In this study, we collected 617,462 unique smORFs from three studies. The expression of smORF RNAs was estimated using reannotated microarray probes. Using a speed-optimized correlation algorithm, the functions of smORFs were predicted from their correlated genes with known functional annotations. When applied to 5 known microproteins from the literature, our method successfully predicted their functions. Further validation against the UniProt database showed that at least one function was predicted for 202 out of 270 microproteins. Conclusions: We developed a method, smORFunction, to provide function predictions of smORFs/microproteins in at most 265 models generated from 173 datasets, including 48 tissues/cells, 82 diseases (and normal). The tool is available at https://www.cuilab.cn/smorfunction .
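The idea of predicting a smORF's function from correlated, annotated genes can be illustrated with a small guilt-by-association sketch. This is not the smORFunction implementation: the use of plain Pearson correlation, the top-k voting rule, the function names, and the toy expression profiles are all assumptions chosen to keep the example short.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: treat as uncorrelated
    return cov / sqrt(var_x * var_y)


def predict_functions(smorf_profile, gene_profiles, gene_functions, top_k=2):
    """Tally the function terms of the top_k genes most correlated with the smORF."""
    ranked = sorted(
        gene_profiles,
        key=lambda gene: pearson(smorf_profile, gene_profiles[gene]),
        reverse=True,
    )
    votes = Counter()
    for gene in ranked[:top_k]:
        votes.update(gene_functions.get(gene, []))
    return votes.most_common()


# Hypothetical expression profiles across four samples.
smorf = [1.0, 2.0, 3.0, 4.0]
genes = {
    "geneA": [2.0, 4.0, 6.0, 8.0],  # rises with the smORF
    "geneB": [1.0, 2.1, 2.9, 4.2],  # rises with the smORF, with noise
    "geneC": [4.0, 3.0, 2.0, 1.0],  # falls as the smORF rises
}
functions = {
    "geneA": ["muscle contraction"],
    "geneB": ["muscle contraction", "cell proliferation"],
    "geneC": ["immune activation"],
}
predictions = predict_functions(smorf, genes, functions)
print(predictions)
```

In practice the ranked gene list would more likely feed an enrichment test than a simple vote, and correlations would be computed per dataset rather than on a single toy profile.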
Project description:Thousands of small Open Reading Frames (smORFs) with the potential to encode small peptides of fewer than 100 amino acids exist in our genomes. However, the number of smORFs actually translated and their molecular and functional roles are still unclear. In this study, we present a genome-wide assessment of smORF translation by ribosomal profiling of polysomal fractions in Drosophila. We detect two types of smORFs bound by multiple ribosomes and thus undergoing productive translation. The 'longer' smORFs of around 80 amino acids resemble canonical proteins in translational metrics and conservation, and display a propensity to contain transmembrane motifs. The 'dwarf' smORFs are in general shorter (around 20 amino acids long), are mostly found in 5'-UTRs and non-coding RNAs, are less well conserved, and have no bioinformatic indicators of peptide function. Our findings indicate that thousands of smORFs are translated in metazoan genomes, reinforcing the idea that smORFs are an abundant and fundamental genome component.
Project description:Recent advances in mass spectrometry-based proteomics have revealed translation of previously nonannotated microproteins from thousands of small open reading frames (smORFs) in prokaryotic and eukaryotic genomes. Facile methods to determine cellular functions of these newly discovered microproteins are now needed. Here, we couple semiquantitative comparative proteomics with whole-genome database searching to identify two nonannotated, homologous cold shock-regulated microproteins in Escherichia coli K12 substr. MG1655, as well as two additional constitutively expressed microproteins. We apply molecular genetic approaches to confirm expression of these cold shock proteins (YmcF and YnfQ) at reduced temperatures and identify the noncanonical ATT start codons that initiate their translation. These proteins are conserved in related Gram-negative bacteria and are predicted to be structured, which, in combination with their cold shock upregulation, suggests that they are likely to have biological roles in the cell. These results reveal that previously unknown factors are involved in the response of E. coli to lowered temperatures and suggest that further nonannotated, stress-regulated E. coli microproteins may remain to be found. More broadly, comparative proteomics may enable discovery of regulated, and therefore potentially functional, products of smORF translation across many different organisms and conditions.
Project description:Computational, genomic, and proteomic approaches have been used to discover nonannotated protein-coding small open reading frames (smORFs). Some novel smORFs have crucial biological roles in cells and organisms, which motivates the search for additional smORFs. Proteomic smORF discovery methods are advantageous because they detect smORF-encoded polypeptides (SEPs) to validate smORF translation and SEP stability. Because SEPs are shorter and less abundant than average proteins, SEP detection using proteomics faces unique challenges. Here, we optimize several steps in the SEP discovery workflow to improve SEP isolation and identification. These changes have led to the detection of several new human SEPs (novel human genes), improved confidence in the SEP assignments, and enabled quantification of SEPs under different cellular conditions. These improvements will allow faster detection and characterization of new SEPs and smORFs.
Project description:Proteogenomics methods have identified many non-annotated protein-coding genes in the human genome. Many of the newly discovered protein-coding genes encode peptides and small proteins, referred to collectively as microproteins. Microproteins are produced through ribosome translation of small open reading frames (smORFs). The discovery of many smORFs reveals a blind spot in traditional gene-finding algorithms for these genes. Biological studies have found roles for microproteins in cell biology and physiology, and the possibility that additional bioactive microproteins exist drives the interest in detection and discovery of these molecules. A key step in any proteogenomics workflow is the assembly of RNA-Seq data into likely mRNA transcripts that are then used to create a searchable protein database. Here we demonstrate that specific features of the assembled transcriptome impact microprotein detection by shotgun proteomics. By tailoring transcript assembly for downstream mass spectrometry searching, we show that we can detect more than double the number of high-quality microprotein candidates and introduce a novel open-source mRNA assembler for proteogenomics (MAPS) that incorporates all of these features. By integrating our specialized assembler, MAPS, and a popular generalized assembler into our proteogenomics pipeline, we detect 45 novel human microproteins from a high-quality proteogenomics dataset of a human cell line. We then characterize the features of the novel microproteins, identifying two classes of microproteins. Our work highlights the importance of specialized transcriptome assembly upstream of proteomics validation when searching for short and potentially rare and poorly conserved proteins.
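The step of turning assembled transcripts into a searchable protein database can be sketched as a three-frame scan for short ORFs. This is not MAPS or any published pipeline. It assumes ATG-only starts, forward-strand scanning, the standard genetic code, and a cutoff of fewer than 100 codons; `find_smorfs` and the toy transcript are hypothetical names and data introduced for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def find_smorfs(transcript, max_codons=100, min_codons=2):
    """Return (start, end, peptide) for ATG-initiated ORFs shorter than max_codons.

    Coordinates are 0-based and end-exclusive; the end includes the stop codon.
    Codon counts exclude the stop codon. Only the given strand is scanned.
    """
    seq = transcript.upper().replace("U", "T")
    hits = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None:
                if codon == "ATG":
                    start = pos  # open an ORF at the first in-frame ATG
            elif CODON_TABLE.get(codon, "X") == "*":
                n_codons = (pos - start) // 3
                if min_codons <= n_codons < max_codons:
                    peptide = "".join(
                        CODON_TABLE.get(seq[i:i + 3], "X")
                        for i in range(start, pos, 3)
                    )
                    hits.append((start, pos + 3, peptide))
                start = None  # close the ORF and keep scanning this frame
    return hits


# Toy transcript containing one complete ORF: ATG GCT AAA TGA.
smorfs = find_smorfs("CCATGGCTAAATGACC")
print(smorfs)
```

The resulting peptides could be written out as FASTA entries for a mass spectrometry search; a real workflow would also need to consider non-ATG starts, both strands, and redundancy with annotated proteins.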
Project description:Micropeptides (≤100 amino acids) are essential regulators of physiological and pathological processes, which can be encoded by small open reading frames (smORFs) derived from long non-coding RNAs (lncRNAs). Recently, lncRNA-encoded micropeptides have been shown to have essential roles in tumorigenesis. Since translated smORF identification remains technically challenging, little is known of their pathological functions in cancer. Therefore, we created classifiers to identify translated smORFs derived from lncRNAs based on ribosome-protected fragment sequencing and machine learning methods. In total, 537 putative translated smORFs were identified and the coding potential of five smORFs was experimentally validated via green fluorescent protein-tagged protein generation and mass spectrometry. After analyzing 11 lncRNA expression profiles of seven cancer types, we identified one validated translated lncRNA, ZFAS1, which was significantly up-regulated in hepatocellular carcinoma (HCC). Functional studies revealed that ZFAS1 can promote cancer cell migration by elevating intracellular reactive oxygen species production by inhibiting nicotinamide adenine dinucleotide dehydrogenase expression, indicating that translated ZFAS1 may be an essential oncogene in the progression of HCC. In this study, we systematically identified translated smORFs derived from lncRNAs and explored their potential pathological functions in cancer to improve our comprehensive understanding of the building blocks of living systems.
Project description:smORFs are small open reading frames of less than 100 codons. Recent low-throughput experiments showed that many smORF-encoded peptides (SEPs) play crucial roles in processes such as regulation of transcription or translation, transport across membranes, and antimicrobial activity. To identify more functional SEPs, genome-wide prediction tools are needed to guide low-throughput experiments. In this study, we put forward a functional smORF-encoded peptide predictor (FSPP) that aims to predict authentic SEPs and their functions in a high-throughput manner. FSPP used the overlap of SEPs detected by Ribo-seq and mass spectrometry as target objects. With expression data at the transcription and translation levels, FSPP built two co-expression networks. Combining these with co-location relations, FSPP constructed a compound network and then annotated SEPs with the functions of adjacent nodes. Tested on 38 sequenced samples of 5 human cell lines, FSPP successfully predicted 856 out of 960 annotated proteins. Interestingly, FSPP also highlighted 568 functional SEPs from these samples. On comparison, the roles predicted by FSPP were consistent with known functions. These results suggest that FSPP is a reliable tool for the identification of functional small peptides. FSPP source code can be acquired at https://www.bioinfo.org/FSPP.
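The compound-network step described above (merge several relation networks, then annotate each SEP from its neighbors) can be sketched as follows. This is not the FSPP source code: the unweighted union of edges, the simple neighbor vote, and all node names and function terms are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_compound_network(*edge_lists):
    """Merge several undirected edge lists into a single adjacency map."""
    adjacency = defaultdict(set)
    for edges in edge_lists:
        for a, b in edges:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def annotate(node, adjacency, known_functions):
    """Tally the known functions of the node's direct neighbors."""
    votes = Counter()
    # Sort neighbors so that ties between terms resolve reproducibly.
    for neighbor in sorted(adjacency.get(node, ())):
        votes.update(known_functions.get(neighbor, ()))
    return votes.most_common()


# Hypothetical edges from the three relation types.
transcription_coexpression = [("SEP1", "TF_A"), ("SEP1", "TF_B")]
translation_coexpression = [("SEP1", "TF_A"), ("SEP2", "TRANSPORTER_X")]
colocation = [("SEP2", "TF_B")]

known = {
    "TF_A": ["regulation of transcription"],
    "TF_B": ["regulation of transcription"],
    "TRANSPORTER_X": ["membrane transport"],
}

network = build_compound_network(
    transcription_coexpression, translation_coexpression, colocation
)
sep1_terms = annotate("SEP1", network, known)
sep2_terms = annotate("SEP2", network, known)
print(sep1_terms)
print(sep2_terms)
```

Because edges are stored as sets, a pair supported by both co-expression networks counts once; weighting edges by the number of supporting relations would be one natural refinement.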
Project description:Microproteins are peptides and small proteins encoded by small open reading frames (smORFs). Newer technologies have led to the recent discovery of hundreds to thousands of new microproteins. The biological functions of a few microproteins have been elucidated, and these microproteins have fundamental roles in biology ranging from limb development to muscle function, highlighting the value of characterizing these molecules. The identification of microprotein-protein interactions (MPIs) has proven to be a successful approach to the functional characterization of these genes; however, traditional immunoprecipitation methods result in the enrichment of nonspecific interactions for microproteins. Here, we test and apply an in situ proximity tagging method that relies on an engineered ascorbate peroxidase 2 (APEX) to elucidate MPIs. The results demonstrate that APEX tagging is superior to traditional immunoprecipitation methods for microproteins. Furthermore, the application of APEX tagging to an uncharacterized microprotein called C11orf98 revealed that this microprotein interacts with nucleolar proteins nucleophosmin and nucleolin, demonstrating the ability of this approach to identify novel hypothesis-generating MPIs.
Project description:Prokaryotic genome annotation is heavily dependent on automated gene annotation pipelines that are prone to propagate errors and underestimate genome complexity. We describe an optimized proteogenomic workflow that uses ribosome profiling (ribo-seq) and proteomic data for Salmonella enterica serovar Typhimurium to identify unannotated proteins or alternative protein forms. This data analysis encompasses the searching of cofragmenting peptides and postprocessing with extended peptide-to-spectrum quality features, including comparison to predicted fragment ion intensities. When this strategy is applied, an enhanced proteome depth is achieved, as well as greater confidence for unannotated peptide hits. We demonstrate the general applicability of our pipeline by reanalyzing public Deinococcus radiodurans data sets. Taken together, our results show that systematic reanalysis using available prokaryotic (proteome) data sets holds great promise to assist in experimentally based genome annotation. IMPORTANCE: Delineation of open reading frames (ORFs) causes persistent inconsistencies in prokaryote genome annotation. We demonstrate that by advanced (re)analysis of omics data, a higher proteome coverage and sensitive detection of unannotated ORFs can be achieved, which can be exploited for conditional bacterial genome (re)annotation, which is especially relevant in view of annotating the wealth of sequenced prokaryotic genomes obtained in recent years.