Dataset Information

Generation of ENSEMBL-based proteogenomics databases boosts the identification of novel peptides

ABSTRACT: This study presents a novel bioinformatics tool, pypgatk, and the pgdb workflow for creating proteogenomics databases based on ENSEMBL resources. The tools allow the generation of protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs, and other non-canonical transcripts, such as those produced by alternative splicing events. They also include exonic out-of-frame translation from otherwise canonical protein-coding mRNAs. Moreover, the tools enable the generation of variant protein sequences from multiple sources of genomic variants, including COSMIC, cBioPortal, gnomAD, and mutations detected by sequencing patient samples. pypgatk and pgdb provide multiple functionalities for database handling, notably optimized target/decoy generation using the DecoyPyrat algorithm. Finally, we reanalyse four public datasets in PRIDE by generating cell-type-specific databases for 65 cell lines using the pypgatk and pgdb workflow, revealing a wealth of non-canonical or cryptic peptides amounting to more than 10% of the total number of peptides identified (43,501 out of 402,512).
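The three-frame translation mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the pypgatk implementation: the function names `translate` and `three_frame_translation` are hypothetical, the standard genetic code is assumed, and only the forward strand is translated.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: NCBI table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str, frame: int = 0) -> str:
    """Translate a nucleotide sequence starting at offset `frame` (0, 1 or 2).

    Stop codons are written as '*'; codons with unknown bases become 'X'.
    """
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(seq: str) -> list[str]:
    """Return the translations of the three forward reading frames."""
    return [translate(seq, frame) for frame in range(3)]


if __name__ == "__main__":
    # Hypothetical toy transcript, not taken from the dataset.
    for frame, protein in enumerate(three_frame_translation("ATGGCCTGA")):
        print(frame, protein)
```

In a proteogenomics setting, each frame would typically be split at stop codons and filtered by length before being added to a search database; those steps are omitted here.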

INSTRUMENT(S):

ORGANISM(S): Homo sapiens (human)

TISSUE(S): Lung

DISEASE(S): Lung Adenocarcinoma

SUBMITTER: Yasset Perez-Riverol

LAB HEAD: Yasset Perez-Riverol

PROVIDER: PXD029360 | PRIDE | 2021-10-26

REPOSITORIES: PRIDE

ACCESS DATA

Dataset's files

	File	Format
	000228_A01_P001360_B00A_A00_R1.mzML.gz	mzML
	000228_A02_P001360_B00I_A00_R1.mzML.gz	mzML
	000228_A03_P001359_B00E_A00_R1.mzML.gz	mzML
	000228_A04_P001358_B00A_A00_R1.mzML.gz	mzML
	000228_A05_P001358_B00I_A00_R1.mzML.gz	mzML

Showing files 1 - 5 of 1138.

Publications

Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides.

Umer HM, Audain E, Zhu Y, Pfeuffer J, Sachsenberg T, Lehtiö J, Branca RM, Perez-Riverol Y

Bioinformatics (Oxford, England), 2022-02-01, issue 5

Summary: We have implemented the pypgatk package and the pgdb workflow to create proteogenomics databases based on ENSEMBL resources. The tools allow the generation of protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs and other non-canonical transcripts, such as those produced by alternative splicing events. It also includes exonic out-of-frame translation from otherwise canonical protein-coding mRNAs. Moreover, the too ...

PMID: 34904638

Similar Datasets

Project description: Arginine-rich mixed charge domains (R-MCDs) contribute to and alter the properties of nuclear speckles. We are interested in how this affects the retention of poly-adenylated mRNAs in the nucleus. This experiment tests how the expression of the R-MCD of PPIG influences the nuclear-cytoplasmic distribution of mRNAs over time. Specifically, a HeLa cell line in which PPIG expression is driven by a doxycycline-inducible promoter was used, and doxycycline was added for 0, 4, 8 or 12 hours. This time course was performed in triplicate. Nuclear and cytoplasmic fractions were collected from the cells and RNA was extracted. 3' end sequencing libraries were produced by fragmenting the RNA and introducing Illumina adapters via an oligo-dT-primed reverse transcription and template-switching oligo (TSO) approach. The data indicate that mRNAs containing long, multivalent GA-rich regions in their coding sequences are more strongly retained in the nucleus over time following expression of the R-MCD. The files provided in this accession have not been trimmed to remove adapters, 3' poly-A sequences, UMIs or G-stretches from TSOs.
The data were analysed with nf-core/rnaseq using the following command:

nextflow run nf-core/rnaseq \
  --input samplesheet.csv \
  --fasta '/camp/lab/ulej/home/users/farawar/genomes/hs/fasta/GRCh38.primary_assembly.genome.fa' \
  --gtf '/camp/lab/ulej/home/users/farawar/genomes/hs/annotation/gencode.v29.annotation.gtf' \
  --salmon_index '/camp/lab/ulej/home/users/farawar/genomes/hs/salmon_index/salmon_index' \
  --gencode \
  --pseudo_aligner salmon \
  --with_umi \
  --umitools_bc_pattern NNNNN \
  --clip_r1 5 \
  -resume \
  -profile crick \
  --outdir nf-core-results \
  -c extraconfig.config

The extra config file contained:

withName: '.*:QUANTIFY_STAR_SALMON:SALMON_QUANT' { ext.args = '--noLengthCorrection' }

This removes the first 5 nucleotides of each read as UMI sequences and the following 5 nucleotides, which contain G-stretches from the template-switching oligo.
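The read layout implied by `--umitools_bc_pattern NNNNN` and `--clip_r1 5` can be sketched in Python as follows. This only illustrates the described trimming (5 nt UMI, then 5 nt clipped) and does not reproduce the pipeline's internal tools; the function name `split_read` and the example read are hypothetical.

```python
def split_read(read: str, umi_len: int = 5, clip_len: int = 5) -> tuple[str, str]:
    """Split a raw read into its UMI and the insert kept for quantification.

    The first `umi_len` bases are treated as the UMI, and the next `clip_len`
    bases (assumed to hold the TSO-derived G-stretch) are discarded.
    """
    if len(read) < umi_len + clip_len:
        raise ValueError("read is shorter than the UMI plus the clipped region")
    umi = read[:umi_len]
    insert = read[umi_len + clip_len:]
    return umi, insert


if __name__ == "__main__":
    # Hypothetical read: 5 nt UMI + 5 nt G-stretch + insert.
    umi, insert = split_read("ACGTAGGGGGTTTCCCAAA")
    print(umi, insert)
```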