Dataset Information

ESPRESSO: Robust discovery and quantification of transcript isoforms from error-prone long-read RNA-seq data

ABSTRACT: Long-read RNA sequencing (RNA-seq) holds great potential for characterizing transcriptome variation and full-length transcript isoforms, but the relatively high error rate of current long-read sequencing platforms poses a major challenge. We present ESPRESSO, a computational tool for robust discovery and quantification of transcript isoforms from error-prone long reads. ESPRESSO jointly considers alignments of all long reads aligned to a gene and uses error profiles of individual reads to improve the identification of splice junctions and the discovery of their corresponding transcript isoforms. On both a synthetic spike-in RNA sample and human RNA samples, ESPRESSO outperforms multiple contemporary tools in not only transcript isoform discovery but also transcript isoform quantification. In total, we generated and analyzed ~1.1 billion nanopore RNA-seq reads covering 30 human tissue samples and three human cell lines. ESPRESSO and its companion dataset provide a useful resource for studying the RNA repertoire of eukaryotic transcriptomes.

ORGANISM(S): Homo sapiens

PROVIDER: GSE192955 | GEO | 2022/10/14

REPOSITORIES: GEO

Similar Datasets

Project description: Long-read sequencing has become a powerful tool for alternative splicing analysis. However, technical and computational challenges have limited our ability to couple long-read sequencing with single cell and spatial barcoding to explore alternative splicing in the single cell and spatial setting. Though Nanopore-based long-read sequencing has been widely applied to single cell and spatially barcoded libraries in recent research, technical issues that hinder accurate isoform-level quantification are not well addressed in such settings. First, the relatively high sequencing error of Nanopore long reads limits the accuracy of cell barcode and unique molecular identifier (UMI) recovery, a necessary first step in the analysis of single cell/spatial sequencing data. Second, read truncation and mapping errors, the latter exacerbated by the higher sequencing error rates, lead to the false detection of spurious new isoforms and degrade quantification accuracy. We show that these technical issues persist despite the recent improvements in long-read sequencing accuracy. Beyond the initial data pre-processing, downstream analysis lacks a statistical framework to quantify splicing variation within and between cells/spots. In light of these multiple challenges, we developed Longcell, a statistical framework and computational pipeline for isoform quantification using single cell and spatial spot barcoded Nanopore long-read sequencing data. Longcell performs computationally efficient cell/spot barcode extraction, UMI recovery, and UMI-based truncation- and mapping-error correction.
Through a statistical model that accounts for varying read coverage across cells/spots, Longcell rigorously quantifies the level of inter-cell/spot versus intra-cell/spot diversity in exon usage and detects changes in splicing distributions between cell populations. Applying Longcell to single cell long-read data from multiple contexts, we found that intra-cell splicing heterogeneity, where multiple isoforms co-exist within the same cell, is ubiquitous for highly expressed genes. On matched single cell and Visium long-read sequencing of a colorectal cancer liver metastasis, Longcell found concordant signals between the single cell and spatial data modalities. On Visium long-read sequencing data for multiple tissues, Longcell allows accurate identification of spatial isoform switching. Finally, in a perturbation experiment for 9 splicing factors, Longcell identified regulatory targets that were validated by targeted sequencing.
