Project description: Long-read RNA sequencing technologies offer unparalleled insights into transcriptomes by enabling full-length sequencing of RNA molecules, uncovering novel isoforms and alternative splicing events. While long-read sequencing platforms, such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), have historically been associated with higher error rates, recent advancements in both platforms have significantly enhanced read accuracy, broadening their applicability for transcriptomic studies. With the rapid evolution of sequencing protocols and bioinformatics tools, the trade-offs between sequencing throughput, read length, accuracy, and cost present significant challenges in selecting the optimal approach. Systematic benchmarking studies that compare these options are crucial to inform future research directions. However, many existing benchmarking datasets with matched data across multiple platforms have limitations, including: 1) a lack of realistic biological replicates, which may restrict the generalisability of differential analysis results to real-world scenarios, and 2) the use of earlier sequencing kits, which may not reflect the latest advancements in sequencing technology, limiting their relevance for future studies that typically use newer sequencing protocols. Here we present LongBench, a comprehensive benchmarking dataset designed to fill these critical gaps. Derived from eight lung cancer cell lines with synthetic RNA spike-ins, LongBench includes bulk, single-cell, and single-nucleus RNA-seq data from three state-of-the-art long-read sequencing platforms (ONT PCR-cDNA, ONT direct RNA, and PacBio Kinnex), alongside Illumina short-read data for robust cross-platform comparisons. The LongBench dataset is a valuable resource for benchmarking and improving sequencing protocols and bioinformatics tools.
With the LongBench dataset we present a systematic evaluation of transcript capture, quantification, and differential expression analyses, examining the strengths and limitations of each sequencing platform in various biological contexts, enabling researchers to make more informed decisions on platform and method selection.

Project description: Long-read sequencing has become a powerful tool for alternative splicing analysis. However, technical and computational challenges have limited our ability to couple long-read sequencing with single cell and spatial barcoding to explore alternative splicing in the single cell and spatial setting. Although Nanopore-based long-read sequencing has been applied to single cell and spatially barcoded libraries, technical issues have hindered accurate isoform-level quantification in such settings. First, the relatively high sequencing error rate of Nanopore long reads, despite recent improvements, limits the accuracy of cell barcode and unique molecular identifier (UMI) recovery, a necessary first step in the analysis of single cell/spatial sequencing data. Second, read truncation and mapping errors, the latter exacerbated by the higher sequencing error rates, lead to the false detection of spurious new isoforms and further degrade quantification accuracy. We show that these technical issues persist despite the recent improvements in long-read sequencing accuracy. Beyond the initial data pre-processing, downstream analysis lacks a statistical framework to quantify splicing variation within and between cells/spots. In light of these multiple challenges, we developed Longcell, a statistical framework and computational pipeline for isoform quantification using single cell and spatial spot barcoded Nanopore long-read sequencing data. Longcell performs computationally efficient cell/spot barcode extraction, UMI recovery, and UMI-based truncation- and mapping-error correction.
Through a statistical model that accounts for varying read coverage across cells/spots, Longcell rigorously quantifies the level of inter-cell/spot versus intra-cell/spot diversity in exon usage and detects changes in splicing distributions between cell populations. Applying Longcell to single cell long-read data from multiple contexts, we found that intra-cell splicing heterogeneity, where multiple isoforms co-exist within the same cell, is ubiquitous for highly expressed genes. On matched single cell and Visium long-read sequencing data for a tissue sample of colorectal cancer metastasis to the liver, Longcell found concordant signals between the single cell and spatial data modalities. On Visium long-read sequencing data for multiple tissues, Longcell allows accurate identification of spatial isoform switching. Finally, in a perturbation experiment for 9 splicing factors, Longcell identified regulatory targets that were validated by targeted sequencing.
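To make the distinction between intra-cell and inter-cell diversity in exon usage concrete, the following is a minimal sketch based on the law of total variance. It is not Longcell's statistical model: it ignores the sampling noise from varying read coverage that the description says Longcell accounts for. The function name `decompose_exon_usage` and the input layout (per-cell counts of UMIs including versus skipping one exon) are assumptions made for illustration.

```python
from typing import Dict, Tuple


def decompose_exon_usage(
    counts: Dict[str, Tuple[int, int]]
) -> Tuple[float, float, float]:
    """Split the variance of a per-molecule exon-inclusion indicator.

    counts maps a cell barcode to (molecules including the exon,
    molecules skipping the exon). This layout is hypothetical.

    Returns (total, within_cell, between_cell), where
    total == within_cell + between_cell by the law of total variance.
    """
    total_molecules = sum(inc + exc for inc, exc in counts.values())
    if total_molecules == 0:
        raise ValueError("no molecules observed for this exon")

    # Pooled inclusion ratio across all cells.
    pooled_psi = sum(inc for inc, _ in counts.values()) / total_molecules

    within = 0.0   # isoforms co-existing inside the same cell
    between = 0.0  # cells differing from each other in exon usage
    for inc, exc in counts.values():
        n = inc + exc
        if n == 0:
            continue  # a cell without coverage contributes nothing
        psi = inc / n
        weight = n / total_molecules
        within += weight * psi * (1.0 - psi)
        between += weight * (psi - pooled_psi) ** 2

    total = pooled_psi * (1.0 - pooled_psi)
    return total, within, between


# Every cell expresses both isoforms equally: all variance is intra-cell.
mixed = decompose_exon_usage({"cellA": (5, 5), "cellB": (5, 5)})

# Each cell expresses a single isoform: all variance is inter-cell.
switched = decompose_exon_usage({"cellA": (10, 0), "cellB": (0, 10)})
```

In both toy inputs the pooled inclusion ratio is 0.5, so the total variance is identical; only the split between the within-cell and between-cell terms differs. With low coverage per cell, the raw per-cell ratios are noisy and this naive split overstates between-cell variance, which is one reason a coverage-aware model is needed.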

Dataset Information

Longread_umi
