Project description:Long-read RNA sequencing technologies offer unparalleled in- sights into transcriptomes by enabling full-length sequencing of RNA molecules, uncovering novel isoforms and alternative splicing events. While long-read sequencing platforms, such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), have historically been associated with higher error rates, recent advancements in both platforms have significantly en- hanced read accuracy, broadening their applicability for tran- scriptomic studies. With the rapid evolution of sequencing protocols and bioin- formatics tools, the trade-offs between sequencing throughput, read length, accuracy, and cost present significant challenges in selecting the optimal approach. Systematic benchmarking studies that compare these options are crucial to inform fu- ture research directions. However, many existing benchmark- ing datasets with matched data across multiple platforms have limitations, including: 1) a lack of realistic biological replicates, which may restrict the generalisability of differential analysis results to real-world scenarios, and 2) the use of earlier sequenc- ing kits, which may not reflect the latest advancements in se- quencing technology, limiting their relevance for future studies that typically use newer sequencing protocols. Here we present LongBench, a comprehensive benchmarking dataset designed to fill these critical gaps. Derived from eight lung cancer cell lines with synthetic RNA spike-ins, LongBench includes bulk, single-cell, and single-nucleus RNA-seq data from three state-of-the-art long-read sequencing platforms — ONT PCR-cDNA, ONT direct RNA, PacBio Kinnex — alongside Il- lumina short-read data for robust cross-platform comparisons. The LongBench dataset is a valuable resource for benchmarking and improving sequencing protocols and bioinformatics tools. With the LongBench dataset we present a systematic evaluation of transcript capture, quantification, and differential expression analyses, examining the strengths and limitations of each se- quencing platform in various biological contexts, enabling re- searchers to make more informed decisions on platform and method selection.
2025-09-01 | GSE303762 | GEO
Project description:Benchmarking long-read RNA-sequencing technologies with LongBench: a multi-group dataset spanning bulk and single-cell modalities
Project description:The Long-read POG dataset comprises a cohort of 189 patient tumours and 41 matched normal samples sequenced using the Oxford Nanopore Technologies PromethION platform. This dataset from the Personalized Oncogenomics (POG) program and the Marathon of Hope Cancer Centres Network includes accompanying DNA and RNA short-read sequence data, analytics, and clinical information. We show the potential of long-read sequencing for resolving complex cancer-related structural variants, viral integrations, and extrachromosomal circular DNA. Long-range phasing of variants facilitates the discovery of allelically differentially methylated regions (aDMRs) and allele-specific expression, including recurrent aDMRs in the cancer genes RET and CDKN2A. Germline promoter methylation in MLH1 can be directly observed in Lynch syndrome. Promoter methylation in BRCA1 and RAD51C is a likely driver behind patterns of homologous recombination deficiency where no driver mutation was found. This dataset demonstrates applications for long-read sequencing in precision medicine, and is available as a resource for developing analytical approaches using this technology.
Project description:Ongoing improvements to next generation sequencing technologies are leading to longer sequencing read lengths, but a thorough understanding of the impact of longer reads on RNA sequencing analyses is lacking. To address this issue, we generated and compared two RNA sequencing datasets of differing read lengths -- 2x75 bp (L75) and 2x262 bp (L262) -- and investigated the impact of read length on various aspects of analysis, including the performance of currently available read-mapping tools, gene and transcript quantification, and detection of allele-specific expression patterns. Our results indicate that, while the scalability of read-mapping tools and the cost-effectiveness of long read protocol is an issue that requires further attention, longer reads enable more accurate quantification of diverse aspects of gene expression, including individual-specific patterns of allele-specific expression and alternative splicing. Two RNA-Seq datasets of differing read lengths (2x262 bp and 2x75 bp)
Project description:Long-read sequencing technologies such as Iso-Seq (PacBio Inc.) generate highly accurate sequences of full-length mRNA transcript isoforms. Long-read transcriptomics may be especially useful in the context of lymphocyte functional plasticity as it relates to human health and disease. However, no long-read isoform-aware reference transcriptomes of human circulating lymphocytes seem to be publicly available despite being valuable as benchmarks in a variety of transcriptomic studies. To begin to fill this gap, we purified four lymphocyte subsets (CD4 T, CD8 T, NK, and Pan B cells) from the peripheral blood of a healthy male donor and obtained high-quality RNA (RIN>8) for PacBio Iso-Seq analysis and parallel RNA-Seq analysis.
Project description:Long-read sequencing technologies such as Iso-Seq (PacBio Inc.) generate highly accurate sequences of full-length mRNA transcript isoforms. Long-read transcriptomics may be especially useful in the context of lymphocyte functional plasticity as it relates to human health and disease. However, no long-read isoform-aware reference transcriptomes of human circulating lymphocytes seem to be publicly available despite being valuable as benchmarks in a variety of transcriptomic studies. To begin to fill this gap, we purified four lymphocyte subsets (CD4 T, CD8 T, NK, and Pan B cells) from the peripheral blood of a healthy male donor and obtained high-quality RNA (RIN>8) for PacBio Iso-Seq analysis and parallel RNA-Seq analysis.
Project description:Ongoing improvements to next generation sequencing technologies are leading to longer sequencing read lengths, but a thorough understanding of the impact of longer reads on RNA sequencing analyses is lacking. To address this issue, we generated and compared two RNA sequencing datasets of differing read lengths -- 2x75 bp (L75) and 2x262 bp (L262) -- and investigated the impact of read length on various aspects of analysis, including the performance of currently available read-mapping tools, gene and transcript quantification, and detection of allele-specific expression patterns. Our results indicate that, while the scalability of read-mapping tools and the cost-effectiveness of long read protocol is an issue that requires further attention, longer reads enable more accurate quantification of diverse aspects of gene expression, including individual-specific patterns of allele-specific expression and alternative splicing.
Project description:Adenovirus is a common human pathogen that relies on host cell processes for transcription and processing of viral RNA and protein production. Although adenoviral promoters, splice junctions, and cleavage and polyadenylation sites have been characterized using low-throughput biochemical techniques or short read cDNA-based sequencing, these technologies do not fully capture the complexity of the adenoviral transcriptome. By combining Illumina short-read and nanopore long-read direct RNA sequencing approaches, we mapped transcription start sites and cleavage and polyadenylation sites across the adenovirus genome. In addition to confirming the known canonical viral early and late RNA cassettes, our analysis of splice junctions within long RNA reads revealed an additional 35 novel viral transcripts. These RNAs include fourteen new splice junctions which lead to expression of canonical open reading frames (ORF), six novel ORF-containing transcripts, and fifteen transcripts encoding for messages that potentially alter protein functions through truncations or fusion of canonical ORFs. In addition, we also detect RNAs that bypass canonical cleavage sites and generate potential chimeric proteins by linking separate gene transcription units. Of these, an evolutionary conserved protein was detected containing the N-terminus of E4orf6 fused to the downstream DBP/E2A ORF. Loss of this novel protein, E4orf6/DBP, was associated with aberrant viral replication center morphology and poor viral spread. Our work highlights how long-read sequencing technologies can reveal further complexity within viral transcriptomes.