Dataset Information

Arabidopsis nanopore direct RNA sequencing data for mutants in splicing, polyadenylation, and m6A methylation

ABSTRACT: Co-transcriptional RNA processing, including alternative splicing and polyadenylation, generates diverse mRNA isoforms essential for development, differentiation, and stress responses. We present Oxford Nanopore direct RNA sequencing (DRS) data from Arabidopsis thaliana wild-type (Col-0) plants and mutants defective in RNA processing pathways, including polyadenylation, RNA methylation, and pre-mRNA splicing. We obtained full-length reads together with native poly(A) tail information. The provided resources include FAST5 and FASTQ files, poly(A) tail length estimates from nanopolish, poly(A) composition data from ninetails, and differential expression analyses generated with DESeq2, all aligned to the TAIR10 reference transcriptome. These data offer a comprehensive view of RNA processing at the single-molecule level and can be reused to explore alternative polyadenylation, RNA modification patterns, or splicing changes across various genetic backgrounds. This resource is intended to support studies of gene expression regulation and RNA metabolism in Arabidopsis thaliana and related plant systems.

In this repository, the following data are provided:

Figure_1.pdf - Overview of RNA-processing factors, mutant lines, and nanopore DRS workflow.
  a. Schematic representation of human U1 snRNP and RNA 3′-end processing machinery, according to So et al. (https://doi.org/10.1016/j.molcel.2019.08.007). Orthologs of factors mutated in this study are marked in red and yellow.
  b. Arabidopsis genotypes sequenced in this study, including components of the AtCFI complex, U1 snRNP, and RNA methylation.
  c. Representative phenotypes of three-week-old plants included in this study. A red 10 mm scale bar is shown in each image.
  d. Experimental workflow for nanopore direct RNA sequencing (DRS), including RNA extraction, library preparation, sequencing, basecalling, mapping, and poly(A) tail analysis, followed by potential downstream analyses such as differential expression, differential adenylation, non-adenosine profiling, and alternative polyadenylation.

Figure_2.pdf - Technical validation of the data: read quality, mapping summary, and sample-level variation across nanopore DRS libraries.
  a. Read count (left) and read frequency (right) distributions for all samples across genotypes. Nanopolish quality-control tags are shown.
  b. Summary of read assignment produced with featureCounts. Bars show the proportion of reads classified as each type.
  c. Principal component analysis (PCA) of log₂-scaled count data (DESeq2), illustrating sample clustering and variance structure across genotypes and replicates.
  d. Transcript count correlation plot. Correlations of per-transcript read counts were calculated using Spearman's rank correlation coefficient. Mapped reads fulfilling nanopolish quality criteria are included.

Figure_3.pdf - Distributions of poly(A)-tail lengths across Arabidopsis genotypes.
  a. Poly(A)-tail length distributions in all datasets, shown either by individual sample (left) or by genotype group (right). Vertical dashed lines indicate median poly(A)-tail lengths for each genotype.
  b. Poly(A)-tail length distributions for transcripts belonging to selected Gene Ontology (GO) categories, grouped by genotype. Shown are transcripts annotated with extracellular region (GO:0005576), extracellular matrix (GO:0031012), mRNA 3′-end processing (GO:0031124), mRNA modification (GO:0016556), spliceosome assembly (GO:0000245), and cis-splicing (GO:0045292). Median poly(A)-tail lengths for each genotype are indicated by vertical dashed lines.

Figure_4.pdf - Nucleotide composition of poly(A) tails.
  a. Read classification generated by Ninetails for poly(A) tail composition analysis. Left: detailed classification output. Right: only decorated reads (those containing non-adenosines).
  b. Frequency of the respective non-adenosines (cytidine, guanosine, uridine) in poly(A) tails, reported by residue. In this reporting mode, each individual non-adenosine nucleotide contributes separately to the count; thus, a read containing three uridines contributes three events, while a read containing two cytidines, one guanosine, and one uridine contributes four events. This approach reflects the total abundance of non-adenosine nucleotides within poly(A) tails rather than the number of reads containing them. Only reads classified by the neural network (decorated and blank) were included. The methodology for non-adenosine reporting has been described in detail in our previous work.
  c. Examples of poly(A) tail signals. Left: blank tails (i.e. containing adenosines exclusively). Right: decorated tails (i.e. containing non-adenosines). Reads were randomly sampled from the blank and decorated classes in the Col-0 rep1 sample.

Table1_sample_metadata.xlsx - Excel file with 5 sheets:
  sequencing_runs - metadata of runs included in this study, including conditions, replicates, accession numbers, reads produced, N50, and median PHRED score
  samples_summary - poly(A) prediction summary for data aggregated by sample (biological replicate)
  group_summary - poly(A) prediction summary for data aggregated by group (genotype)
  transcripts_by_sample - poly(A) prediction summary for transcripts aggregated by sample (biological replicate)
  transcripts_by_group - poly(A) prediction summary for transcripts aggregated by group (genotype)

Table2_polyadenylation_nanopolish.xlsx - Excel file with differential adenylation results (each mutant vs wild type; each sheet contains one such comparison). Statistical significance was assessed using the two-sided Wilcoxon rank-sum test (α = 0.05). Each sheet contains the following columns:
  ensembl_transcript - transcript identifier in Ensembl format
  p.value - p-value from a two-sided Wilcoxon rank-sum test (α = 0.05)
  stats_code - a quality indicator showing whether read coverage in both conditions was adequate to support reliable statistical inference
  cohen_d - effect size (an auxiliary metric that helps discern transcripts with differences in poly(A) tail length between conditions, even when statistical significance may be driven primarily by high read counts)
  *_counts - number of mapped reads in control and mutant samples, respectively
  *_polya_gm_mean - geometric mean of poly(A) tail length in each condition
  length_diff - change in poly(A) tail length between control and mutant samples [nt]
  fold_change - the magnitude of length_diff
  padj - adjusted p-value (FDR-corrected), controlling for multiple testing
  effect_size - descriptive measure of the magnitude of the change
  significance - categorical label (e.g., FDR<0.05 / NotSig) based on a padj threshold, indicating whether the observed difference is statistically significant
  symbol - gene symbol
  chr - chromosome identifier
  ensembl_gene_id - gene identifier corresponding to the transcript, in Ensembl format
  transcript_biotype - classification of transcript type (protein_coding, lncRNA, pseudogene, etc.)
  description - short functional summary of each gene

Table3_differential_expression_deseq2.xlsx - Excel file with differential expression results (each mutant vs wild type; each sheet contains one such comparison). Inference was performed with DESeq2 using default settings. Each sheet contains the following columns:
  ensembl_transcript - transcript identifier in Ensembl format
  baseMean - mean normalized read count across all samples for each transcript
  baseMean_1 - mean normalized read count calculated separately for condition 1
  baseMean_2 - mean normalized read count calculated separately for condition 2
  log2FoldChange - log-transformed fold change (base 2) between conditions
  lfcSE - standard error associated with the log2FoldChange estimate
  stat - Wald statistic, calculated as the ratio of log2FoldChange to its standard error
  foldChange - fold change between conditions on the original (non-logarithmic) scale
  pvalue - raw p-value from the Wald test
  padj - adjusted p-value (FDR-corrected), controlling for multiple testing
  sig - categorical label (e.g., FDR<0.05 / Not Sig) based on a padj threshold, indicating whether the observed difference is statistically significant
  symbol - gene symbol
  chr - chromosome identifier
  ensembl_gene_id - gene identifier corresponding to the transcript, in Ensembl format
  transcript_biotype - classification of transcript type (protein_coding, lncRNA, pseudogene, etc.)
  description - short functional summary of each gene

PycoQC_output.tar - interactive QC reports for sequencing runs included in this study
Nanopolish_output.tar - results of poly(A) length estimation with the nanopolish polya function
Ninetails_output.tar - results of poly(A) nucleotide composition predictions with the Ninetails check_tails() function
FeatureCounts_output.tar - results of read-to-gene-feature assignment with featureCounts (Subread package), used for differential expression calculation with DESeq2
fastq_files.tar.gz - compressed FASTQ files of reads mapping to the TAIR10 reference, for each sample
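
The by-residue reporting mode described for Figure_4.pdf panel b can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each poly(A) tail is available as a plain string of A/C/G/U characters; this is a simplification and not the Ninetails output format, and the function names and example tails are hypothetical.

```python
from collections import Counter

NON_ADENOSINES = "CGU"


def count_non_adenosines_by_residue(tails):
    """Count every non-adenosine in every tail as a separate event."""
    events = Counter()
    for tail in tails:
        for base in tail.upper():
            if base in NON_ADENOSINES:
                events[base] += 1
    return events


def count_decorated_reads(tails):
    """Count reads carrying at least one non-adenosine (by-read view)."""
    return sum(
        1 for tail in tails
        if any(base in NON_ADENOSINES for base in tail.upper())
    )


# Hypothetical tails, written as strings for illustration only.
example_tails = [
    "AAAAUAAAUAAAUAAA",  # three uridines: three events
    "AACAACAAGAAUAA",    # two C, one G, one U: four events
    "AAAAAAAAAAAA",      # blank tail: no events
]

by_residue = count_non_adenosines_by_residue(example_tails)
total_events = sum(by_residue.values())
decorated_reads = count_decorated_reads(example_tails)
print(dict(by_residue), total_events, decorated_reads)
```

In this toy input, two decorated reads yield seven events, which is why by-residue frequencies reflect nucleotide abundance rather than the number of reads containing non-adenosines.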
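
Several columns of Table2_polyadenylation_nanopolish.xlsx can be related to each other with a short sketch. The code below is an assumption-laden illustration, not the pipeline used to build the table: it takes length_diff to be the mutant geometric mean minus the control geometric mean, computes Cohen's d with a pooled standard deviation, and uses the Benjamini-Hochberg procedure for the FDR correction. The tail lengths and p-values are invented.

```python
import math
from statistics import geometric_mean, mean, variance


def cohen_d(control, mutant):
    """Cohen's d with a pooled standard deviation (mutant minus control)."""
    n1, n2 = len(control), len(mutant)
    pooled_var = (
        (n1 - 1) * variance(control) + (n2 - 1) * variance(mutant)
    ) / (n1 + n2 - 2)
    return (mean(mutant) - mean(control)) / math.sqrt(pooled_var)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-read poly(A) tail lengths [nt] for one transcript.
control_tails = [90.0, 100.0, 110.0]
mutant_tails = [60.0, 70.0, 80.0]

control_gm = geometric_mean(control_tails)
mutant_gm = geometric_mean(mutant_tails)
length_diff = mutant_gm - control_gm  # assumed sign convention
effect = cohen_d(control_tails, mutant_tails)

# Hypothetical raw p-values for four transcripts.
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(round(length_diff, 2), round(effect, 2), padj)
```

Reporting an effect size next to padj is useful because, with thousands of reads per transcript, even a shift of a few nucleotides can reach statistical significance.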
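
The relationships between the stat, foldChange, and pvalue columns of Table3_differential_expression_deseq2.xlsx can be checked with a few lines of Python. This sketch assumes that the Wald statistic is compared against a standard normal distribution to obtain a two-sided p-value; the helper name and the input values are hypothetical, and padj is not reproduced here because it depends on all transcripts in a comparison.

```python
import math


def wald_summary(log2_fold_change, lfc_se):
    """Derive stat, foldChange, and a two-sided normal p-value."""
    stat = log2_fold_change / lfc_se
    fold_change = 2.0 ** log2_fold_change
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    pvalue = math.erfc(abs(stat) / math.sqrt(2.0))
    return {"stat": stat, "foldChange": fold_change, "pvalue": pvalue}


# Hypothetical row: a two-fold increase with a standard error of 0.25.
row = wald_summary(log2_fold_change=1.0, lfc_se=0.25)
print(row)
```

A check of this kind can help confirm that a sheet was read correctly, for example that the sign of log2FoldChange matches the direction implied by baseMean_1 and baseMean_2.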

ORGANISM(S): Arabidopsis thaliana (thale cress)

SUBMITTER: Natalia Gumińska

PROVIDER: S-BSST3126 | biostudies-other

REPOSITORIES: biostudies-other

Similar Datasets

Arabidopsis nanopore direct RNA sequencing data for mutants in polyadenylation methylation and splicing
| PRJEB103075 | ENA

Direct sequencing of Arabidopsis thaliana RNA reveals patterns of cleavage and polyadenylation.
| S-EPMC3533403 | biostudies-literature

Detecting m6A methylation regions from Methylated RNA Immunoprecipitation Sequencing.
| S-EPMC9991887 | biostudies-literature

Direct RNA Sequencing Reveals SARS-CoV-2 m6A Sites and Possible Differential DRACH Motif Methylation among Variants.
| S-EPMC8620083 | biostudies-literature

Mapping alternative polyadenylation in human cells using direct RNA sequencing technology.
| S-EPMC10362186 | biostudies-literature

REPAC: analysis of alternative polyadenylation from RNA-sequencing data.
| S-EPMC9912678 | biostudies-literature

High-throughput m6A-seq reveals RNA m6A methylation patterns in the chloroplast and mitochondria transcriptomes of Arabidopsis thaliana.
| S-EPMC5683568 | biostudies-literature

Benchmarking of computational methods for m6A profiling with Nanopore direct RNA sequencing.
| S-EPMC10818168 | biostudies-literature

Alteration of RNA m6A methylation mediates aberrant RNA binding protein expression and alternative splicing in condyloma acuminatum.
| S-EPMC11114121 | biostudies-literature

DIPAN: Detecting personalized intronic polyadenylation derived neoantigens from RNA sequencing data.
| S-EPMC11112131 | biostudies-literature