ABSTRACT: Co-transcriptional RNA processing, including alternative splicing and polyadenylation, generates diverse mRNA isoforms essential for development, differentiation, and stress responses. We present Oxford Nanopore direct RNA sequencing (DRS) data from Arabidopsis thaliana wild-type (Col-0) plants and mutants defective in RNA processing pathways, including polyadenylation, RNA methylation, and pre-mRNA splicing. We obtained full-length reads together with native poly(A) tail information. The provided resources include FAST5 and FASTQ files, poly(A) tail length estimates from nanopolish, poly(A) composition data from ninetails, and differential expression analyses generated with DESeq2, all aligned to the TAIR10 reference transcriptome. These data offer a comprehensive view of RNA processing at the single-molecule level and can be reused to explore alternative polyadenylation, RNA modification patterns, or splicing changes across various genetic backgrounds. This resource is intended to support studies of gene expression regulation and RNA metabolism in Arabidopsis thaliana and related plant systems.
In this repository, the following data are provided:
Figure_1.pdf - Overview of RNA-processing factors, mutant lines, and nanopore DRS workflow.
a. Schematic representation of human U1 snRNP and RNA 3′-end processing machinery, according to So et al. (https://doi.org/10.1016/j.molcel.2019.08.007). Orthologs of factors mutated in this study are marked in red and yellow.
b. Arabidopsis genotypes sequenced in this study, including components of the AtCFI complex, U1 snRNP, and RNA methylation.
c. Representative phenotypes of plants included in this study, captured in three-week-old plants with a red 10 mm scale bar shown in each image.
d. Experimental workflow for nanopore direct RNA sequencing (DRS), including RNA extraction, library preparation, sequencing, basecalling, mapping, and poly(A) tail analysis, followed by potential downstream analyses such as differential expression, differential adenylation, non-adenosine profiling, and alternative polyadenylation.
Figure_2.pdf - Technical validation of the data: read quality, mapping summary, and sample-level variation across nanopore DRS libraries.
a. Read count (left) and read frequency (right) distributions for all samples across genotypes. Nanopolish quality-control tags are shown.
b. Summary of read assignment produced with featureCounts. Bars show the proportion of reads assigned to each category.
c. Principal component analysis (PCA) of log₂-scaled count data (DESeq2), illustrating sample clustering and variance structure across genotypes and replicates.
d. Transcript count correlation plot. Correlation analyses of read counts/transcripts were performed using Spearman’s rank correlation coefficient. Mapped reads fulfilling nanopolish quality criteria are included.
Figure_3.pdf - Distributions of poly(A)-tail lengths across Arabidopsis genotypes.
a. Poly(A)-tail length distributions in all datasets, shown either by individual sample (left) or by genotype group (right). Vertical dashed lines indicate median poly(A)-tail lengths for each genotype.
b. Poly(A)-tail length distributions for transcripts belonging to selected Gene Ontology (GO) categories, grouped by genotype. Shown are transcripts annotated with extracellular region (GO:0005576), extracellular matrix (GO:0031012), mRNA 3′-end processing (GO:0031124), mRNA modification (GO:0016556), spliceosome assembly (GO:0000245), and cis-splicing (GO:0045292). Median poly(A)-tail lengths for each genotype are indicated by vertical dashed lines.
Figure_4.pdf - Nucleotide composition of poly(A) tails.
a. Read classification generated by Ninetails for poly(A) tail composition analysis: left – detailed classification output, right – only decorated (containing non-adenosines) reads shown.
b. Frequency of occurrence of respective non-adenosines (cytidine, guanosine, uridine) in poly(A) tails reported by residue. In this reporting mode, each individual non-adenosine nucleotide contributes separately to the count; thus, a read containing three uridines contributes three events, while a read containing two cytidines, one guanosine, and one uridine contributes four events. This approach reflects the total abundance of non-adenosine nucleotides within poly(A) tails rather than the number of reads containing them. Only reads classified by the neural network (decorated and blank) were included. The methodology for non-adenosine reporting has been described in detail in our previous work.
c. Examples of poly(A) tail signals: left – blank (i.e. with adenosines exclusively), right – decorated tails (i.e. containing non-adenosines). Reads were randomly sampled from blank and decorated classes in the Col-0 rep1 sample.
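The per-residue reporting mode described for panel b can be illustrated with a minimal sketch. The input structure (one list of detected non-adenosine residues per read) and the function names are hypothetical and chosen only for illustration; they do not reflect the actual Ninetails output format or API.

```python
from collections import Counter

NON_ADENOSINES = ("C", "G", "U")


def count_by_residue(reads):
    """Per-residue mode: every non-adenosine residue is one event."""
    counts = Counter()
    for residues in reads:
        counts.update(r for r in residues if r in NON_ADENOSINES)
    return counts


def count_by_read(reads):
    """Per-read mode: a read counts once per nucleotide type it contains."""
    counts = Counter()
    for residues in reads:
        counts.update(set(residues) & set(NON_ADENOSINES))
    return counts


# Hypothetical example: the two reads mentioned in the legend above.
reads = [
    ["U", "U", "U"],       # three uridines -> three events
    ["C", "C", "G", "U"],  # two C, one G, one U -> four events
]

by_residue = count_by_residue(reads)
by_read = count_by_read(reads)

print(dict(by_residue))
print(dict(by_read))
```

In per-residue mode the two reads yield seven events in total, whereas per-read counting would register uridine only twice (once for each read that contains it).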
Table1_sample_metadata.xlsx - Excel file with 5 sheets:
sequencing_runs - metadata of runs included in this study, including conditions, replicates, accession numbers, reads produced, N50 and median PHRED score
samples_summary - poly(A) prediction summary for data aggregated by sample (biological replicate)
group_summary - poly(A) prediction summary for data aggregated by group (genotype)
transcripts_by_sample - poly(A) prediction summary for transcripts aggregated by sample (biological replicate)
transcripts_by_group - poly(A) prediction summary for transcripts aggregated by group (genotype)
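For readers unfamiliar with the N50 statistic reported in the sequencing_runs sheet, the sketch below shows the standard definition: the read length L such that reads of length L or longer contain at least half of all sequenced bases. The function name and the example lengths are illustrative assumptions; the values in the table were not necessarily computed with this code.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths."""
    if not lengths:
        raise ValueError("no read lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest read downwards until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical read lengths in nucleotides.
example_lengths = [100, 200, 300, 400, 500]
print(n50(example_lengths))
```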
Table2_polyadenylation_nanopolish.xlsx - Excel file with differential adenylation results (each mutant vs wild type; each sheet contains one such comparison). Statistical significance was assessed using the two-sided Wilcoxon rank-sum test (α = 0.05). Each sheet contains the following columns:
ensembl_transcript - transcript identifier in Ensembl format
p.value - raw p-value from the two-sided Wilcoxon rank-sum test (significance threshold α = 0.05)
stats_code - a quality indicator showing whether read coverage in both conditions was adequate to support reliable statistical inference
cohen_d - effect size (an auxiliary metric that helps discern transcripts with differences in poly(A) tail length between conditions, even when statistical significance may be driven primarily by high read counts)
*_counts - number of mapped reads in control and mutant samples, respectively
*_polya_gm_mean - geometric mean of poly(A) tail length in each condition, respectively
length_diff - a measure of the change in poly(A) tail length between control and mutant samples [nt]
fold_change - the magnitude of length_diff
padj - adjusted p-value (FDR-corrected for multiple testing)
effect_size - descriptive measure of the magnitude of the change
significance - categorical label (e.g., FDR<0.05 / NotSig) based on padj threshold indicating whether the observed difference is statistically significant
symbol - gene symbol
chr - chromosome identifier
ensembl_gene_id - gene identifier corresponding to the transcript in Ensembl format
transcript_biotype - classification of transcript type (protein_coding, lncRNA, pseudogene, etc.).
description - short functional summary of each gene
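To make the *_polya_gm_mean and cohen_d columns easier to interpret, the sketch below computes a geometric mean and a pooled-standard-deviation Cohen's d from two sets of per-read poly(A) tail lengths. This is a minimal illustration under stated assumptions: the sign convention (mutant minus control), the use of untransformed lengths, and the example values are assumptions, and the exact computation behind the table may differ.

```python
import math
import statistics


def cohen_d(control, mutant):
    """Cohen's d with a pooled standard deviation (assumed: mutant minus control)."""
    n1, n2 = len(control), len(mutant)
    v1 = statistics.variance(control)  # sample variance
    v2 = statistics.variance(mutant)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(mutant) - statistics.mean(control)) / pooled_sd


# Hypothetical poly(A) tail lengths [nt] for one transcript.
control_tails = [40, 50, 60]
mutant_tails = [60, 70, 80]

control_gm = statistics.geometric_mean(control_tails)
mutant_gm = statistics.geometric_mean(mutant_tails)
d = cohen_d(control_tails, mutant_tails)

print(f"control geometric mean: {control_gm:.1f} nt")
print(f"mutant geometric mean:  {mutant_gm:.1f} nt")
print(f"Cohen's d: {d:.2f}")
```

Because Cohen's d scales the difference by the spread of the data rather than by the number of reads, it stays small for highly covered transcripts whose tails differ only marginally, even when the p-value is low.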
Table3_differential_expression_deseq2.xlsx - Excel file with differential expression results (each mutant vs wild type; each sheet contains one such comparison). Inference was performed with DESeq2 using default settings. Each sheet contains the following columns:
ensembl_transcript - transcript identifier in Ensembl format
baseMean - mean normalized read count across all samples for each transcript
baseMean_1 - mean normalized read counts calculated separately for condition 1
baseMean_2 - mean normalized read counts calculated separately for condition 2
log2FoldChange - log-transformed fold change (base 2) between conditions
lfcSE - standard error associated with the log2FoldChange estimate
stat - Wald statistic, calculated as the ratio of log2FoldChange to its standard error
foldChange - fold change between conditions on the original (non-logarithmic) scale
pvalue - raw p-values from the Wald test
padj - adjusted p-value (FDR-corrected for multiple testing)
sig - categorical label (e.g., FDR<0.05 / Not Sig) based on padj threshold indicating whether the observed difference is statistically significant
symbol - gene symbol
chr - chromosome identifier
ensembl_gene_id - gene identifier corresponding to the transcript in Ensembl format
transcript_biotype - classification of transcript type (protein_coding, lncRNA, pseudogene, etc.).
description - short functional summary of each gene
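The relationships between several of the columns above can be checked with a short sketch: foldChange is 2 raised to log2FoldChange, stat is log2FoldChange divided by lfcSE, and the Wald p-value is the two-sided tail probability of stat under a standard normal distribution. A plain Benjamini-Hochberg adjustment is included for illustration; note that DESeq2 applies independent filtering by default, so padj values in the table need not match a Benjamini-Hochberg correction over all rows. Function names and example numbers are hypothetical.

```python
from statistics import NormalDist


def derived_columns(log2_fold_change, lfc_se):
    """Recompute foldChange, stat and the two-sided Wald p-value for one row."""
    fold_change = 2 ** log2_fold_change
    stat = log2_fold_change / lfc_se
    pvalue = 2 * (1 - NormalDist().cdf(abs(stat)))
    return fold_change, stat, pvalue


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Step from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical row: log2FoldChange = 1.0, lfcSE = 0.5
fc, stat, p = derived_columns(1.0, 0.5)
print(f"foldChange={fc:.2f} stat={stat:.2f} pvalue={p:.4f}")

# Hypothetical raw p-values for four transcripts.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```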
PycoQC_output.tar - interactive QC reports for sequencing runs included in this study
Nanopolish_output.tar - poly(A) tail length estimates produced with the nanopolish polya function
Ninetails_output.tar - results of poly(A) nucleotide composition predictions with Ninetails check_tails() function
FeatureCounts_output.tar - results of read-to-gene feature assignment with featureCounts from the Subread package, used for differential expression calculation with DESeq2
fastq_files.tar.gz - compressed FASTQ files, one per sample, containing reads that map to the TAIR10 reference.
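The archives listed above can be inspected without unpacking them. The sketch below uses Python's standard tarfile module to list the regular files inside an archive; the helper name is hypothetical, and the internal layout of the archives is not assumed, so the example only prints whatever member names it finds.

```python
import tarfile
from pathlib import Path


def list_archive(path):
    """Return the names of regular files stored in a .tar or .tar.gz archive."""
    # Mode "r:*" lets tarfile detect the compression transparently.
    with tarfile.open(path, "r:*") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


# Runs only if the archive has been downloaded to the working directory.
archive_path = Path("fastq_files.tar.gz")
if archive_path.exists():
    for name in list_archive(archive_path):
        print(name)
```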