De novo assembly of wheat root transcriptomes and transcriptional signature of longitudinal differentiation.
ABSTRACT: Hidden underground, root systems are an important part of the plant for development, nourishment and sensing the surrounding soil environment, but we know very little about their genetic regulation in crop plants like wheat. In the present study, we de novo assembled the root transcriptomes of the reference cultivar Chinese Spring from RNA-seq reads generated by the 454-GS-FLX and HiSeq platforms. The FLX reads were assembled into 24,986 transcripts with a completeness of 54.84%, and the HiSeq reads were assembled into 91,543 high-confidence protein-coding transcripts, 2,404 low-confidence protein-coding transcripts, and 13,181 non-coding transcripts with a completeness of >90%. Combining the FLX and HiSeq assemblies, we assembled a root transcriptome of 92,335 ORF-containing transcripts. Approximately 7% of the coding transcripts and ~2% of the non-coding transcripts are not present in the current wheat genome assembly. Functional annotation of both assemblies showed similar gene ontology patterns and indicated that ~7% of coding and >5% of non-coding transcripts are root-specific. Transcript quantification identified 1,728 differentially expressed transcripts between root tips and the maturation zone, and functional annotation of these transcripts captured a transcriptional signature of longitudinal development of the wheat root. With the transcriptomic resources developed, this study provides the first view of the wheat root transcriptome across different developmental zones and lays a foundation for molecular studies of wheat root development and growth using a reverse genetic approach.
Project description:PacBio long reads sequencing presents several potential advantages for DNA assembly, including being able to provide more complete gene profiling of metagenomic samples. However, lower single-pass accuracy can make gene discovery and assembly for low-abundance organisms difficult. To evaluate the application and performance of PacBio long reads and Illumina HiSeq short reads in metagenomic analyses, we directly compared various assemblies involving PacBio and Illumina sequencing reads based on two anaerobic digestion microbiome samples from a biogas fermenter. Using a PacBio platform, 1.58 million long reads (19.6 Gb) were produced with an average length of 7,604 bp. Using an Illumina HiSeq platform, 151.2 million read pairs (45.4 Gb) were produced. Hybrid assemblies using PacBio long reads and HiSeq contigs produced improvements in assembly statistics, including an increase in the average contig length, contig N50 size, and number of large contigs. Interestingly, depth-based hybrid assemblies generated a higher percentage of complete genes (98.86%) compared to those based on HiSeq contigs only (40.29%), because the PacBio reads were long enough to cover many repeating short elements and capture multiple genes in a single read. Additionally, the incorporation of PacBio long reads led to considerable advantages regarding reducing contig numbers and increasing the completeness of the genome reconstruction, which was poorly assembled and binned when using HiSeq data alone. From this comparison of PacBio long reads with Illumina HiSeq short reads related to complex microbiome samples, we conclude that PacBio long reads can produce longer contigs, more complete genes, and better genome binning, thereby offering more information about metagenomic samples.
Project description:Wheat is a staple food worldwide and provides 40% of the calories in the diet. Climate change and global warming pose a threat to wheat production and demand a deeper understanding of how heat stress might impact wheat production and wheat biology. However, it is difficult to identify novel heat stress-associated genes when genomic information is not available. Wheat has a very large and complex genome that is about 37 times the size of the rice genome. The present study sequenced the whole transcriptome of the wheat cv. HD2329 at the flowering stage, under control (22 ± 3 °C) and heat stress (42 °C, 2 h) conditions using Illumina HiSeq and Roche GS-FLX 454 platforms. We assembled more than 26.3 and 25.6 million high-quality reads from the control and HS-treated tissue transcriptome sequences, respectively. About 76,556 (control) and 54,033 (HS-treated) contigs were assembled and annotated de novo using different assemblers, and a total of 21,529 unigenes were obtained. The gene expression profile showed significant differential expression of 1525 transcripts under heat stress, of which 27 transcripts showed very high (>10) fold upregulation. Cellular processes such as metabolic processes, protein phosphorylation, and oxidation-reduction, among others, were highly influenced by heat stress. In summary, these observations significantly enrich the transcript dataset of wheat available in the public domain and show a de novo approach to discover the heat-responsive transcripts of wheat, which can accelerate the progress of wheat stress-genomics as well as the course of wheat breeding programs in the era of climate change.
Project description:BACKGROUND:Panax ginseng Meyer is a traditional medicinal plant famous for its strong therapeutic effects and serves as an important herbal medicine. To understand and manipulate genes involved in secondary metabolic pathways including ginsenosides, transcriptome profiling of P. ginseng is essential. METHODS:RNA-seq analysis of adventitious roots of two P. ginseng cultivars, Chunpoong (CP) and Cheongsun (CS), was performed using the Illumina HiSeq platform. After transcripts were assembled, expression profiling was performed. RESULTS:Assemblies were generated from ~85 million and ~77 million high-quality reads from CP and CS cultivars, respectively. A total of 35,527 and 27,716 transcripts were obtained from the CP and CS assemblies, respectively. Annotation of the transcriptomes showed that approximately 90% of the transcripts had significant matches in public databases. We identified several candidate genes involved in ginsenoside biosynthesis. In addition, a large number of transcripts (17%) with different gene ontology designations were uniquely detected in adventitious roots compared to normal ginseng roots. CONCLUSION:This study will provide a comprehensive insight into the transcriptome of ginseng adventitious roots, and a way for successful transcriptome analysis and profiling of resource plants with less genomic information. The transcriptome profiling data generated in this study are available in our newly created adventitious root transcriptome database (http://im-crop.snu.ac.kr/transdb/index.php) for public use.
Project description:Next-generation sequencing platforms have recently been used to rapidly characterize transcriptome sequences from a number of non-model organisms. The present study compares two of the most frequently used platforms, Roche 454 pyrosequencing and Illumina sequencing-by-synthesis (SBS), on the same RNA sample obtained from an intertidal gastropod mollusc species, Haliotis midae. All sequencing reads were deposited in the Short Read Archive (SRA) database of NCBI and are retrievable under the accession numbers SRR071314 (Illumina Genome Analyzer II) and SRR1737738, SRR1737737, SRR1737735, SRR1737734 (454 GS FLX). Three transcriptomes, composed of either pure 454 or Illumina reads or a mixture of read types (Hybrid), were assembled using CLC Genomics Workbench software. Illumina assemblies performed the best de novo transcriptome characterization in terms of contig length, whereas the 454 assemblies tended to improve the complete assembly of gene transcripts. Both the Hybrid and Illumina assemblies produced longer contigs covering more of the transcriptome than 454 assemblies. However, the addition of 454 reads significantly increased the number of genes annotated.
Project description:For this project, we have sequenced, assembled and annotated a transcriptome of the diploid wheat Triticum urartu accession PI 428198. The sequencing libraries were prepared from shoot and root tissues harvested from 2-3-week-old seedlings. All sequencing was carried out on the Illumina HiSeq platform using the 100 bp paired-end protocol (248.5 million reads). The assembly was constructed using a multiple k-mer approach with a de novo assembly algorithm implemented in CLC Genomics Workbench 5.5 and additional redundancy reduction with the CD-HIT and blast2cap3 programs. Open reading frames and proteins were predicted using BLASTX searches and the findorf algorithm.
Project description:Metatranscriptomics has recently been applied to investigate the active biogeochemical processes and elemental cycles, and in situ responses of microbiomes to environmental stimuli and stress factors. De novo assembly of RNA-Sequencing (RNA-Seq) data can reveal a more detailed description of the metabolic interactions amongst the active microbial communities. However, the quality of the assemblies and the depiction of the metabolic network provided by various de novo assemblers have not yet been thoroughly assessed. In this study, we compared 15 de novo metatranscriptomic assemblies for a fracture fluid sample collected from a borehole located at 1.34 km below land surface in a South African gold mine. These assemblies were constructed from total, non-coding, and coding reads using five de novo transcriptomic assemblers (Trans-ABySS, Trinity, Oases, IDBA-tran, and Rockhopper). They were evaluated based on the number of transcripts, transcript length, range of transcript coverage, continuity, percentage of transcripts with confident annotation assignments, as well as taxonomic and functional diversity patterns. The results showed that these parameters varied considerably among the assemblies, with Trans-ABySS and Trinity generating the best assemblies for non-coding and coding RNA reads, respectively, because the high number of transcripts assembled covered a wide expression range, and captured extensively the taxonomic and metabolic gene diversity, respectively. We concluded that the choice of de novo transcriptomic assemblers impacts substantially the taxonomic and functional compositions. Care should be taken to obtain high-quality assemblies for informing the in situ metabolic landscape.
Project description:BACKGROUND:Salmonid fishes exhibit high levels of phenotypic and ecological variation and are thus ideal model systems for studying evolutionary processes of adaptive divergence and speciation. Furthermore, salmonids are of major interest in fisheries, aquaculture, and conservation research. Improving understanding of the genetic mechanisms underlying traits in these species would significantly progress research in these fields. Here we generate high quality de novo transcriptomes for four salmonid species: Atlantic salmon (Salmo salar), brown trout (Salmo trutta), Arctic charr (Salvelinus alpinus), and European whitefish (Coregonus lavaretus). All species except Atlantic salmon have no reference genome publicly available and few if any genomic studies to date. RESULTS:We used paired-end RNA-seq on Illumina to generate high coverage sequencing of multiple individuals, yielding between 180 and 210 M reads per species. After initial assembly, strict filtering was used to remove duplicated, redundant, and low confidence transcripts. The final assemblies consisted of 36,505 protein-coding transcripts for Atlantic salmon, 35,736 for brown trout, 33,126 for Arctic charr, and 33,697 for European whitefish and are made publicly available. Assembly completeness was assessed using three approaches, all of which supported high quality of the assemblies: 1) ~78% of Actinopterygian single-copy orthologs were successfully captured in our assemblies, 2) orthogroup inference identified high overlap in the protein sequences present across all four species (40% shared across all four and 84% shared by at least two), and 3) comparison with the published Atlantic salmon genome suggests that our assemblies represent well covered (~98%) protein-coding transcriptomes. Thorough comparison of the generated assemblies found that 84-90% of transcripts in each assembly were orthologous with at least one of the other three species. 
We also identified 34-37% of transcripts in each assembly as paralogs. We further compare completeness and annotation statistics of our new assemblies to available related species. CONCLUSION:New, high-confidence protein-coding transcriptomes were generated for four ecologically and economically important species of salmonids. This offers a high quality pipeline for such complex genomes, represents a valuable contribution to the existing genomic resources for these species and provides robust tools for future investigation of gene expression and sequence evolution in these and other salmonid species.
Project description:Microalgae are photosynthetic organisms with cosmopolitan distribution (i.e., marine, freshwater and terrestrial habitats) and possess a great diversity of species and consequently an immense variation in biochemical compositions. To date, genomic information is available mainly from the model green microalga Chlamydomonas reinhardtii. Here we provide the dataset of a de novo assembly and functional annotation of the transcriptomes of three native oleaginous microalgae from the Peruvian Amazon. Native oleaginous microalgae species Ankistrodesmus sp., Chlorella sp., and Scenedesmus sp. were cultured in triplicate using Chu-10 medium with or without a source of nitrate (NaNO3). Total RNA was purified, and the cDNA libraries were constructed and sequenced as paired-end reads on an Illumina HiSeq™ 2500 platform. Transcriptomes were de novo assembled using Trinity v2.9.1. A total of 48,554 transcripts (range from 250 to 7966 bp; N50 = 1047) for Ankistrodesmus sp., 108,126 transcripts (range from 250 to 8160 bp; N50 = 1090) for Chlorella sp., and 77,689 transcripts (range from 250 to 8481 bp; N50 = 1281) for Scenedesmus sp. were de novo assembled. Completeness of the assembled transcriptomes was evaluated with the Benchmarking Universal Single-Copy Orthologs (BUSCO) software v2/v3. Functional annotation of the assembled transcriptomes was conducted with TransDecoder v3.0.1 and the web-based platforms Kyoto Encyclopedia of Genes and Genomes (KEGG) Automatic Annotation Server (KAAS) and FunctionAnnotator.
The raw reads were deposited into NCBI and are accessible via BioProject accession number PRJNA628966 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA628966) and Sequence Read Archive (SRA) with accession numbers: SRX8295665 (https://www.ncbi.nlm.nih.gov/sra/SRX8295665), SRX8295666 (https://www.ncbi.nlm.nih.gov/sra/SRX8295666), SRX8295667 (https://www.ncbi.nlm.nih.gov/sra/SRX8295667), SRX8295668 (https://www.ncbi.nlm.nih.gov/sra/SRX8295668), SRX8295669 (https://www.ncbi.nlm.nih.gov/sra/SRX8295669), and SRX8295670 (https://www.ncbi.nlm.nih.gov/sra/SRX8295670). Additionally, transcriptome shotgun assembly sequences and functional annotations are available via Discover Mendeley Data (https://data.mendeley.com/datasets/47wdjmw9xr/1).
Project description:The rhesus macaque (Macaca mulatta) is a key species for advancing biomedical research. Like all draft mammalian genomes, the draft rhesus assembly (rheMac2) has gaps, sequencing errors and misassemblies that have prevented automated annotation pipelines from functioning correctly. Another rhesus macaque assembly, CR_1.0, is also available but is substantially more fragmented than rheMac2 with smaller contigs and scaffolds. Annotations for these two assemblies are limited in completeness and accuracy. High quality assembly and annotation files are required for a wide range of studies including expression, genetic and evolutionary analyses. We report a new de novo assembly of the rhesus macaque genome (MacaM) that incorporates both the original Sanger sequences used to assemble rheMac2 and new Illumina sequences from the same animal. MacaM has a weighted average (N50) contig size of 64 kilobases, more than twice the size of the rheMac2 assembly and almost five times the size of the CR_1.0 assembly. The MacaM chromosome assembly incorporates information from previously unutilized mapping data and preliminary annotation of scaffolds. Independent assessment of the assemblies using Ion Torrent read alignments indicates that MacaM is more complete and accurate than rheMac2 and CR_1.0. We assembled messenger RNA sequences from several rhesus tissues into transcripts which allowed us to identify a total of 11,712 complete proteins representing 9,524 distinct genes. Using a combination of our assembled rhesus macaque transcripts and human transcripts, we annotated 18,757 transcripts and 16,050 genes with complete coding sequences in the MacaM assembly. Further, we demonstrate that the new annotations provide greatly improved accuracy as compared to the current annotations of rheMac2.
Finally, we show that the MacaM genome provides an accurate resource for alignment of reads produced by RNA sequence expression studies. The MacaM assembly and annotation files provide a substantially more complete and accurate representation of the rhesus macaque genome than rheMac2 or CR_1.0 and will serve as an important resource for investigators conducting next-generation sequencing studies with nonhuman primates. This article was reviewed by Dr. Lutz Walter, Dr. Soojin Yi and Dr. Kateryna Makova.
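The N50 statistic quoted for the MacaM contigs, and for several of the transcriptome assemblies listed here, can be illustrated with a minimal sketch. It assumes the common definition: the length L such that contigs of length L or longer together hold at least half of the total assembled bases. The function name `n50` and the toy lengths are hypothetical and are not taken from any of the studies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or transcript lengths.

    Assumed definition: the largest length L such that sequences of
    length >= L together contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly: five contigs, 1,500 bases in total.
toy_lengths = [100, 200, 300, 400, 500]
print(n50(toy_lengths))  # 500 + 400 = 900 >= 750, so this prints 400
```

Because the longest sequences are counted first, N50 depends on how the assembled bases are distributed across contigs, which is why it is reported alongside contig counts and length ranges when assemblies are compared.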
Project description:Despite the ever-increasing output of next-generation sequencing data along with developing assemblers, dozens to hundreds of gaps still exist in de novo microbial assemblies due to uneven coverage and large genomic repeats. Third-generation single-molecule, real-time (SMRT) sequencing technology avoids amplification artifacts and generates kilobase-long reads with the potential to complete microbial genome assembly. However, due to the low accuracy (~85%) of third-generation sequences, a considerable amount of long reads (>50X) is required for self-correction and for subsequent de novo assembly. Recently developed hybrid approaches, using next-generation sequencing data and as few as 5X long reads, have been proposed to improve the completeness of microbial assembly. In this study we have evaluated the contemporary hybrid approaches and demonstrated that assembling corrected long reads (by runCA) produced the best assembly compared to long-read scaffolding (e.g., AHA, Cerulean and SSPACE-LongRead) and gap-filling (SPAdes). For generating corrected long reads, we further examined long-read correction tools, such as ECTools, LSC, LoRDEC, PBcR pipeline and proovread. We have demonstrated that three microbial genomes including Escherichia coli K12 MG1655, Meiothermus ruber DSM1279 and Pedobacter heparinus DSM2366 were successfully hybrid assembled by runCA into near-perfect assemblies using ECTools-corrected long reads. In addition, we developed a tool, Patch, which takes corrected long reads and pre-assembled contigs as inputs, to enhance microbial genome assemblies. With the additional 20X long reads, short reads of S. cerevisiae W303 were hybrid assembled into 115 contigs using the verified strategy, ECTools + runCA. Patch was subsequently applied to upgrade the assembly to a 35-contig draft genome.
Our evaluation of the hybrid approaches shows that assembling the ECTools-corrected long reads via runCA generates near complete microbial genomes, suggesting that genome assembly could benefit from re-analyzing the available hybrid datasets that were not assembled in an optimal fashion.