The raw iTRAQ proteomics data of Entamoeba histolytica
ABSTRACT: The raw proteomics data for the manuscript “Single-cell RNA-sequence reveals that the switching of the transcriptional profiles of cysteine-related genes alter the virulence of Entamoeba histolytica”.
Project description:BACKGROUND: Recent advances in single-cell RNA sequencing have allowed researchers to explore transcriptional function at a cellular level. In particular, single-cell RNA sequencing reveals that there exist clusters of cells with similar gene expression profiles, representing different transcriptional states. RESULTS: In this study, we present SCPPIN, a method for integrating single-cell RNA sequencing data with protein-protein interaction networks that detects active modules in cells of different transcriptional states. We achieve this by clustering RNA-sequencing data, identifying differentially expressed genes, constructing node-weighted protein-protein interaction networks, and finding the maximum-weight connected subgraphs with an exact Steiner-tree approach. As case studies, we investigate two RNA-sequencing data sets from human liver spheroids and human adipose tissue, respectively. With SCPPIN we expand the output of differentially expressed gene analysis with information from protein interactions. We find that different transcriptional states have different subnetworks of the protein-protein interaction networks significantly enriched, which represent biological pathways. In these pathways, SCPPIN identifies proteins that are not differentially expressed but have a crucial biological function (e.g., as receptors) and therefore reveals biology beyond a standard differentially expressed gene analysis. CONCLUSIONS: The introduced SCPPIN method can be used to systematically analyse differentially expressed genes in single-cell RNA sequencing data by integrating it with protein interaction data. The detected modules that characterise each cluster help to identify and hypothesise a biological function associated with those cells. Our analysis suggests the participation of unexpected proteins in these pathways that are undetectable from the single-cell RNA sequencing data alone. The techniques described here are applicable to other organisms and tissues.
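The idea of a maximum-weight connected subgraph in a node-weighted interaction network can be illustrated on a toy example. The sketch below is not the SCPPIN implementation: it replaces the exact Steiner-tree solver with brute-force enumeration (only feasible for a handful of nodes), and the network, p-values, scoring rule and `THRESHOLD` are all hypothetical assumptions chosen for illustration.

```python
from itertools import combinations
from math import log10

# Hypothetical toy network (a simple chain); not real interaction data.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
# Hypothetical differential-expression p-values per gene.
p_values = {"A": 1e-6, "B": 0.5, "C": 1e-5, "D": 0.9, "E": 0.8}
# Assumed cutoff: genes with p < 0.01 receive a positive node score.
THRESHOLD = 2.0


def node_scores(p_values, threshold):
    """Simplified scoring (an assumption): -log10(p) minus a fixed threshold."""
    return {gene: -log10(p) - threshold for gene, p in p_values.items()}


def is_connected(nodes, adjacency):
    """Depth-first search restricted to the candidate node set."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for neighbour in adjacency[current]:
            if neighbour in nodes and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == nodes


def max_weight_connected_subgraph(edges, scores):
    """Enumerate every connected node subset and keep the highest-scoring one."""
    adjacency = {node: set() for node in scores}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    best_nodes, best_score = (), float("-inf")
    for size in range(1, len(scores) + 1):
        for subset in combinations(sorted(scores), size):
            if is_connected(subset, adjacency):
                total = sum(scores[n] for n in subset)
                if total > best_score:
                    best_nodes, best_score = subset, total
    return set(best_nodes), best_score


scores = node_scores(p_values, THRESHOLD)
module, module_score = max_weight_connected_subgraph(edges, scores)
print(sorted(module), round(module_score, 3))
```

In this toy network the best module contains gene B, which has a negative score (it is not differentially expressed) but connects the two high-scoring genes A and C. This mirrors the abstract's point that a network-based analysis can surface proteins that a differential expression analysis alone would miss.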
Project description:Invasive ductal carcinoma is the most common type of breast cancer. Here, we provide a whole transcriptome shotgun sequencing (RNA-seq) dataset generated from ten samples of invasive ductal carcinoma tissue and three samples of adjacent normal tissue from a single Korean breast cancer patient (luminal B subtype). Differentially expressed genes (DEGs) were identified at a false discovery rate (FDR)-adjusted p-value threshold of 0.05. Gene ontology analysis identified several key pathways, including lymphocyte activation. A list of differentially expressed genes is provided. The raw data were uploaded to the Sequence Read Archive (SRA) database under BioProject ID PRJNA432903.
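As an illustration of what an FDR-adjusted p-value threshold means in practice, the following sketch applies the Benjamini-Hochberg procedure to a few made-up p-values. The abstract does not state which adjustment method was used or whether the comparison is strict, so both the choice of Benjamini-Hochberg and the `< 0.05` comparison are assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five genes (not taken from the dataset).
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted = benjamini_hochberg(raw)
significant = [q < 0.05 for q in adjusted]
print(adjusted)
print(significant)
```

Two of the made-up genes have raw p-values below 0.05 but adjusted values above it, which is why a list of DEGs selected on adjusted p-values is typically shorter than one selected on raw p-values.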
Project description:OBJECTIVE: The data presented herein are the raw genotype data from a recently conducted larger study that investigated the association of single nucleotide polymorphisms (SNPs) in breast cancer related genes with the risk and clinicopathological profiles of sporadic breast cancer among Sri Lankan women. A case-control study design was adopted to conduct SNP marker disease association testing in an existing blood resource obtained from a cohort of Sri Lankan postmenopausal women with clinically phenotyped sporadic breast cancer and healthy postmenopausal women. The list of haplotype-tagging SNP markers for genotyping was selected based on information available in the published literature and the use of bioinformatics tools and databases. Genotyping of 57 selected SNPs in 36 breast cancer related genes was performed using the iPLEX Sequenom Mass-Array platform. DATA DESCRIPTION: The raw genotype data for the 57 SNPs genotyped in 350 women with breast cancer and 350 healthy women are presented in this article. These data may be relevant to other researchers investigating the role of SNPs in breast cancer related genes in the risk of sporadic breast cancer in South Asian populations.
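For readers unfamiliar with case-control SNP association testing, a minimal sketch of an allelic test on a 2x2 table follows. The abstract does not describe the statistical test used in the study, so the allelic chi-square test is an assumption, and the allele counts are invented; they are only sized to match 350 cases and 350 controls (700 alleles per group).

```python
from math import erfc, sqrt


def allelic_association(case_minor, case_major, control_minor, control_major):
    """Odds ratio and 1-df chi-square test for a 2x2 allele count table."""
    a, b, c, d = case_minor, case_major, control_minor, control_major
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square distribution with one degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value


# Hypothetical allele counts for one SNP (not taken from the dataset).
odds_ratio, chi2, p_value = allelic_association(210, 490, 175, 525)
print(round(odds_ratio, 3), round(chi2, 3), round(p_value, 4))
```

With 57 SNPs tested, a real analysis would also need a multiple-testing correction, which this sketch leaves out.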
Project description:We present an R-based pipeline, ArrayExpressHTS, for pre-processing, expression estimation and data quality assessment of high-throughput sequencing transcriptional profiling (RNA-seq) datasets. The pipeline starts from raw sequence files and produces standard Bioconductor R objects containing gene or transcript measurements for downstream analysis, along with web reports for data quality assessment. It may be run locally on a user's own computer or remotely on a distributed R-cloud farm at the European Bioinformatics Institute. It can be used to analyse a user's own datasets or public RNA-seq datasets from the ArrayExpress Archive. The R package is available at www.ebi.ac.uk/tools/rcloud with online documentation at www.ebi.ac.uk/Tools/rwiki/, also available as supplementary material.
Project description:BACKGROUND: Genetically identical populations of cells grown in the same environmental condition show substantial variability in gene expression profiles. Although single-cell RNA-seq provides an opportunity to explore this phenomenon, statistical methods need to be developed to interpret the variability of gene expression counts. RESULTS: We develop a statistical framework for studying the kinetics of stochastic gene expression from single-cell RNA-seq data. By applying our model to a single-cell RNA-seq dataset generated by profiling mouse embryonic stem cells, we find that the inferred kinetic parameters are consistent with RNA polymerase II binding and chromatin modifications. Our results suggest that histone modifications affect transcriptional bursting by modulating both burst size and frequency. Furthermore, we show that our model can be used to identify genes with slow promoter kinetics, which are important for probabilistic differentiation of embryonic stem cells. CONCLUSIONS: We conclude that the proposed statistical model provides a flexible and efficient way to investigate the kinetics of transcription.
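To make the notions of burst size and burst frequency concrete, here is a minimal moment-based sketch. It is not the statistical framework developed in the study, which the abstract does not specify. It assumes the simplified bursty limit in which steady-state mRNA counts follow a negative binomial distribution, so that mean = frequency × size and variance = mean × (1 + size), with frequency expressed per mRNA lifetime. The counts are invented.

```python
from statistics import mean, variance


def burst_parameters(counts):
    """Moment estimates of burst size and frequency under a bursty (negative
    binomial) approximation; frequency is in units of the mRNA degradation rate."""
    m = mean(counts)
    v = variance(counts)  # sample variance
    if v <= m:
        # No over-dispersion relative to Poisson: the bursty model does not apply.
        raise ValueError("variance must exceed the mean for a bursty model")
    burst_size = v / m - 1          # Fano factor minus one
    burst_frequency = m / burst_size
    return burst_size, burst_frequency


# Hypothetical mRNA counts for one gene across six cells (not real data).
counts = [0, 0, 0, 8, 0, 16]
burst_size, burst_frequency = burst_parameters(counts)
print(round(burst_size, 2), round(burst_frequency, 3))
```

A gene with many zero counts and a few large ones, as in this toy vector, yields a large burst size and a low burst frequency. Real single-cell data would additionally require handling technical noise and dropout, which this sketch ignores.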
Project description:Although the privacy issues in human genomic studies are well known, the privacy risks in clinical proteomic data have not been thoroughly studied. As a proof of concept, we report a comprehensive analysis of the privacy risks in clinical proteomic data. It shows that a small number of peptides carrying the minor alleles (referred to as the minor allelic peptides) at non-synonymous single nucleotide polymorphism (nsSNP) sites can be identified in typical clinical proteomic datasets acquired from the blood/serum samples of individual patients, from which the patient can be identified with high confidence. Our results suggest the presence of significant privacy risks in raw clinical proteomic data. However, these risks can be mitigated by a straightforward pre-processing step that removes from the raw data the very small fraction (0.1%, 7.14 out of 7,504 spectra on average) of MS/MS spectra identified as minor allelic peptides, which has little or no impact on the subsequent analysis (and re-use) of these datasets.
Project description:Stemformatics is an established gene expression data portal containing over 420 public gene expression datasets derived from microarray, RNA sequencing and single cell profiling technologies. Developed for the stem cell community, it has a major focus on pluripotency, tissue stem cells, and staged differentiation. Stemformatics includes curated 'collections' of data relevant to cell reprogramming, as well as hematopoiesis and leukaemia. Rather than simply rehosting datasets as they appear in public repositories, Stemformatics uses a stringent set of quality control metrics and its own pipelines to process handpicked datasets from raw files. This means that about 30% of datasets processed by Stemformatics fail the quality control metrics and never make it to the portal, ensuring that Stemformatics data are of high quality and have been processed in a consistent manner. Stemformatics provides easy-to-use and intuitive tools for biologists to visually explore the data, including interactive gene expression profiles, principal component analysis plots and hierarchical clusters, among others. The addition of tools that facilitate cross-dataset comparisons provides users with snapshots of gene expression in multiple cells and tissues, assisting the identification of cell-type restricted genes or potential housekeeping genes. Stemformatics is freely available at stemformatics.org.
Project description:Autophagy contributes to reorganizing intracellular components and forming fat droplets during adipocyte differentiation. Here, we systematically describe the role of autophagy-related genes and gene sets during the differentiation of adipocytes. We used a public dataset from the European Nucleotide Archive from an RNA-seq experiment in which 3T3-L1 cells were induced with a differentiation induction medium, and total RNA was extracted and sequenced at four different time points. Raw reads were aligned to the UCSC mouse reference genome (mm10) using HISAT2, and aligned reads were summarized at the gene or exon level using HTSeq. DESeq2 and DEXSeq were used to model the gene and exon counts and to test for differential expression and relative exon usage, respectively. After applying the appropriate transformation, gene counts were used to perform the gene set and pathway enrichment analysis. Data were obtained, processed and annotated using R and Bioconductor. Several autophagy-related genes and autophagy gene sets, as defined in the Gene Ontology, were actively regulated during the course of adipocyte differentiation. We further characterized these gene sets by clustering their members into a few distinct temporal profiles. Other potentially functionally related genes were identified using a machine learning procedure. In summary, we grouped the autophagy gene sets and their members into biologically meaningful groups and selected a number of genes as functionally related based on their expression patterns, suggesting that autophagy plays a critical role in the removal of some intracellular components and the supply of energy sources for lipid biogenesis during adipogenesis.
Project description:CXCR5 (C-X-C motif chemokine receptor 5) is a transmembrane chemokine receptor that acts via its ligand CXCL13 and plays a crucial role in controlling the trafficking of inflammatory cells into and from the sub-retinal space, which contributes to the pathogenesis of age-related macular degeneration (AMD). We have previously described that genetic ablation of CXCR5 causes RPE/choroid abnormalities and retinal degeneration (RD) in aged mice. Here we report the transcriptome data (RNA-Seq) of 24-month-old CXCR5 knockout (KO) mice and age-matched C57BL/6 wild-type (WT) controls. RNA sequencing was performed on the Illumina HiSeq 2500, providing up to 300 GB of sequence information per flow cell. The quality of the RNA-seq libraries and the RNA intensity were validated with an Agilent Technologies Bioanalyzer 2100. The raw datasets contain on average 29,200,459 reads (28,486,243 reads after trimming) in retina samples and 27,252,790 reads (26,617,311 reads after trimming) in choroid samples. The mapped reads showed that a total of 1586 genes in retina and 1462 genes in choroid are differentially expressed in this experiment. The raw datasets were deposited in the NCBI Sequence Read Archive (SRA) database and can be accessed via accession number PRJNA588421.
Project description:Lentinula edodes is one of the most popular edible mushrooms in the world and contains useful medicinal components such as lentinan. The whole-genome sequence of L. edodes has been determined with the objective of discovering candidate genes associated with agronomic traits, but experimental verification of gene models with correction of gene prediction errors is lacking. To improve the accuracy of gene prediction, we produced 12.6 Gb of long-read transcriptome data of variable lengths using PacBio single-molecule real-time (SMRT) sequencing and generated 36,946 transcript clusters with an average length of 2.2 kb. Evidence-driven gene prediction on the basis of long- and short-read RNA sequencing data was performed; a total of 16,610 protein-coding genes were predicted with error correction. Of the predicted genes, 42.2% were verified to be covered by full-length transcript clusters. The raw reads have been deposited in the NCBI SRA database under accession number PRJNA396788.