ABSTRACT: Our modified PANDAseq (Assembler) performs a modified statistical analysis using the sequencer-supplied quality (Q) scores to find the most likely overlap, computes assembled Q scores for the read overlap region, and handles more complex overlap layouts.
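To illustrate how quality scores can drive overlap selection, the following is a minimal sketch, not the modified PANDAseq implementation. It relies on the standard Phred definition (error probability = 10^(-Q/10)). Everything else is an assumption made for illustration: the function names, the uniform-error model, and the log-odds score against a random-base null are hypothetical choices, and the reverse read is assumed to be already reverse-complemented.

```python
import math

def phred_to_error(q):
    """Standard Phred conversion: quality score -> base-call error probability."""
    return 10 ** (-q / 10)

def agree_probability(e1, e2):
    """Probability that two reads of the same true base show the same call.

    Assumption: errors are uniform over the three alternative bases, so the
    reads agree if both are right, or both are wrong in the same way.
    """
    return (1 - e1) * (1 - e2) + e1 * e2 / 3

def overlap_log_odds(fwd, fwd_q, rev, rev_q, k):
    """Log-odds that the last k bases of fwd overlap the first k bases of rev,
    versus the hypothetical null that the two positions are unrelated
    (agreement by chance with probability 0.25)."""
    offset = len(fwd) - k
    score = 0.0
    for i in range(k):
        p = agree_probability(phred_to_error(fwd_q[offset + i]),
                              phred_to_error(rev_q[i]))
        if fwd[offset + i] == rev[i]:
            score += math.log(p / 0.25)
        else:
            score += math.log((1 - p) / 0.75)
    return score

def best_overlap(fwd, fwd_q, rev, rev_q, min_overlap=1):
    """Return the overlap length with the highest log-odds score."""
    candidates = range(min_overlap, min(len(fwd), len(rev)) + 1)
    return max(candidates,
               key=lambda k: overlap_log_odds(fwd, fwd_q, rev, rev_q, k))

# Toy reads with uniform Q30; rev is assumed already reverse-complemented.
fwd, rev = "ACGTACGTTT", "CGTTTGGA"
fwd_q, rev_q = [30] * len(fwd), [30] * len(rev)
print(best_overlap(fwd, fwd_q, rev, rev_q))
```

Scoring against a null model, rather than summing raw match probabilities, keeps short chance overlaps from outscoring longer true ones.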
Project description:Here we describe a custom FMDV microarray and a companion feature- and template-assisted assembler software (FAT-assembler) capable of resolving virus genome sequence using a moderate number of conserved microarray features. The results demonstrate that this approach may be used to rapidly characterize naturally occurring FMDV as well as an engineered chimeric strain of FMDV. The FAT-assembler, while applied here to resolving FMDV genomes, represents a new bioinformatics approach that should be broadly applicable to interpreting microarray genotyping data for other viruses or target organisms.
Project description:Structure probing coupled with high-throughput sequencing holds the potential to revolutionize our understanding of the role of RNA structure in regulation of gene expression. Despite major technological advances, intrinsic noise and high coverage requirements greatly limit the applicability of these techniques. Here we describe a probabilistic modeling pipeline which accounts for biological variability and biases in the data, yielding statistically interpretable scores for the probability of nucleotide modification transcriptome-wide. We demonstrate on two yeast data sets that our method has greatly increased sensitivity, enabling the identification of modified regions on many more transcripts compared with existing pipelines. It also provides confident predictions at much lower coverage levels than previously reported. Our results show that statistical modeling greatly extends the scope and potential of transcriptome-wide structure probing experiments.
Project description:Epithelial cells were isolated by FACS from the mammary glands of pubescent (5-week-old), estrus adult (10-week-old) and diestrus adult (10-week-old) female mice. Freshly sorted cells were submitted to a 10X Genomics Chromium System for single-cell capture. cDNA synthesis and library preparation were performed according to the protocol supplied by the manufacturer. Sequencing was carried out on an Illumina NextSeq500 sequencer using parameters recommended by 10X Genomics.
Project description:Epithelial cells were isolated by FACS from the mammary glands of adult (10-week-old) female mice. A basal subpopulation of the epithelial cells was also isolated. Freshly sorted cells were submitted to a 10X Genomics Chromium System for single-cell capture. cDNA synthesis and library preparation were performed according to the protocol supplied by the manufacturer. Sequencing was carried out on an Illumina NextSeq500 sequencer to achieve 75 bp paired-end reads.
Project description:Proteogenomics methods have identified many non-annotated protein-coding genes in the human genome. Many of the newly discovered protein-coding genes encode peptides and small proteins, referred to collectively as microproteins. Microproteins are produced through ribosome translation of small open reading frames (smORFs). The discovery of many smORFs reveals a blind spot in traditional gene-finding algorithms for these genes. Biological studies have found roles for microproteins in cell biology and physiology, and the possibility that additional bioactive microproteins exist drives interest in the detection and discovery of these molecules. A key step in any proteogenomics workflow is the assembly of RNA-Seq data into likely mRNA transcripts that are then used to create a searchable protein database. Here we demonstrate that specific features of the assembled transcriptome impact microprotein detection by shotgun proteomics. By tailoring transcript assembly for downstream mass spectrometry searching, we show that we can detect more than double the number of high-quality microprotein candidates, and we introduce a novel open-source mRNA assembler for proteogenomics (MAPS) that incorporates all of these features. By integrating our specialized assembler, MAPS, and a popular generalized assembler into our proteogenomics pipeline, we detect 45 novel human microproteins from a high-quality proteogenomics dataset of a human cell line. We then characterize the features of the novel microproteins, identifying two classes of microproteins. Our work highlights the importance of specialized transcriptome assembly upstream of proteomics validation when searching for short and potentially rare and poorly conserved proteins.
Project description:Phosphoproteomics methods are commonly employed in labs to identify and quantify the sites of phosphorylation on proteins. In recent years, various software tools have been developed that incorporate scores or statistics indicating whether a given phosphosite has been correctly identified, or that estimate the global false localisation rate (FLR) within a given data set for all sites reported. These scores have generally been calibrated using synthetic data sets, and their statistical reliability on real data sets is largely unknown. As a result, there is a considerable problem in the field with the reporting of incorrectly localised phosphosites, due to inadequate statistical control. In this work, we develop the concept of scoring and ranking modifications on a decoy amino acid, i.e. one that cannot be modified, to allow for independent estimation of global FLR. We test a variety of different amino acids to act as the decoy, on both synthetic and real data sets, demonstrating that the amino acid selection can make a substantial difference to the estimated global FLR. We conclude that while several different amino acids might be appropriate, the most reliable FLR results were achieved using alanine and leucine as decoys, although we have a preference for alanine due to the risk of potential confusion between leucine and isoleucine amino acids. We propose that the phosphoproteomics field should adopt the use of a decoy amino acid, so that there is better control of false reporting in the literature, and in public databases that re-distribute the data.
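The decoy amino acid idea can be sketched in a few lines, by analogy with target-decoy false discovery rate estimation. This is an illustrative sketch only, not the estimator used in the study: the function name, the input layout, and the `target_to_decoy_ratio` scaling factor (meant to correct for how often candidate target residues occur relative to the decoy residue) are all hypothetical assumptions.

```python
def running_decoy_flr(sites, decoy="A", target_to_decoy_ratio=1.0):
    """Estimate a running global FLR down a score-ranked list of sites.

    sites: iterable of (score, residue) pairs, one per localised site.
    decoy: residue that cannot carry the modification (alanine here).
    target_to_decoy_ratio: hypothetical scaling factor; each decoy hit is
        assumed to imply this many false localisations on target residues.

    Returns a list of (score, residue, flr) tuples, best score first.
    """
    ranked = sorted(sites, key=lambda s: s[0], reverse=True)
    decoys = 0
    targets = 0
    results = []
    for score, residue in ranked:
        if residue == decoy:
            decoys += 1
        else:
            targets += 1
        estimated_false = decoys * target_to_decoy_ratio
        if targets == 0:
            # No target sites yet: the rate is undefined, so report the
            # worst case if any decoy has been seen.
            flr = 1.0 if decoys else 0.0
        else:
            flr = min(1.0, estimated_false / targets)
        results.append((score, residue, flr))
    return results

# Toy input: scores and residues are made up for illustration.
example = [(0.99, "S"), (0.95, "T"), (0.90, "A"), (0.80, "S"), (0.70, "Y")]
for row in running_decoy_flr(example, decoy="A", target_to_decoy_ratio=2.0):
    print(row)
```

Because the decoy residue can never be genuinely modified, every site assigned to it is a known false localisation, which is what lets the estimate stay independent of the localisation score itself.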
Project description:Background: The clinical and pathologic diversity of systemic lupus erythematosus (SLE) has hindered diagnosis, management, and treatment development. This study clustered adult SLE patients through comprehensive molecular phenotyping to improve distinctions with prognostic and therapeutic relevance. Methods: Plasma, serum, and RNA were collected from 198 adult SLE patients. Disease activity was scored by modified SELENA-SLEDAI. Twenty-nine co-expression module scores were calculated from microarray gene-expression data. Plasma soluble mediators (n=23) and autoantibodies (n=13) were assessed by multiplex bead-based assays and ELISAs. Phenotypic patient clusters were identified by machine learning combining K-means clustering and random forest analysis of co-expression module scores and soluble mediators. Findings: SLEDAI scores correlated strongly with interferon module scores, more modestly with plasma cell and select cell cycle modules, and with circulating IFNα, IL21, IL1α, IL17A, IP10, and MIG levels. Co-expression modules and soluble mediators differentiated seven clusters of SLE patients with unique molecular phenotypes. Inflammation and interferon modules were elevated in Clusters 1 (moderately) and 4 (strongly), with decreased T cell modules in Cluster 4. The other clusters differed in monocyte, neutrophil, plasmablast, B cell, and T cell modules. Clusters 1 and 4 had higher SLEDAI scores, and more frequent anti-dsDNA, low complement, and renal activity. These features were also prominent in Cluster 3, which lacked the interferon and inflammation signatures. Arthritis and rashes were common in all clusters. Interpretation: Molecular profiles can distinguish SLE subsets. Prospective longitudinal studies of these profiles may help to improve prognostic evaluation, clinical trial design, and precision medicine approaches.
Project description:PRDM9 is a histone methyltransferase expressed in meiotic germ cells that determines the location of genetic recombination hotspots through binding of its allele-specific DNA-binding domain. Here we characterize the genome-wide chromatin modification for two human PRDM9 alleles (A and C) in human cell lines. HEK293 cells were transfected with both alleles and an empty vector control. The resulting chromatin was subjected to H3K4me3 ChIP followed by high-throughput sequencing. We find that the different PRDM9 alleles modify chromatin in largely different genomic regions in somatic cells, as determined by the protein's zinc-finger DNA-binding domains. Many of the allele-specific peaks overlap sites of meiotic double-strand breaks found in vivo in human germ cells, suggesting that transient expression of PRDM9 in somatic cells can reflect binding in vivo. The aim was to identify PRDM9-dependent H3K4me3 sites by comparing modified chromatin after expression of different human PRDM9 alleles in HEK293 cells.
Project description:Clear cell renal cell carcinoma is the most common type of renal cancer and forms highly vascularized tumors. Here we examined gene expression at different stages of tumor progression to find which genes change significantly in expression with increasing grade.