Novel Data Transformations for RNA-seq Differential Expression Analysis.
ABSTRACT: We propose eight data transformations (r, r2, rv, rv2, l, l2, lv, and lv2) for RNA-seq data analysis that aim to make the transformed sample mean representative of the distribution center, since it is not always possible to transform count data to satisfy the normality assumption. Simulation studies showed that for data sets with small (e.g., nCases = nControls = 3) or large sample sizes (e.g., nCases = nControls = 100), limma based on data from the l, l2, and r2 transformations performed better than limma based on data from the voom transformation in terms of accuracy, FDR, and FNR. For data sets with moderate sample sizes (e.g., nCases = nControls = 30 or 50), limma with the rv and rv2 transformations performed similarly to limma with the voom transformation. Real data analysis results are consistent with the simulation results: limma with the r, l, r2, and l2 transformations performed better than limma with the voom transformation when sample sizes are small or large; limma with the rv and rv2 transformations performed similarly to limma with the voom transformation when sample sizes are moderate. We also observed from our data analyses that for data sets with large sample sizes, gene selection via the Wilcoxon rank sum test (a non-parametric two-sample test) based on the raw data outperformed limma based on the transformed data.
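The three evaluation criteria named in the abstract can be computed from a vector of true differential-expression labels and a vector of method calls. The sketch below assumes the usual confusion-matrix definitions, FDR = FP / (FP + TP) and FNR = FN / (FN + TP); the abstract does not spell out its own definitions, and the function name `selection_metrics` is hypothetical.

```python
def selection_metrics(truth, called):
    """Return (accuracy, FDR, FNR) for per-gene boolean labels.

    truth[i]  -- True if gene i is truly differentially expressed
    called[i] -- True if the method selected gene i
    The definitions used here are assumed, not taken from the paper.
    """
    if len(truth) != len(called):
        raise ValueError("truth and called must have the same length")
    pairs = list(zip(truth, called))
    tp = sum(1 for t, c in pairs if t and c)          # true positives
    fp = sum(1 for t, c in pairs if not t and c)      # false positives
    fn = sum(1 for t, c in pairs if t and not c)      # false negatives
    tn = sum(1 for t, c in pairs if not t and not c)  # true negatives
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    fdr = fp / (fp + tp) if (fp + tp) else 0.0  # no genes called: report 0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # no true DE genes: report 0
    return accuracy, fdr, fnr
```

In a simulation study the `truth` vector is known by construction, so these quantities can be averaged over simulated data sets for each transformation.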
Project description:Horses belong to the order Perissodactyla and bear the majority of their weight on their third toe; therefore, tremendous force is applied to each hoof. An inherited disease characterized by a phenotype restricted to the dorsal hoof wall was identified in the Connemara pony. Hoof wall separation disease (HWSD) manifests clinically as separation of the dorsal hoof wall along the weight-bearing surface of the hoof during the first year of life. Parents of affected ponies appeared clinically normal, suggesting an autosomal recessive mode of inheritance. A case-control allelic genome-wide association analysis was performed (ncases = 15, ncontrols = 24). Population stratification (λ = 1.48) was successfully improved by removing outliers (ncontrols = 7) identified on a multidimensional scaling plot. A genome-wide significant association was detected on chromosome 8 (praw = 1.37 × 10^-10, pgenome = 1.92 × 10^-5). A homozygous region identified in affected ponies spanned 79,936,024-81,676,900 bp and contained a family of 13 annotated SERPINB genes. Whole genome next-generation sequencing at 6x coverage of two cases and two controls revealed 9,758 SNVs and 1,230 indels within the ~1.7-Mb haplotype, of which 17 and 5, respectively, segregated with the disease and were located within or adjacent to genes. Additional genotyping of these 22 putative functional variants in 369 Connemara ponies (ncases = 23, ncontrols = 346) and 169 horses of other breeds revealed segregation of three putative variants adjacent to or within four SERPIN genes. Two of the variants were non-coding and one was an insertion within SERPINB11 that introduced a frameshift resulting in a premature stop codon. Evaluation of mRNA levels at the proximal hoof capsule (ncases = 4, ncontrols = 4) revealed that SERPINB11 expression was significantly reduced in affected ponies (p<0.001). Carrier frequency was estimated at 14.8%.
This study describes the first genetic variant associated with a hoof wall specific phenotype and suggests a role of SERPINB11 in maintaining hoof wall structure.
Project description:BACKGROUND:There have been considerable recent advances in understanding the genetic architecture of anxiety disorders and posttraumatic stress disorder (PTSD), as well as the underlying neurocircuitry of these disorders. However, there is little work on the concordance of genetic variations that increase risk for these conditions and that influence subcortical brain structures. We undertook a genome-wide investigation of the overlap between the genetic influences from single nucleotide polymorphisms (SNPs) on volumes of subcortical brain structures and genetic risk for anxiety disorders and PTSD. METHOD:We obtained summary statistics of genome-wide association studies (GWAS) of anxiety disorders (Ncases = 7016, Ncontrols = 14,745), PTSD (European sample; Ncases = 2424, Ncontrols = 7113) and of subcortical brain structures (N = 13,171). SNP Effect Concordance Analysis (SECA) and Linkage Disequilibrium (LD) Score Regression were used to examine genetic pleiotropy, concordance, and genome-wide correlations, respectively. SECA's conditional false discovery was used to identify specific risk variants associated with anxiety disorders or PTSD when conditioning on brain-related traits. RESULTS:For anxiety disorders, we found evidence of significant concordance between increased anxiety risk variants and variants associated with smaller amygdala volume. Further, by conditioning on brain volume GWAS, we identified novel variants that associate with smaller brain volumes and increase risk for disorders: rs56242606 was found to increase risk for anxiety disorders, while two variants (rs6470292 and rs683250) increase risk for PTSD, when conditioning on the GWAS of putamen volume. LIMITATIONS:Despite using the largest available GWAS summary statistics, the analyses were limited by sample size.
CONCLUSIONS:These preliminary data indicate that there is genome wide concordance between genetic risk factors for anxiety disorders and those for smaller amygdala volume, which is consistent with research that supports the involvement of the amygdala in anxiety disorders. It is notable that a genetic variant that contributes to both reduced putamen volume and PTSD plays a key role in the glutamatergic system. Further work with GWAS summary statistics from larger samples, and a more extensive look at the genetics underlying brain circuits, is needed to fully delineate the genetic architecture of these disorders and their underlying neurocircuitry.
Project description:Osteoarthritis (OA) is a common complex disease with high public health burden and no curative therapy. High bone mineral density (BMD) is associated with an increased risk of developing OA, suggesting a shared underlying biology. Here, we performed the first systematic overlap analysis of OA and BMD on a genome-wide scale. We used summary statistics from the GEFOS consortium for lumbar spine (n = 31,800) and femoral neck (n = 32,961) BMD, and from the arcOGEN consortium for three OA phenotypes (hip, ncases = 3,498; knee, ncases = 3,266; hip and/or knee, ncases = 7,410; ncontrols = 11,009). Performing LD score regression we found a significant genetic correlation between the combined OA phenotype (hip and/or knee) and lumbar spine BMD (rg = 0.18, P = 2.23 × 10^-2), which may be driven by the presence of spinal osteophytes. We identified 143 variants with evidence for cross-phenotype association which we took forward for replication in independent large-scale OA datasets, and subsequent meta-analysis with arcOGEN for a total sample size of up to 23,425 cases and 236,814 controls. We found robustly replicating evidence for association with OA at rs12901071 (OR 1.08, 95% CI 1.05-1.11, Pmeta = 3.12 × 10^-10), an intronic variant in the SMAD3 gene, which is known to play a role in bone remodeling and cartilage maintenance. We were able to confirm expression of SMAD3 in intact and degraded cartilage of the knee and hip. Our findings provide the first systematic evaluation of pleiotropy between OA and BMD, highlight genes with biological relevance to both traits, and establish a robust new OA genetic risk locus at SMAD3.
Project description:New normal linear modeling strategies are presented for analyzing read counts from RNA-seq experiments. The voom method estimates the mean-variance relationship of the log-counts, generates a precision weight for each observation and enters these into the limma empirical Bayes analysis pipeline. This opens access for RNA-seq analysts to a large body of methodology developed for microarrays. Simulation studies show that voom performs as well as or better than count-based RNA-seq methods even when the data are generated according to the assumptions of the earlier methods. Two case studies illustrate the use of linear modeling and gene set testing methods.
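As a rough illustration of the first step described above, the sketch below converts a count matrix to log2 counts per million using the offsets from the voom publication (0.5 on counts, 1 on library size) and then produces one (mean, square-root standard deviation) point per gene, which is the raw material for a mean-variance trend. It is a simplification under stated assumptions: a single group of samples stands in for a fitted linear model, library sizes are plain column sums, and the smoothing of the trend and the conversion to precision weights are omitted. The function names are hypothetical and this is not the limma implementation.

```python
import math
import statistics


def log_cpm(counts):
    """Log2 counts per million with voom-style offsets.

    counts: list of genes, each a list of per-sample read counts.
    Library size is taken as the column sum (an assumption; normalised
    library sizes could be used instead).
    """
    n_samples = len(counts[0])
    lib_sizes = [sum(gene[j] for gene in counts) for j in range(n_samples)]
    return [
        [
            math.log2((gene[j] + 0.5) / (lib_sizes[j] + 1.0) * 1e6)
            for j in range(n_samples)
        ]
        for gene in counts
    ]


def mean_sqrt_sd_points(logcpm):
    """One (mean log-CPM, sqrt of sample sd) point per gene.

    With a single group, the sample sd plays the role of the residual
    sd from a linear model fit. Needs at least two samples per gene.
    """
    return [
        (statistics.mean(row), math.sqrt(statistics.stdev(row)))
        for row in logcpm
    ]
```

A smooth curve fitted through these points would predict a standard deviation for each observation, and the inverse of the predicted variance would serve as that observation's precision weight.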
Project description:The use of RNA-seq as the preferred method for the discovery and validation of small RNA biomarkers has been hindered by high quantitative variability and biased sequence counts. In this paper we develop a statistical model for sequence counts that accounts for ligase bias and stochastic variation in sequence counts. This model implies a linear quadratic relation between the mean and variance of sequence counts. Using a large number of sequencing datasets, we demonstrate how one can use the generalized additive models for location, scale and shape (GAMLSS) distributional regression framework to calculate and apply empirical correction factors for ligase bias. Bias correction could remove more than 40% of the bias for miRNAs. Empirical bias correction factors appear to be nearly constant over at least one and up to four orders of magnitude of total RNA input and independent of sample composition. Using synthetic mixes of known composition, we show that the GAMLSS approach can analyze differential expression with greater accuracy, higher sensitivity and specificity than six existing algorithms (DESeq2, edgeR, EBSeq, limma, DSS, voom) for the analysis of small RNA-seq data.
Project description:Spontaneous coronary artery dissection (SCAD) is a non-atherosclerotic cause of myocardial infarction (MI), typically in young women. We undertook a genome-wide association study of SCAD (Ncases = 270/Ncontrols = 5,263) and identified and replicated an association of rs12740679 at chromosome 1q21.2 (Pdiscovery+replication = 2.19 × 10^-12, OR = 1.8) influencing ADAMTSL4 expression. Meta-analysis of discovery and replication samples identified associations with P < 5 × 10^-8 at chromosome 6p24.1 in PHACTR1, chromosome 12q13.3 in LRP1, and, in females only, at chromosome 21q22.11 near LINC00310. A polygenic risk score for SCAD was associated with (1) higher risk of SCAD in individuals with fibromuscular dysplasia (P = 0.021, OR = 1.82 [95% CI: 1.09-3.02]) and (2) lower risk of atherosclerotic coronary artery disease and MI in the UK Biobank (P = 1.28 × 10^-17, HR = 0.91 [95% CI: 0.89-0.93], for MI) and Million Veteran Program (P = 9.33 × 10^-36, OR = 0.95 [95% CI: 0.94-0.96], for CAD; P = 3.35 × 10^-6, OR = 0.96 [95% CI: 0.95-0.98] for MI). Here we report that SCAD-related MI and atherosclerotic MI exist at opposite ends of a genetic risk spectrum, inciting MI with disparate underlying vascular biology.
Project description:The ability to easily and efficiently analyse RNA-sequencing data is a key strength of the Bioconductor project. Starting with counts summarised at the gene-level, a typical analysis involves pre-processing, exploratory data analysis, differential expression testing and pathway analysis with the results obtained informing future experiments and validation studies. In this workflow article, we analyse RNA-sequencing data from the mouse mammary gland, demonstrating use of the popular edgeR package to import, organise, filter and normalise the data, followed by the limma package with its voom method, linear modelling and empirical Bayes moderation to assess differential expression and perform gene set testing. This pipeline is further enhanced by the Glimma package which enables interactive exploration of the results so that individual samples and genes can be examined by the user. The complete analysis offered by these three packages highlights the ease with which researchers can turn the raw counts from an RNA-sequencing experiment into biological insights using Bioconductor.
Project description:Genome-wide association studies (GWAS) have identified SNPs in six genes that are associated with childhood acute lymphoblastic leukemia (ALL). A lead SNP was found to occur on chromosome 9p21.3, a region that is deleted in 30% of childhood ALLs, suggesting the presence of causal polymorphisms linked to ALL risk. We used SNP genotyping and imputation-based fine-mapping of a multiethnic ALL case-control population (Ncases = 1,464, Ncontrols = 3,279) to identify variants of large effect within 9p21.3. We identified a CDKN2A missense variant (rs3731249) with 2% allele frequency in controls that confers three-fold increased risk of ALL in children of European ancestry (OR, 2.99; P = 1.51 × 10^-9) and Hispanic children (OR, 2.77; P = 3.78 × 10^-4). Moreover, of 17 patients whose tumors displayed allelic imbalance at CDKN2A, 14 preferentially retained the risk allele and lost the protective allele (PBinomial = 0.006), suggesting that the risk allele provides a selective advantage during tumor growth. Notably, the CDKN2A variant was not significantly associated with melanoma, glioblastoma, or pancreatic cancer risk, implying that this polymorphism specifically confers ALL risk but not general cancer risk. Taken together, our findings demonstrate that coding polymorphisms of large effect can underlie GWAS "hits" and that inherited polymorphisms may undergo directional selection during clonal expansion of tumors.
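The reported PBinomial = 0.006 for 14 of 17 tumors retaining the risk allele is consistent with a one-sided exact binomial test under a null retention probability of 0.5. The choice of test and null value is an assumption made here for illustration; the abstract does not state them. The helper name `binomial_upper_tail` is hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(
        comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1)
    )


# 14 of 17 tumors retained the risk allele; null assumed to be p = 0.5.
p_value = binomial_upper_tail(14, 17)
print(f"{p_value:.4f}")  # prints 0.0064
```

The tail sums 680 + 136 + 17 + 1 = 834 outcomes out of 2^17 = 131,072, giving about 0.0064, which agrees with the reported 0.006 to the precision shown. A two-sided test would double this value.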
Project description:Non-syndromic mitral valve prolapse (MVP) is a common degenerative valvulopathy, predisposing to arrhythmia and sudden death. The etiology of MVP is suspected to be under genetic control, as supported by familial cases and its manifestation in genetic syndromes (e.g., Marfan syndrome). One candidate etiological mechanism is a perturbation of the extracellular matrix (ECM) remodeling of the valve. To test this hypothesis, we assessed the role of genetic variants in the matrix metalloproteinase 2 gene (MMP2), which is known to regulate ECM turnover by direct degradation of proteins and for which transgenic mice develop MVP. Direct sequencing of exons of MMP2 in 47 unrelated patients and segregation analyses in families did not reveal any causative mutation. We studied eight common single nucleotide polymorphisms (TagSNPs), which summarize the genetic information at the MMP2 locus. The association study in two case-control sets (NCases = 1073 and NControls = 1635) provided suggestive evidence for the association of rs1556888, located downstream of MMP2, with the risk of MVP, especially in patients with the fibroelastic deficiency form. Our study does not support the contribution of MMP2 rare variation to the etiology of MVP in humans, though further genetic and molecular investigation is required to confirm our current suggestive association of one common variant.
Project description:BACKGROUND:A recent genome-wide association study (GWAS) of autism spectrum disorder (ASD) (ncases = 18,381, ncontrols = 27,969) has provided novel opportunities for investigating the etiology of ASD. Here, we integrate the ASD GWAS summary statistics with summary-level gene expression data to infer differential gene expression in ASD, an approach called transcriptome-wide association study (TWAS). METHODS:Using FUSION software, ASD GWAS summary statistics were integrated with predictors of gene expression from 16 human datasets, including adult and fetal brains. A novel adaptation of established statistical methods was then used to test for enrichment within candidate pathways and specific tissues and at different stages of brain development. The proportion of ASD heritability explained by predicted expression of genes in the TWAS was estimated using stratified linkage disequilibrium score regression. RESULTS:This study identified 14 genes as significantly differentially expressed in ASD, 13 of which were outside of known genome-wide significant loci (±500 kb). XRN2, a gene proximal to an ASD GWAS locus, was inferred to be significantly upregulated in ASD, providing insight into the functional consequence of this associated locus. One novel transcriptome-wide significant association from this study is the downregulation of PDIA6, which showed minimal evidence of association in the GWAS, and in gene-based analysis using MAGMA. Predicted gene expression in this study accounted for 13.0% of the total ASD single nucleotide polymorphism heritability. CONCLUSIONS:This study has implicated several genes as significantly up/downregulated in ASD, providing novel and useful information for subsequent functional studies. This study also explores the utility of TWAS-based enrichment analysis and compares TWAS results with a functionally agnostic approach.