Project description: Motivation: For many classes of disease, the same genetic risk variants underlie many related phenotypes or disease subtypes. Multinomial logistic regression provides an attractive framework for analyzing multi-category phenotypes and exploring the genetic relationships between these phenotype categories. We introduce Trinculo, a program that implements a wide range of multinomial analyses in a single fast package designed to be easy to use for users of standard genome-wide association study software. Availability and implementation: An open source C implementation, with code and binaries for Linux and Mac OSX, is available for download at http://sourceforge.net/projects/trinculo. Supplementary information: Supplementary data are available at Bioinformatics online. Contact: lj4@well.ox.ac.uk.
Project description: To detect genetic association with common and complex diseases, many statistical tests have been proposed for candidate gene or genome-wide association studies with the case-control design. Due to linkage disequilibrium (LD), multi-marker association tests can gain power over single-marker tests with a Bonferroni multiple testing adjustment. Among the many existing multi-marker association tests, most aim to detect only one of the many possible aspects of distributional differences between the genotypes of cases and controls, such as allele frequency differences, while a few new ones target two or three aspects, all of which can be implemented in logistic regression. In contrast to logistic regression, a genomic distance-based regression (GDBR) approach aims to detect some high-order genotypic differences between cases and controls. A recent study has confirmed the high power of GDBR tests. At present, the popular logistic regression and the emerging GDBR approaches are treated as completely unrelated, so one has to choose between the two. In this article, we reformulate GDBR as logistic regression, opening an avenue to constructing other powerful tests while overcoming some limitations of GDBR. For example, asymptotic distributions can replace time-consuming permutations for deriving P-values, and covariates, including gene-gene interactions, can be easily incorporated. Importantly, this reformulation facilitates combining GDBR with other existing methods in a unified framework of logistic regression. In particular, we show that Fisher's P-value combining method can boost statistical power by incorporating information from allele frequencies, Hardy-Weinberg disequilibrium, LD patterns, and other higher-order interactions among multiple markers as captured by GDBR.
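As a rough illustration of the Fisher P-value combining step mentioned in the abstract above, the following is a minimal sketch of the textbook method, which assumes the component tests are independent. The function name `fisher_combined_pvalue` is hypothetical and not taken from the article, and how the article handles dependence between the component tests is not reproduced here. The statistic is X = -2 Σ ln p_i, referred to a chi-square distribution with 2k degrees of freedom, whose tail probability has a closed form for even degrees of freedom.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine k independent p-values with Fisher's method (hypothetical helper)."""
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    # half_stat is X / 2, where X = -2 * sum(log p_i) is Fisher's statistic.
    half_stat = -sum(math.log(p) for p in pvalues)
    # Chi-square survival function with 2k degrees of freedom:
    # P(X > x) = exp(-x / 2) * sum_{i=0}^{k-1} (x / 2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the closed form.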
Project description: We describe an efficient Bayesian parallel GPU implementation of two classic statistical models, the Lasso and multinomial logistic regression. We focus on parallelizing the key components: matrix multiplication, matrix inversion, and sampling from the full conditionals. Our GPU implementations of the Bayesian Lasso and multinomial logistic regression achieve 100-fold speedups on mid-level and high-end GPUs. Substantial speedups of 25-fold can also be achieved on older and lower-end GPUs. The samplers are implemented in OpenCL and can be used on any type of GPU and on other types of computational units, which makes them convenient and advantageous in practice compared to related work.
Project description: Background: Mixed linear models (MLM) have been widely used to account for population structure in case-control genome-wide association studies, with the case-control status analyzed as a quantitative phenotype. Chen et al. proved in 2016 that this method is inappropriate in some situations and proposed GMMAT, a score test for mixed logistic regression (MLR). However, this test does not produce an estimate of the variants' effects. We propose two computationally efficient methods to estimate the variants' effects. Their properties and those of other methods (MLM, logistic regression) are evaluated using both simulated data and real genomic data from a recent GWAS in two geographically close populations in West Africa. Results: We show that, when the disease prevalence differs between population strata, MLM is inappropriate for analyzing binary traits. MLR performs best in all circumstances. The variants' effects are well estimated by our methods, with a moderate bias when the effect sizes are large. Additionally, we propose a stratified QQ-plot, which improves the diagnosis of p-value inflation or deflation when population strata are not clearly identified in the sample. Conclusion: The two proposed methods are implemented in the R package milorGWAS, available on CRAN. Both methods scale up to at least 10,000 individuals. The same computational strategies could be applied to other models (e.g. the mixed Cox model for survival analysis).
Project description: Background: Accurate differential diagnosis of neurodegenerative parkinsonisms is challenging due to overlapping early symptoms and high rates of misdiagnosis. To improve the diagnostic accuracy, we developed an integrated classification algorithm using multinomial logistic regression and Scaled Subprofile Model/Principal Component Analysis (SSM/PCA) applied to 18F-fluorodeoxyglucose positron emission tomography (FDG-PET) brain images. In this novel classification approach, SSM/PCA is applied to FDG-PET brain images of patients with various parkinsonisms, which are compared against the constructed undetermined images. This process involves spatial normalization of the images and dimensionality reduction via PCA. The resulting principal components are then used in a multinomial logistic regression model, which generates disease-specific topographies that can be used to classify new patients. The algorithm was trained and optimized on a cohort of patients with neurodegenerative parkinsonisms and subsequently validated on a separate cohort of patients with parkinsonisms. Results: The Area Under the Curve (AUC) values were the highest for progressive supranuclear palsy (PSP) (AUC = 0.95), followed by Parkinson's disease (PD) (AUC = 0.93) and multiple system atrophy (MSA) (AUC = 0.90). When classifying the patients based on their calculated probability for each group, the desired tradeoff between sensitivity and specificity had to be selected. With a 99% probability threshold for classification into a disease group, 82% of PD patients, 29% of MSA patients, and 77% of PSP patients were correctly identified.
Only 5% of PD, 6% of MSA and 6% of PSP patients were misclassified, whereas the remaining patients (13% of PD, 65% of MSA and 18% of PSP) were left undetermined by our classification algorithm. Conclusions: Compared to existing algorithms, this approach offers comparable accuracy and reliability in diagnosing PD, MSA, and PSP without the need for healthy control images. It can also distinguish between multiple types of parkinsonisms simultaneously and offers the flexibility to easily accommodate new groups.
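The probability-threshold rule described in the abstract above can be sketched as follows. This is only an illustration of the decision step, assuming the multinomial model has already produced one probability per disease group. The function name `classify_with_threshold` and the example probabilities are hypothetical and are not taken from the study.

```python
def classify_with_threshold(group_probs, threshold=0.99):
    """Return the most probable group, or "undetermined" below the threshold.

    group_probs maps a group label (e.g. "PD") to its predicted probability.
    """
    label, prob = max(group_probs.items(), key=lambda item: item[1])
    return label if prob >= threshold else "undetermined"


# Made-up probabilities for two hypothetical patients.
confident = {"PD": 0.995, "MSA": 0.003, "PSP": 0.002}
ambiguous = {"PD": 0.60, "MSA": 0.35, "PSP": 0.05}
print(classify_with_threshold(confident))
print(classify_with_threshold(ambiguous))
```

Raising the threshold trades sensitivity for specificity: fewer patients are misclassified, but more are left undetermined.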
Project description: Many diseases, such as cancer and heart disease, are heterogeneous, and it is of great interest to study subtype-specific disease risk in relation to genetic and environmental risk factors. However, for logistical and cost reasons, the disease subtype information is missing for some subjects. In this article, we investigate methods for multinomial logistic regression with missing outcome data, including bootstrap hot deck multiple imputation (BHMI), simple inverse probability weighted (SIPW), augmented inverse probability weighted (AIPW), and expected estimating equation (EEE) estimators. These methods are important approaches to regression with missing data. The BHMI modifies the standard hot deck multiple imputation method so that it can provide valid confidence interval estimation. When the covariates are discrete, the SIPW, AIPW, and EEE estimators are numerically identical. When the covariates are continuous, nonparametric smoothers can be applied to estimate the selection probabilities and the estimating scores, and the methods perform similarly. Extensive simulations show that all of these methods yield unbiased estimators, while the complete-case (CC) analysis can be biased if the missingness depends on the observed data. Our simulations also demonstrate that these methods can gain substantial efficiency compared with the CC analysis. The methods are applied to a colorectal cancer study in which cancer subtype data are missing for some study individuals.
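To give a feel for the inverse probability weighting idea in the abstract above, here is a deliberately simplified sketch. It assumes a single discrete covariate (a stratum label), estimates the selection probability as the observed fraction within each stratum, and weights each complete case by the inverse of that probability. It estimates weighted subtype proportions only, not the regression estimators studied in the article, and the function name and data are hypothetical.

```python
from collections import Counter, defaultdict


def ipw_subtype_proportions(records):
    """Inverse-probability-weighted subtype proportions (illustrative only).

    records is a list of (stratum, subtype) pairs; subtype is None when missing.
    """
    n_total = Counter(stratum for stratum, _ in records)
    n_observed = Counter(s for s, subtype in records if subtype is not None)
    weighted = defaultdict(float)
    for stratum, subtype in records:
        if subtype is None:
            continue  # incomplete cases contribute only through the weights
        # Estimated selection probability: observed fraction in this stratum.
        pi_hat = n_observed[stratum] / n_total[stratum]
        weighted[subtype] += 1.0 / pi_hat
    total = sum(weighted.values())
    return {subtype: w / total for subtype, w in weighted.items()}


# Made-up data: half of stratum "A" is missing its subtype, stratum "B" is complete.
data = [("A", 1), ("A", 1), ("A", None), ("A", None), ("B", 2), ("B", 2)]
print(ipw_subtype_proportions(data))
```

In this toy data set a complete-case analysis would give each subtype a proportion of 0.5, whereas the weighted estimate restores the contribution of the under-observed stratum.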
Project description: Recent ecological analyses suggest that air pollution exposure may increase susceptibility to, and severity of, coronavirus disease 2019 (COVID-19). Individual-level studies are needed to clarify the relationship between air pollution exposure and COVID-19 outcomes. We conduct an individual-level analysis of the association of long-term exposure to air pollution and weather with peak COVID-19 severity. We develop a Bayesian multinomial logistic regression model with a multiple imputation approach to impute partially missing health outcomes. Our approach is based on the stick-breaking representation of the multinomial distribution, which offers computational advantages but presents challenges in interpreting regression coefficients. We propose a novel inferential approach to address these challenges. In a simulation study, we demonstrate our method's ability to impute missing outcome data and to improve estimation of regression coefficients compared to a complete case analysis. In our analysis of 55,273 COVID-19 cases in Denver, Colorado, increased annual exposure to fine particulate matter in the year prior to the pandemic was associated with increased risk of severe COVID-19 outcomes. We also found COVID-19 disease severity to be associated with interactions between exposures. Our individual-level analysis fills a gap in the literature and helps to elucidate the association between long-term exposure to air pollution and COVID-19 outcomes.
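The stick-breaking representation mentioned in the abstract above can be illustrated with a short sketch. It assumes a logistic link for each "break", which is a common choice but is an assumption here, as are the function names. Each of the K-1 linear predictors takes a fraction of the probability mass left over by the earlier categories, and the last category receives whatever remains. This also hints at why the coefficients are hard to interpret: each one describes a conditional probability rather than the category probability itself.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def stick_breaking_probs(etas):
    """Map K-1 linear predictors to K category probabilities.

    Category k gets sigmoid(eta_k) times the mass not yet assigned to
    categories 1..k-1; the final category gets the remaining mass.
    """
    probs = []
    remaining = 1.0
    for eta in etas:
        fraction = sigmoid(eta)
        probs.append(remaining * fraction)
        remaining *= 1.0 - fraction
    probs.append(remaining)
    return probs


print(stick_breaking_probs([0.0, 0.0]))  # [0.5, 0.25, 0.25]
```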
Project description: Aims: Multinomial logistic regression models allow one to predict the risk of a categorical outcome with more than two categories. When developing such a model, researchers should ensure the number of participants (n) is appropriate relative to the number of events (Ek) and the number of predictor parameters (pk) for each category k. We propose three criteria to determine the minimum n required, in light of existing criteria developed for binary outcomes. Proposed criteria: The first criterion aims to minimise model overfitting. The second aims to minimise the difference between the observed and adjusted Nagelkerke R2. The third criterion aims to ensure the overall risk is estimated precisely. For criterion (i), we show the sample size must be based on the anticipated Cox-Snell R2 of distinct 'one-to-one' logistic regression models corresponding to the sub-models of the multinomial logistic regression, rather than on the overall Cox-Snell R2 of the multinomial logistic regression. Evaluation of criteria: We tested the performance of the proposed criterion (i) through a simulation study and found that it resulted in the desired level of overfitting. Criteria (ii) and (iii) were natural extensions of previously proposed criteria for binary outcomes and did not require evaluation through simulation. Summary: We illustrated how to implement the sample size criteria through a worked example considering the development of a multinomial risk prediction model for tumour type when presented with an ovarian mass. Code is provided for the simulation and worked example. We will embed our proposed criteria within the pmsampsize R library and Stata modules.
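For orientation, the binary-outcome overfitting criterion that the abstract above builds on is commonly written as n = p / ((S - 1) ln(1 - R2_CS / S)), where p is the number of predictor parameters, R2_CS is the anticipated Cox-Snell R2 and S is the target shrinkage factor (often 0.9). The sketch below implements only this binary-outcome formula under those assumptions. It does not reproduce how the article combines the 'one-to-one' sub-models for the multinomial case, and the function name is hypothetical.

```python
import math


def min_n_for_shrinkage(n_params, r2_cs, shrinkage=0.9):
    """Minimum sample size targeting a given shrinkage factor (binary outcome).

    n_params: number of predictor parameters p
    r2_cs: anticipated Cox-Snell R-squared of the model
    shrinkage: target uniform shrinkage factor S, e.g. 0.9
    """
    if not 0.0 < shrinkage < 1.0:
        raise ValueError("shrinkage must lie strictly between 0 and 1")
    if not 0.0 < r2_cs < shrinkage:
        raise ValueError("r2_cs must lie strictly between 0 and shrinkage")
    # Both factors in the denominator are negative, so the product is positive.
    denominator = (shrinkage - 1.0) * math.log(1.0 - r2_cs / shrinkage)
    return math.ceil(n_params / denominator)
```

A smaller anticipated Cox-Snell R2 or a larger number of parameters increases the required sample size, which is consistent with the intent of an overfitting criterion.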