Project description:An ever‑increasing number of long noncoding (lnc)RNAs has been identified in breast cancer. The present study aimed to establish an lncRNA signature for predicting survival in breast cancer. RNA expression profiling was performed using microarray gene expression data from the National Center for Biotechnology Information Gene Expression Omnibus, followed by the identification of breast cancer‑related preserved modules using weighted gene co‑expression network analysis (WGCNA). From the lncRNAs identified in these preserved modules, prognostic lncRNAs were selected using univariate Cox regression analysis in combination with the L1‑penalized (LASSO) Cox proportional hazards (Cox‑PH) model. A risk score based on these prognostic lncRNAs was calculated and used for risk stratification. Differentially expressed RNAs (DERs) in breast cancer were identified using MetaDE. Gene Set Enrichment Analysis (GSEA) pathway enrichment was conducted for these prognostic lncRNAs and for the DERs related to the lncRNAs in the preserved modules. A total of five preserved modules comprising 73 lncRNAs were identified. An eight‑lncRNA signature (IGHA1, IGHGP, IGKV2‑28, IGLL3P, IGLV3‑10, AZGP1P1, LINC00472 and SLC16A6P1) was then derived using the LASSO Cox‑PH model. A risk score based on these eight lncRNAs classified breast cancer patients into two groups with significantly different survival times, and the signature was validated in three independent cohorts. These prognostic lncRNAs were significantly associated with the cell adhesion molecules pathway, the JAK‑signal transducer and activator of transcription 5A pathway, and the ErbB pathway, and are potentially involved in regulating angiotensin II receptor type 1, neuropeptide Y receptor Y1, KISS1 receptor, and C‑C motif chemokine ligand 5. The developed eight‑lncRNA signature may have clinical implications for predicting prognosis in breast cancer.
Overall, this study provided possible molecular targets for the development of novel therapies against breast cancer.
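The risk-score construction described above reduces to a weighted sum of signature-lncRNA expression values, with high/low risk assigned by a cutoff such as the median. A minimal sketch, using hypothetical coefficients and expression values (the study's fitted Cox coefficients are not reproduced here):

```python
# Illustrative risk-score stratification. Coefficients and expression values
# are hypothetical, not the study's fitted LASSO Cox-PH estimates.

def risk_score(expression, coefficients):
    """Linear predictor: sum of coefficient * expression over the signature."""
    return sum(coefficients[g] * expression[g] for g in coefficients)

def stratify(scores):
    """Split patients into high/low risk groups at the median score."""
    ordered = sorted(scores.values())
    median = ordered[len(ordered) // 2]
    return {pid: ("high" if s >= median else "low") for pid, s in scores.items()}
```

In practice the cutoff is usually the median (or an optimized threshold) of the training-set risk scores, then applied unchanged to validation cohorts.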
Project description:Clustered current status data are frequently encountered in biomedical research and other areas that require survival analysis. This paper proposes graphical and formal model assessment procedures to evaluate the goodness of fit of the additive hazards model to clustered current status data. The proposed test statistics are based on sums of martingale-based residuals. Relevant asymptotic properties are established, and empirical distributions of the test statistics can be simulated using Gaussian multipliers. Extensive simulation studies confirmed that the proposed test procedures work well in practical scenarios. The proposed method applies when failure times within the same cluster are correlated and, in particular, when cluster sizes can be informative about intra-cluster correlations. The method is applied to analyze clustered current status data from a lung tumorigenicity study.
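The Gaussian-multiplier idea mentioned above can be illustrated generically: perturb each residual with an independent standard normal multiplier and recompute a sup-type statistic to build an empirical null distribution. A minimal sketch, using toy residual values rather than martingale-based residuals from a fitted additive hazards model:

```python
import random

# Generic Gaussian-multiplier resampling sketch: each residual is multiplied
# by an independent N(0, 1) draw and a sup-type statistic is recomputed.
# The residuals passed in here are toy numbers, not model-based residuals.

def sup_cumsum(values):
    """Supremum of |partial sums|, a common sup-type test statistic."""
    s, best = 0.0, 0.0
    for v in values:
        s += v
        best = max(best, abs(s))
    return best

def multiplier_null(residuals, n_sim=500, seed=1):
    """Simulate the statistic's null distribution by multiplier resampling."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sim):
        perturbed = [r * rng.gauss(0.0, 1.0) for r in residuals]
        sims.append(sup_cumsum(perturbed))
    return sims
```

A p-value is then the fraction of simulated statistics exceeding the observed one.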
Project description:Background: The additive hazards model can be easier to interpret and in some cases fits better than the proportional hazards model. However, sample size formulas for clinical trials with time-to-event outcomes are currently based on either the proportional hazards assumption or an assumption of constant hazards. Aims: The goal is to provide sample size formulas for superiority and non-inferiority trials assuming an additive hazards model but no specific distribution, along with evaluations of the performance of the formulas. Methods: Formulas are presented that determine the required sample size for a given scenario under the additive hazards model. Simulations are conducted to ensure that the formulas attain the desired power. For illustration, the non-inferiority sample size formula is applied to the calculations in the SPORTIF III trial of stroke prevention in atrial fibrillation. Conclusion: Simulation results show that the sample size calculations lead to the correct power. Sample size is easily calculated using a tool that is available on the web at http://leemcdaniel.github.io/samplesize.html.
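The paper's additive-hazards formulas are not reproduced here, but sample size formulas of this kind share a common normal-approximation template: n = ((z_alpha + z_beta) * sigma / delta)^2, where delta is the targeted effect size (here a hazard difference) and sigma its standardized variability. A sketch of that generic template only, with illustrative critical values; the actual formulas are available in the paper and the linked web tool:

```python
import math

# Generic normal-approximation sample size template (an illustration, not
# the paper's additive-hazards formulas). delta is the effect size (e.g., a
# hazard difference) and sigma its standardized variability.

def sample_size(delta, sigma, z_alpha=1.96, z_beta=0.84):
    """Per-arm sample size; defaults give two-sided 5% alpha and 80% power."""
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)
```

Halving the detectable effect size quadruples the required sample size, as the template makes explicit.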
Project description:Background: The successful identification of breast cancer (BRCA) prognostic biomarkers is essential for the strategic intervention of BRCA patients. Recently, various methods have been proposed for exploring a small prognostic gene set that can distinguish the high-risk group from the low-risk group. Methods: Regularized Cox proportional hazards (RCPH) models were proposed to discover prognostic biomarkers of BRCA from gene expression data. First, the maximum connected network of 1142 genes was constructed by mapping 956 differentially expressed genes (DEGs) and 677 previously reported BRCA-related genes onto the gene regulatory network (GRN). Then, the 72 union genes of the four feature gene sets identified by the Lasso-RCPH, Enet-RCPH, [Formula: see text]-RCPH and SCAD-RCPH models were recognized as robust prognostic biomarkers. These biomarkers were validated by literature checks, the BRCA-specific GRN and functional enrichment analysis. Finally, an index of prognostic risk score (PRS) for BRCA was established based on univariate and multivariate Cox regression analysis. Survival analysis was performed to investigate the PRS in 1080 BRCA patients from the internal validation. In particular, a nomogram was constructed to express the relationship between the PRS and other clinical information in the discovery dataset. The PRS was also verified in 1848 BRCA patients across ten external validation datasets or collected cohorts. Results: The nomogram highlighted the importance of the PRS in guiding the prognosis of BRCA patients. In addition, the PRS of 301 normal samples and 306 tumor samples from five independent datasets showed that it is significantly higher in tumors than in normal tissues ([Formula: see text]). The protein expression profiles of the three genes involved in the PRS model, i.e., ADRB1, SAV1 and TSPAN14, demonstrated that the latter two are more strongly stained in tumor specimens.
More importantly, the high-risk group had worse survival than the low-risk group ([Formula: see text]) in both the internal and external validations. Conclusions: The proposed pipeline for detecting and validating prognostic biomarker genes for BRCA is effective and efficient. Moreover, the proposed PRS is promising as an important indicator for judging the prognosis of BRCA patients.
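The survival comparison underlying "the high-risk group has worse survival" is typically a log-rank test: observed versus expected events in one group accumulated over event times, with a hypergeometric variance. A minimal sketch assuming a simple (time, event, group) record layout; ties-handling subtleties are omitted:

```python
# Minimal log-rank statistic sketch for comparing two risk groups.
# Input format (time, event_indicator, group) is an assumed layout.

def logrank_statistic(data):
    """data: list of (time, event_indicator, group) with group in {0, 1}."""
    times = sorted({t for t, e, _ in data if e == 1})
    o_minus_e, var = 0.0, 0.0
    for t in times:
        at_risk = [(e, g) for (tt, e, g) in data if tt >= t]
        n = len(at_risk)
        n1 = sum(1 for _, g in at_risk if g == 1)
        d = sum(e for (tt, e, _) in data if tt == t and e == 1)   # deaths at t
        o1 = sum(e for (tt, e, g) in data if tt == t and e == 1 and g == 1)
        o_minus_e += o1 - d * n1 / n                              # observed - expected
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e ** 2 / var if var > 0 else 0.0
```

The statistic is compared against a chi-squared distribution with one degree of freedom (critical value 3.84 at the 5% level).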
Project description:Introduction: Machine learning algorithms such as elastic net regression and backward selection provide a unique and powerful approach to model building given a set of psychosocial predictors of smoking lapse measured repeatedly via ecological momentary assessment (EMA). Understanding these predictors may aid in developing interventions for smoking lapse prevention. Methods: In a randomized controlled smoking cessation trial, smartphone-based EMAs were collected from 92 participants following a scheduled quit date. This secondary analysis utilized elastic net-penalized Cox proportional hazards regression and model approximation via backward elimination to (1) optimize a predictive model of time to first lapse and (2) simplify that model to its core constituent predictors to maximize parsimony and generalizability. Results: Elastic net proportional hazards regression selected 17 of 26 possible predictors from 2065 EMAs to model time to first lapse. The predictors with the highest-magnitude regression coefficients were having consumed alcohol in the past hour, being around and interacting with a smoker, and having cigarettes easily available. This model was reduced using backward elimination, retaining five predictors and approximating 93.9% of the full model's fit. The retained predictors included those mentioned above as well as feeling irritable and being in areas where smoking is either discouraged or allowed (as opposed to not permitted). Conclusions: The strongest predictors of smoking lapse were environmental in nature (e.g., being in smoking-permitted areas) as opposed to internal factors such as psychological affect. Interventions may be improved by a renewed focus on these predictors. Implications: The present study demonstrated the utility of machine learning algorithms to optimize the prediction of time to smoking lapse using EMA data.
The two models generated by the present analysis found that environmental factors were most strongly related to smoking lapse. The results support the use of machine learning algorithms to investigate intensive longitudinal data, and provide a foundation for the development of highly tailored, just-in-time interventions that can target multiple antecedents of smoking lapse.
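The "model approximation via backward elimination" step described above can be sketched on a linear surrogate: starting from all predictors, repeatedly drop the one whose removal costs the least fit, stopping before the reduced model falls below a chosen fraction of the full model's fit (the study reports retaining five predictors at 93.9% of fit; the threshold and data here are toy, and a linear R-squared stands in for the Cox model's fit measure):

```python
import numpy as np

# Greedy backward elimination on a linear surrogate model. The keep_fraction
# threshold and the linear fit measure are illustrative assumptions.

def r2(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def backward_eliminate(X, y, keep_fraction=0.9):
    full = r2(X, y)
    cols = list(range(X.shape[1]))
    while len(cols) > 1:
        # score each candidate removal by the fit of the reduced model
        scores = {c: r2(X[:, [k for k in cols if k != c]], y) for c in cols}
        best = max(scores, key=scores.get)
        if scores[best] < keep_fraction * full:
            break                      # dropping anything else costs too much fit
        cols.remove(best)
    return cols
```

Predictors whose removal barely changes the fit are pruned first, leaving the core constituent predictors.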
Project description:Background: In survival analysis, data can be modeled using either a multiplicative hazards regression model (such as the Cox model) or an additive hazards regression model (such as Lin's or Aalen's model). While several diagnostic tools are available to check the assumptions underpinning each type of model, there is no defined procedure for fitting these models optimally. Moreover, the two types of model are rarely combined in survival analysis. Here, we propose a strategy for the optimal fitting of multiplicative and additive hazards regression models in survival analysis. Methods: We detail our proposed strategy for the optimal fitting of multiplicative and additive hazards regression models, with a focus on the assumptions underpinning each type of model, the diagnostic tools used to check these assumptions, and the steps followed to fit the data. The proposed strategy draws on classical diagnostic tools (Schoenfeld and martingale residuals) and less common tools (pseudo-observations, martingale residual processes, and Arjas plots). Results: The proposed strategy is applied to a dataset of patients with myocardial infarction (the TRACE data frame). The effects of five covariates (age, sex, diabetes, ventricular fibrillation, and clinical heart failure) on the hazard of death are analyzed using multiplicative and additive hazards regression models. The proposed strategy is shown to fit the data optimally. Conclusions: Survival analysis is improved by using multiplicative and additive hazards regression models together, but specific steps must be followed to fit the data optimally. By providing different measures of the same effect, our proposed strategy allows for better interpretation of the data.
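The additive side of this pairing can be made concrete with Aalen's nonparametric estimator: at each event time, the increment of the cumulative regression function B(t) is the least-squares regression of the event indicators on the covariates of the at-risk set, and the increments are summed over time. A minimal sketch assuming numpy arrays and a design matrix whose first column of ones carries the baseline hazard:

```python
import numpy as np

# Minimal Aalen additive hazards sketch: dB(t) is the least-squares solution
# at each event time, restricted to the risk set; B(t) is the running sum.
# Data layout (times, events, design matrix X with intercept) is assumed.

def aalen_increments(times, events, X):
    """Return event times and cumulative B(t) rows (one per event time)."""
    order = np.argsort(times)
    times, events, X = times[order], events[order], X[order]
    event_times = np.unique(times[events == 1])
    B = np.zeros((len(event_times), X.shape[1]))
    cum = np.zeros(X.shape[1])
    for i, t in enumerate(event_times):
        at_risk = times >= t
        dN = ((times == t) & (events == 1)).astype(float)
        # least-squares solution of X * dB = dN on the risk set
        dB, *_ = np.linalg.lstsq(X[at_risk], dN[at_risk], rcond=None)
        cum = cum + dB
        B[i] = cum
    return event_times, B
```

With an intercept-only design this reduces to the Nelson-Aalen estimator (increments d/n at each event time), which makes a convenient sanity check.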
Project description:In survival data analysis, a competing risk is an event whose occurrence precludes or alters the chance of the occurrence of the primary event of interest. In large cohort studies with long-term follow-up, there are often competing risks. Further, if the event of interest is rare in such large studies, the case-cohort study design is widely used to reduce the cost and achieve the same efficiency as a cohort study. The conventional additive hazards modeling for competing risks data in case-cohort studies involves the cause-specific hazard function, under which direct assessment of covariate effects on the cumulative incidence function, or the subdistribution, is not possible. In this paper, we consider an additive hazard model for the subdistribution of a competing risk in case-cohort studies. We propose estimating equations based on inverse probability weighting methods for the estimation of the model parameters. Consistency and asymptotic normality of the proposed estimators are established. The performance of the proposed methods in finite samples is examined through simulation studies and the proposed approach is applied to a case-cohort dataset from the Sister Study.
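The inverse probability weighting scheme typical of case-cohort designs is simple to state: all cases are fully observed and get weight 1, non-case subcohort members are up-weighted by the inverse of the subcohort sampling probability, and subjects outside the case-cohort sample contribute nothing. The exact weights in the paper's estimating equations may differ; this sketch illustrates the weighting principle only:

```python
# Illustrative case-cohort IPW weights (the weighting principle only; the
# paper's estimating-equation weights may differ in detail).

def case_cohort_weights(is_case, in_subcohort, sampling_prob):
    weights = []
    for case, sub in zip(is_case, in_subcohort):
        if case:
            weights.append(1.0)                   # cases are fully observed
        elif sub:
            weights.append(1.0 / sampling_prob)   # reweighted subcohort controls
        else:
            weights.append(0.0)                   # outside the case-cohort sample
    return weights
```

These weights then multiply each subject's contribution to the estimating equations, recovering (in expectation) the full-cohort contributions.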
Project description:Identifying exceptional responders or nonresponders is an area of increased research interest in precision medicine as these patients may have different biological or molecular features and therefore may respond differently to therapies. Our motivation stems from a real example from a clinical trial where we are interested in characterizing exceptional prostate cancer responders. We investigate the outlier detection and robust regression problem in the sparse proportional hazards model for censored survival outcomes. The main idea is to model the irregularity of each observation by assigning an individual weight to the hazard function. By applying a LASSO-type penalty on both the model parameters and the log transformation of the weight vector, our proposed method is able to perform variable selection and outlier detection simultaneously. The optimization problem can be transformed to a typical penalized maximum partial likelihood problem and thus it is easy to implement. We further extend the proposed method to deal with the potential outlier masking problem caused by censored outcomes. The performance of the proposed estimator is demonstrated with extensive simulation studies and real data analyses in low-dimensional and high-dimensional settings.
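The idea of assigning each observation its own irregularity parameter and shrinking those parameters with an L1 penalty has a much simpler linear-model analogue, the mean-shift formulation: each observation gets its own shift, most shifts are forced to zero, and the surviving nonzero shifts flag candidate outliers. A sketch of that analogue only, solved by simple alternating minimization; it is not the paper's censored-data Cox method:

```python
import numpy as np

# Mean-shift outlier detection sketch for a linear model (an analogue of the
# per-observation-weight idea, not the paper's penalized Cox method).

def soft(x, t):
    """Soft-thresholding operator, the proximal map of the L1 penalty."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def mean_shift_outliers(X, y, lam, n_iter=200):
    """Alternate between refitting beta and L1-shrinking per-point shifts."""
    gamma = np.zeros(len(y))
    for _ in range(n_iter):
        beta, *_ = np.linalg.lstsq(X, y - gamma, rcond=None)  # refit without shifts
        gamma = soft(y - X @ beta, lam)                        # shrink the shifts
    return beta, gamma
```

Observations with nonzero gamma after convergence are the flagged outliers; the jointly refitted beta is robust to them.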
Project description:We develop fast fitting methods for generalized functional linear models. The functional predictor is projected onto a large number of smooth eigenvectors and the coefficient function is estimated using penalized spline regression; confidence intervals based on the mixed model framework are obtained. Our method can be applied to many functional data designs including functions measured with and without error, sparsely or densely sampled. The methods also extend to the case of multiple functional predictors or functional predictors with a natural multilevel structure. The approach can be implemented using standard mixed effects software and is computationally fast. The methodology is motivated by a study of white-matter demyelination via diffusion tensor imaging (DTI). The aim of this study is to analyze differences between various cerebral white-matter tract property measurements of multiple sclerosis (MS) patients and controls. While the statistical developments proposed here were motivated by the DTI study, the methodology is designed and presented in generality and is applicable to many other areas of scientific research. An online appendix provides R implementations of all simulations.
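The projection step described above can be sketched in isolation: each observed curve is projected onto the leading eigenvectors of the sample covariance, and the resulting low-dimensional scores enter a penalized linear fit. The spline basis, mixed-model representation, and confidence intervals are omitted; a simple ridge fit stands in for the penalized spline regression:

```python
import numpy as np

# Projection + penalized fit sketch for functional predictors. Ridge here is
# a stand-in for the paper's penalized spline / mixed-model machinery.

def eigen_scores(curves, n_components):
    """Project curves (subjects x grid points) onto leading eigenvectors."""
    centered = curves - curves.mean(axis=0)
    # eigenvectors of the sample covariance via SVD of the centered matrix
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components].T
    return centered @ basis, basis

def ridge_fit(scores, y, lam=1.0):
    """Penalized least squares on the projection scores."""
    p = scores.shape[1]
    return np.linalg.solve(scores.T @ scores + lam * np.eye(p), scores.T @ y)
```

Because the scores are orthogonal by construction, the penalized fit is cheap regardless of how densely the curves are sampled, which is what makes this approach fast.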
Project description:Background: Identifying genes and pathways associated with diseases such as cancer has been a subject of considerable research in recent years in the areas of bioinformatics and computational biology. It has been demonstrated that the magnitude of differential expression does not necessarily indicate biological significance: even a very small change in the expression of a particular gene may have dramatic physiological consequences if the protein encoded by that gene plays a catalytic role in a specific cell function. Moreover, highly correlated genes may function together in the same biological pathway. Finally, in sparse logistic regression with an Lp (p < 1) penalty, the degree of sparsity obtained is determined by the value of the regularization parameter, which must usually be carefully tuned through cross-validation, a time-consuming process. Results: In this paper, we propose a simple Bayesian approach that integrates the regularization parameter out analytically using a new prior. There is therefore no longer a need for parameter selection, as it is eliminated entirely from the model. The proposed algorithm (BLpLog) is typically two or three orders of magnitude faster than the original algorithm and free from bias in performance estimation. We also define a novel similarity measure and develop an integrated algorithm to hunt for regulatory genes with low expression changes but high correlation with the selected genes. Pathways of those correlated genes were identified with DAVID (http://david.abcc.ncifcrf.gov/). Conclusion: Experimental results with gene expression data demonstrate that the proposed methods can be utilized to identify important genes and pathways that are related to cancer and to build a parsimonious model for future patient predictions.
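The companion "hunting" step described above admits a simple sketch: after a sparse model selects a gene set, scan for additional genes whose expression changes are small but which correlate strongly with the selected genes. Plain Pearson correlation with a fixed cutoff stands in for the paper's novel similarity measure, which is not reproduced here:

```python
import numpy as np

# Correlation-based hunt for regulatory partner genes. Pearson correlation
# with a fixed cutoff is a stand-in for the paper's similarity measure.

def correlated_partners(expr, selected, cutoff=0.8):
    """expr: genes x samples matrix; selected: row indices of chosen genes."""
    corr = np.corrcoef(expr)
    partners = set()
    for s in selected:
        for g in range(expr.shape[0]):
            if g not in selected and abs(corr[s, g]) >= cutoff:
                partners.add(g)
    return sorted(partners)
```

Genes recovered this way can have tiny fold changes yet track a selected gene almost perfectly, which is exactly the case that magnitude-based filters miss.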