Project description:Although normal tissue samples adjacent to tumors are sometimes collected from patients in cancer studies, they are often used only as normal controls to identify genes differentially expressed between tumor and normal samples. However, paired normal samples are in general more difficult to obtain and to define clearly, and it is unclear whether they should be treated as "normal" given their close proximity to tumors. In this article, by analyzing the accrued data in The Cancer Genome Atlas (TCGA), we show the surprising result that the paired normal samples are in general more informative on patient survival than tumors. Different lines of evidence suggest that this is likely due to the tumor micro-environment rather than tumor cell contamination or the field cancerization effect. Pathway analyses suggest that the tumor micro-environment may play an important role in cancer patient survival either by boosting the adjacent metabolism or the in situ immunization. Our results suggest the potential benefit of collecting and profiling matched normal tissues to gain more insights on disease etiology and patient progression.
Project description:Adverse pregnancy outcomes (APOs) are associated with an increased risk of chronic diseases, including cardiovascular disease (CVD) and metabolic syndrome (MS), in the future. We designed a large-scale cohort study to evaluate the influence of APOs (preeclampsia, gestational diabetes mellitus (GDM), stillbirth, macrosomia, and low birth weight) on the incidence of chronic diseases, body measurements, and serum biochemistry in the future and investigate whether combinations of APOs had additive effects on chronic diseases. We used health examinee data from the Korean Genome and Epidemiology Study (KoGES-HEXA) and extracted data of parous women (n = 30,174; mean age, 53.02 years) for the analysis. Women with APOs were more frequently diagnosed with chronic diseases and had a family history of chronic diseases compared with women without APOs. Composite APOs were associated with an increased risk of hypertension, diabetes mellitus, hyperlipidemia, angina pectoris, stroke, and MS (adjusted odds ratio: 1.093, 1.379, 1.269, 1.351, 1.414, and 1.104, respectively) after adjustment for family history and social behaviors. Preeclampsia and GDM were associated with an increased risk of some chronic diseases; however, the combination of preeclampsia and GDM did not have an additive effect on the risk. APOs moderately influenced the future development of maternal CVD and metabolic derangements, independent of family history and social behaviors.
Project description:Mucins are commonly associated with pancreatic ductal adenocarcinoma (PDAC), which is a deadly disease because of the lack of early diagnosis and efficient therapies. There are 22 mucin genes encoding large O-glycoproteins divided into two major subgroups: membrane-bound and secreted mucins. We investigated mucin expression and its impact on patient survival in the PDAC dataset from The Cancer Genome Atlas (PAAD-TCGA). We observed a statistically significant increase in the relative messenger RNA (mRNA) level of most of the membrane-bound mucins (MUC1/3A/4/12/13/16/17/20), secreted mucins (MUC5AC/5B), and atypical mucins (MUC14/18) compared to normal pancreas. We show that MUC1/4/5B/14/17/20/21 mRNA levels are associated with poorer survival in the high-expression group compared to the low-expression group. Using unsupervised clustering analysis of mucin gene expression patterns, we identified two major clusters of patients. Cluster #1 harbors a higher expression of MUC15 and atypical MUC14/MUC18, whereas cluster #2 is characterized by a global overexpression of membrane-bound mucins (MUC1/4/16/17/20/21). Cluster #2 is associated with shorter overall survival. The patient stratification appears to be independent of usual clinical features (tumor stage, differentiation grade, lymph node invasion), suggesting that the pattern of membrane-bound mucin expression could be a new prognostic marker for PDAC patients.
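The comparison of survival between high-expression and low-expression groups can be illustrated with a Kaplan-Meier estimate per group. The following is a minimal sketch, not the authors' pipeline: the median cutoff, the function names, and the toy data are all assumptions made for illustration, since the abstract does not state how the groups were defined.

```python
from statistics import median


def kaplan_meier(times, events):
    """Kaplan-Meier estimate: list of (time, survival) at each event time.

    times  -- follow-up time per patient
    events -- 1 if the event (death) was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


def km_by_expression(expression, times, events):
    """Split patients at the median expression (an assumed cutoff)."""
    cut = median(expression)
    groups = {"high": ([], []), "low": ([], [])}
    for x, t, e in zip(expression, times, events):
        key = "high" if x > cut else "low"
        groups[key][0].append(t)
        groups[key][1].append(e)
    return {k: kaplan_meier(t, e) for k, (t, e) in groups.items()}


# Hypothetical toy data: expression level, months of follow-up, event flag.
curves = km_by_expression([5, 1, 6, 2], [2, 8, 3, 9], [1, 0, 1, 1])
for group, curve in curves.items():
    print(group, curve)
```

A formal comparison of the two curves would additionally need a test such as the log-rank test, which this sketch omits.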
Project description:Both serum creatinine (sCr) and estimated glomerular filtration rate (eGFR) have been used to assess kidney function in public health check-ups. However, when the sCr is within the normal range but the eGFR is <60 mL/min/1.73 m², a dilemma arises, as the patients might progress to chronic kidney disease (CKD) after several years. We aimed to evaluate the association between normal sCr and the risk of incident CKD in the general population. For this, 9445 subjects from the Korean Genome and Epidemiology Study, with normal sCr and eGFR of >60 mL/min/1.73 m², were analyzed. The subjects were classified into quartiles based on sCr levels. The primary outcome was the development of eGFR <60 mL/min/1.73 m² on two consecutive measures. During a mean follow-up of 8.4 ± 4.3 years, 779 (8.2%) subjects developed eGFR <60 mL/min/1.73 m². The incidence of the development of eGFR <60 mL/min/1.73 m² was higher in the higher quartiles than in the lowest quartile. In multivariable Cox analysis, the highest quartile was associated with an increased risk for the development of eGFR <60 mL/min/1.73 m² (hazard ratio (HR), 4.71; 95% confidence interval (CI), 3.29–6.74 in females; HR, 12.77; 95% CI, 7.69–21.23 in males). In the receiver operating characteristic curve analysis, adding sCr to the traditional risk factors for CKD improved the accuracy of predicting the development of eGFR <60 mL/min/1.73 m² (area under the curve, 0.83 vs. 0.80 in females and 0.85 vs. 0.78 in males), and the cutoff value of sCr was 0.75 mg/dL in females and 0.78 mg/dL in males. Cautious interpretation is necessary when sCr is within the normal range, considering that the upper normal range of sCr carries a higher risk of CKD development.
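The area under the ROC curve reported above has a simple interpretation: it is the probability that a randomly chosen subject who develops the outcome receives a higher risk score than a randomly chosen subject who does not. A minimal sketch of that computation follows; the function name and the toy scores are illustrative assumptions and have no connection to the study data.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores -- predicted risk per subject (higher means more at risk)
    labels -- 1 if the subject developed the outcome, else 0
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical risk scores from two models for the same four subjects.
labels = [0, 0, 1, 1]
baseline_model = [0.10, 0.40, 0.35, 0.80]
extended_model = [0.10, 0.30, 0.35, 0.80]
print(roc_auc(baseline_model, labels), roc_auc(extended_model, labels))
```

Comparing the AUC of a model with and without an added predictor, as in the toy call above, mirrors the kind of comparison the abstract reports. The quadratic pairwise loop is chosen for clarity and would be replaced by a rank-based formula on large cohorts.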
Project description:Effective and powerful survival mediation models are currently lacking. To partly fill this knowledge gap, we focus on mediation analysis that includes multiple DNA methylations acting as exposures, one gene expression as the mediator, and one survival time as the outcome. We proposed IUSMMT (intersection-union survival mixture-adjusted mediation test) to effectively examine the existence of a mediation effect by fitting an empirical three-component mixture null distribution. With extensive simulation studies, we demonstrated the advantage of IUSMMT over existing methods. We applied IUSMMT to ten TCGA cancers and identified multiple genes that exhibited mediating effects. We further revealed that most of the identified regions, in which genes behaved as active mediators, were cancer type-specific and exhibited a full mediation from DNA methylation CpG sites to the survival risk of various types of cancers. Overall, IUSMMT represents an effective and powerful alternative for survival mediation analysis; our results also provide new insights into the functional role of DNA methylation and gene expression in cancer progression/prognosis and demonstrate potential therapeutic targets for future clinical practice.
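The intersection-union idea behind this kind of mediation test can be sketched in a few lines. A mediation effect requires both the exposure-to-mediator effect and the mediator-to-outcome effect to be nonzero, so the composite null is rejected only when both component nulls are rejected, which amounts to taking the larger of the two p-values. The code below shows only this classical max-P version under a standard normal null; it does not implement the mixture-adjusted null distribution that IUSMMT fits, and the function names and z statistics are hypothetical.

```python
import math


def two_sided_p(z):
    """Two-sided p-value of a z statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))


def max_p_mediation_test(z_alpha, z_beta):
    """Classical intersection-union (max-P) mediation p-value.

    z_alpha -- z statistic for the exposure -> mediator effect
    z_beta  -- z statistic for the mediator -> outcome effect
    The composite null (alpha = 0 or beta = 0) is rejected only if
    both component tests reject, so the p-value is the maximum.
    """
    return max(two_sided_p(z_alpha), two_sided_p(z_beta))


# Hypothetical z statistics for one methylation-gene-survival triple.
print(max_p_mediation_test(3.0, 1.96))
```

The max-P test is known to be conservative when both effects are absent, which is the kind of limitation a mixture-adjusted null is meant to address.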
Project description:Chronic diseases represent a serious threat to public health across the world. They are estimated to account for about 60% of all deaths worldwide and approximately 43% of the global burden of disease. Thus, the analysis of healthcare data has helped health officials, patients, and healthcare communities to perform early detection of those diseases. Extracting patterns from healthcare data has helped the healthcare communities to obtain complete medical data for the purpose of diagnosis. The objective of the present research work is to improve the surveillance detection system for chronic diseases, which is used for the protection of people's lives. For this purpose, the proposed system has been developed to enhance the detection of chronic disease by using machine learning algorithms. The standard data related to chronic diseases have been collected from various worldwide resources. Healthcare data on chronic diseases often include objects whose class is ambiguous. Such objects exhibit traits of two or more classes, which reduces the accuracy of the machine learning algorithms. The novelty of the current research work lies in the use of noncrisp Rough K-means (RKM) clustering to identify the ambiguity in chronic disease datasets and thereby improve the performance of the system. The RKM algorithm has clustered data into two sets, namely, the upper approximation and lower approximation. The objects belonging to the upper approximation are favourable objects, whereas the ones belonging to the lower approximation are excluded and identified as ambiguous. These ambiguous objects have been excluded to improve the machine learning algorithms. The machine learning algorithms, namely, naïve Bayes (NB), support vector machine (SVM), K-nearest neighbors (KNN), and random forest, are presented and compared.
The chronic disease data are obtained from the machine learning repository and Kaggle to test and evaluate the proposed model. The experimental results demonstrate that the proposed system is successfully employed for the diagnosis of chronic diseases. In terms of accuracy, the proposed model achieved its best results with naïve Bayes with RKM for the classification of diabetic disease (80.55%), with SVM with RKM for the classification of kidney disease (100%), and with SVM with RKM for the classification of cancer disease (97.53%). The performance measures, such as accuracy, sensitivity, specificity, precision, and F-score, are employed to evaluate the performance of the proposed system. Furthermore, evaluation and comparison of the proposed system with the existing machine learning algorithms are presented. Finally, the proposed system has enhanced the performance of machine learning algorithms.
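The performance measures named above all derive from the four cells of a binary confusion matrix. The following sketch shows the standard definitions; the function name and the example counts are illustrative assumptions and do not reproduce the reported results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts.

    tp, fp, tn, fn -- true/false positives and true/false negatives
    Ratios with a zero denominator are returned as 0.0.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # recall, true positive rate
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)     # positive predictive value
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        # Harmonic mean of precision and sensitivity.
        "f_score": ratio(2 * precision * sensitivity,
                         precision + sensitivity),
    }


# Hypothetical counts for a 100-patient test set.
for name, value in classification_metrics(40, 10, 45, 5).items():
    print(f"{name}: {value:.4f}")
```

Reporting sensitivity and specificity alongside accuracy matters for disease data, where one class is often much rarer than the other and accuracy alone can look high for a classifier that misses most cases.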
Project description:Highlights: • Disease risk prediction trained with one-class labeled data. • Removing the gender and age dependency of input parameters removes training bias. • Two-stage clustering approach identifies high-risk individuals before disease onset. • Seniors in the high-risk group have at least twice the risk of developing disease. Early detection of chronic diseases such as cardiovascular disease (CVD) and diabetes can make the difference between life and death. Previous studies have demonstrated the feasibility of disease diagnosis and prediction using machine learning and disease-indicating biomarkers. The aim of this study is to develop a method to detect the risk of future disease even when disease-indicating biomarker readings are in the normal range. Data from the US Centers for Disease Control and Prevention (CDC) National Health and Nutrition Examination Surveys (NHANES) are used for this study. A two-stage semi-supervised K-Means (SSK-Means) clustering approach was developed to identify the underlying risk of each individual and categorize them into high or low-risk groups for CVD and diabetes. Our classification method can identify groups as high risk or low risk, even if they would have been considered normal using traditional biomarker threshold criteria. For CVD, the SSK-Means clustering results showed that individuals over 30 years of age in the high-risk group were almost twice as likely to develop CVD as individuals in the low-risk group. For diabetes, the SSK-Means clustering results showed that individuals over 50 years of age in the high-risk group have at least two times the risk of developing diabetes compared with individuals in the low-risk group.
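Statements such as "twice as likely" compare the proportion of individuals who develop disease in each group, which is a relative risk. A minimal sketch of that calculation follows; the function name and the counts are hypothetical and are not taken from the NHANES analysis.

```python
def relative_risk(cases_high, n_high, cases_low, n_low):
    """Risk ratio of the high-risk group versus the low-risk group.

    cases_high / n_high -- disease cases and group size, high-risk cluster
    cases_low / n_low   -- disease cases and group size, low-risk cluster
    """
    if n_high <= 0 or n_low <= 0:
        raise ValueError("group sizes must be positive")
    if cases_low == 0:
        raise ValueError("risk ratio is undefined with no low-risk cases")
    risk_high = cases_high / n_high
    risk_low = cases_low / n_low
    return risk_high / risk_low


# Hypothetical counts: 30 of 100 versus 15 of 100 develop disease.
print(relative_risk(30, 100, 15, 100))
```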
Project description:Background: Several genome-wide association studies (GWAS) have been performed to identify variants related to chronic diseases. Somatic variants in cancer tissues are associated with cancer development and prognosis. Expression quantitative trait loci (eQTL) and methylation QTL (mQTL) analyses were performed on chronic disease-related variants in the TCGA dataset. Methods: MuTect2 variant calls for 33 cancers from TCGA and 296 GWAS variants provided by LocusZoom were used. At least one mutation was found in 22 TCGA cancers and 23 LocusZoom studies. Differentially expressed genes (DEGs) and differentially methylated regions (DMRs) were identified from three cancers (TCGA-COAD, TCGA-STAD, and TCGA-UCEC). Variants were mapped to the world map using the locations of the 1000 Genomes Project (1GP) populations. Decision tree analysis was performed on the discovered features and survival analysis was performed according to the cluster. Results: Based on the DEGs and DMRs with clinical data, the decision tree model classified seven and three nodes in TCGA-COAD and TCGA-STAD, respectively. A total of 11 variants were commonly detected from TCGA and LocusZoom, eight variants were selected from the 1GP variants, and the distribution patterns were visualized on the world map. Conclusions: Variants related to tumors and chronic diseases were selected, and their 1GP-based proportions by geographical region are presented. The variant distribution patterns could provide clues for regional clinical trial designs and personalized medicine.
Project description:Survival analysis involves the modelling of times to event. Proposed neural network approaches maximise the predictive performance of traditional survival models at the cost of their interpretability. This impairs their applicability in high-stakes domains such as medicine. Providing insights into the survival distributions would tackle this issue and advance the medical understanding of diseases. This paper approaches survival analysis as a mixture of neural baselines whereby different baseline cumulative hazard functions are modelled using positive and monotone neural networks. The efficiency of the solution is demonstrated on three datasets while enabling the discovery of new survival phenotypes.
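One way to make a network positive and monotone in time is to pass every weight through a positive transform and use increasing activations, so the output can only grow with t. The sketch below illustrates that idea for a mixture of cumulative hazards. It is an assumed toy architecture with hand-picked, untrained parameters, not the model from the paper, and all names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cumulative_hazard(t, raw_w, raw_v):
    """Tiny monotone network: H(t) = sum_j softplus(v_j) * tanh(softplus(w_j) * t).

    Positive weights and an increasing activation give H(0) = 0 and
    H non-decreasing in t, as a cumulative hazard must be.
    """
    return sum(softplus(v) * math.tanh(softplus(w) * t)
               for w, v in zip(raw_w, raw_v))


def mixture_survival(t, components, logits):
    """Survival S(t) = sum_k pi_k * exp(-H_k(t)) with softmax weights pi_k."""
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return sum((e / total) * math.exp(-cumulative_hazard(t, w, v))
               for e, (w, v) in zip(exps, components))


# Two hypothetical baselines, each given as (raw_w, raw_v) parameter lists.
components = [([0.0, 1.0], [0.5, -0.5]), ([-1.0], [1.0])]
logits = [0.0, 0.0]
for t in (0.0, 0.5, 1.0, 2.0, 5.0):
    print(t, round(mixture_survival(t, components, logits), 4))
```

Because tanh saturates, each cumulative hazard in this toy version is bounded, so the survival curve levels off above zero; an unbounded increasing activation would remove that restriction. Each mixture component can be read as one candidate survival phenotype, which is where the interpretability argument comes from.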