Dataset Information

Contextualizing Genes by Using Text-Mined Co-Occurrence Features for Cancer Gene Panel Discovery.

ABSTRACT: Developing a biomedically explainable and validatable text mining pipeline can help in cancer gene panel discovery. We create a pipeline that contextualizes genes by using text-mined co-occurrence features. We apply Biomedical Natural Language Processing (BioNLP) techniques for literature mining of cancer gene panels. A literature-derived 4,679 × 4,630 gene term-feature matrix was built. The EGFR L858R and T790M and BRAF V600E genetic variants are important mutation term features in text mining and are frequently mutated in cancer. We validate the cancer gene panel against the mutational landscape of different cancer types. The cosine similarity of gene frequency between text mining and a statistical result from clinical sequencing data is 80.8%. Across different machine learning models, the best accuracy for predicting two gene panels, MSK-IMPACT (Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets) and the Oncomine cancer gene panel, is 0.959 and 0.989, respectively. Receiver operating characteristic (ROC) curve analysis confirmed that the neural net model has better prediction performance (area under the ROC curve (AUC) = 0.992). The use of text-mined co-occurrence features can contextualize each gene. We believe the approach can be used to evaluate several existing gene panels, and we show that part of a gene panel set can be used to predict the remaining genes for cancer discovery.
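
The abstract refers to two computations: a gene term-feature matrix built from co-occurrence counts, and a cosine similarity between frequency vectors. The following is a minimal sketch of both ideas and not the authors' pipeline. The function names, the whitespace tokenization, and the three toy sentences are assumptions made for illustration only.

```python
import math
from collections import Counter


def cooccurrence_matrix(sentences, genes, terms):
    """Return a genes x terms matrix of sentence-level co-occurrence counts.

    Assumption: a gene and a term "co-occur" when both appear as
    whitespace-separated tokens in the same sentence (case-sensitive).
    """
    counts = {gene: Counter() for gene in genes}
    for sentence in sentences:
        tokens = set(sentence.split())
        for gene in genes:
            if gene not in tokens:
                continue
            for term in terms:
                if term in tokens:
                    counts[gene][term] += 1
    return [[counts[gene][term] for term in terms] for gene in genes]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Toy data, invented for this sketch (not taken from the dataset).
sentences = [
    "EGFR L858R mutation in lung cancer",
    "EGFR T790M resistance in lung cancer",
    "BRAF V600E mutation in melanoma",
]
genes = ["EGFR", "BRAF"]
terms = ["L858R", "T790M", "V600E", "lung", "melanoma"]

matrix = cooccurrence_matrix(sentences, genes, terms)
egfr_vs_braf = cosine_similarity(matrix[0], matrix[1])
```

Each row of `matrix` is the feature vector that contextualizes one gene. The same `cosine_similarity` function could compare a literature-derived gene frequency vector with one derived from sequencing data, provided both vectors list genes in the same order.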

SUBMITTER: Chen HO

PROVIDER: S-EPMC8573063 | biostudies-literature | 2021

REPOSITORIES: biostudies-literature

Publications

Contextualizing Genes by Using Text-Mined Co-Occurrence Features for Cancer Gene Panel Discovery.

Chen Hui-O, Lin Peng-Chan, Liu Chen-Ruei, Wang Chi-Shiang, Chiang Jung-Hsien

Frontiers in Genetics, 2021-10-25

PMID: 34759963

Similar Datasets

Evaluating a variety of text-mined features for automatic protein function prediction with GOstruct.
| S-EPMC4441003 | biostudies-literature

Text-mined fossil biodiversity dynamics using machine learning.
| S-EPMC6501925 | biostudies-literature

Text-mined dataset of inorganic materials synthesis recipes.
| S-EPMC6794279 | biostudies-literature

Using text-mined trait data to test for cooperate-and-radiate co-evolution between ants and plants.
| S-EPMC6776258 | biostudies-literature

Imitating manual curation of text-mined facts in biomedicine.
| S-EPMC1560402 | biostudies-literature

Genome-wide identification of major genes and genomic prediction using high-density and text-mined gene-based SNP panels in Hanwoo (Korean cattle).
| S-EPMC7710051 | biostudies-literature

Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks.
| S-EPMC5268788 | biostudies-literature

Co-occurrence of transcription and translation gene regulatory features underlies coordinated mRNA and protein synthesis.
| S-EPMC4158080 | biostudies-literature

Integration and publication of heterogeneous text-mined relationships on the Semantic Web.
| S-EPMC3102890 | biostudies-other

CoCoScore: context-aware co-occurrence scoring for text mining applications using distant supervision.
| S-EPMC6956794 | biostudies-literature