
Dataset Information


TBGA: a large-scale Gene-Disease Association dataset for Biomedical Relation Extraction.


ABSTRACT:

Background

Databases are fundamental to advancing biomedical science. However, most of them are populated and updated with a great deal of human effort. Biomedical Relation Extraction (BioRE) aims to shift this burden to machines. Among its different applications, the discovery of Gene-Disease Associations (GDAs) is one of BioRE's most relevant tasks. Nevertheless, few resources have been developed to train models for GDA extraction. Moreover, these resources are all limited in size, which prevents models from scaling effectively to large amounts of data.

Results

To overcome this limitation, we have exploited the DisGeNET database to build a large-scale, semi-automatically annotated dataset for GDA extraction. DisGeNET stores one of the largest available collections of genes and variants involved in human diseases. Relying on DisGeNET, we developed TBGA: a GDA extraction dataset generated from more than 700K publications that consists of over 200K instances and 100K gene-disease pairs. Each instance consists of the sentence from which the GDA was extracted, the corresponding GDA, and the information about the gene-disease pair.
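As an illustration of the instance structure described above, the following is a minimal sketch and not the dataset's actual schema. The class name, the field names, the identifiers, and the labels are all hypothetical placeholders. It shows how several sentence-level instances can map to the same gene-disease pair, which is one plausible reason the instance count exceeds the pair count.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GDAInstance:
    """One sentence-level instance (hypothetical field names)."""
    sentence: str      # sentence from which the GDA was extracted
    association: str   # the GDA label (placeholder values below)
    gene_id: str       # gene half of the gene-disease pair
    disease_id: str    # disease half of the gene-disease pair


def group_by_pair(instances):
    """Group instances by their (gene_id, disease_id) pair."""
    pairs = defaultdict(list)
    for inst in instances:
        pairs[(inst.gene_id, inst.disease_id)].append(inst)
    return dict(pairs)


# Made-up records for illustration only; not taken from TBGA.
examples = [
    GDAInstance("GENE_A is overexpressed in DISEASE_X.", "LABEL_1", "gene:A", "disease:X"),
    GDAInstance("Variants of GENE_A were linked to DISEASE_X.", "LABEL_2", "gene:A", "disease:X"),
    GDAInstance("GENE_B was studied in DISEASE_Y.", "LABEL_1", "gene:B", "disease:Y"),
]

grouped = group_by_pair(examples)
print(len(examples), "instances,", len(grouped), "gene-disease pairs")
```

The field names used in the released files may differ, so the dataset's own documentation should be consulted before adapting this sketch.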

Conclusions

TBGA is amongst the largest datasets for GDA extraction. We have evaluated state-of-the-art models for GDA extraction on TBGA, showing that it is a challenging and well-suited dataset for the task. We made the dataset publicly available to foster the development of state-of-the-art BioRE models for GDA extraction.

SUBMITTER: Marchesin S 

PROVIDER: S-EPMC8973894 | biostudies-literature | 2022 Mar

REPOSITORIES: biostudies-literature


Publications

TBGA: a large-scale Gene-Disease Association dataset for Biomedical Relation Extraction.

Marchesin Stefano, Silvello Gianmaria

BMC Bioinformatics, 2022-03-31, issue 1



Similar Datasets

| S-EPMC9487702 | biostudies-literature
| S-EPMC10551783 | biostudies-literature
| S-EPMC7222583 | biostudies-literature
| S-EPMC10112952 | biostudies-literature
| S-EPMC5086401 | biostudies-literature
| S-EPMC10407973 | biostudies-literature
| S-EPMC10370213 | biostudies-literature
| S-EPMC11629692 | biostudies-literature
| S-EPMC9578183 | biostudies-literature
| S-EPMC7485218 | biostudies-literature