
Dataset Information


Investigating alignment-free machine learning methods for HIV-1 subtype classification.


ABSTRACT:

Motivation

Many viruses are organized into taxonomies of subtypes based on their genetic similarities. For human immunodeficiency virus 1 (HIV-1), subtype classification plays a crucial role in infection management. Sequence alignment-based methods for subtype classification are impractical for large datasets because they are costly and time-consuming. Alignment-free methods involve creating numerical representations for genetic sequences and applying statistical or machine learning methods. Despite their high overall accuracy, existing models perform poorly on less common subtypes. Furthermore, there is limited work investigating the impact of sequence vectorization methods, in particular natural language-inspired embedding methods, on HIV-1 subtype classification.
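As a rough illustration of what "creating numerical representations for genetic sequences" can look like, the sketch below turns a nucleotide string into a fixed-length vector of normalized k-mer frequencies. It is not taken from the authors' repository: the function name `kmer_vector`, the default `k=3`, and the choice to skip windows containing ambiguous bases are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_vector(sequence, k=3, alphabet="ACGT"):
    """Return normalized k-mer frequencies in a fixed lexicographic order.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    sequence = sequence.upper()
    allowed = set(alphabet)
    # Fixed vocabulary so every sequence maps to a vector of the same length
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    # Slide a window of width k; skip windows with ambiguous bases (assumption)
    counts = Counter(
        sequence[i:i + k]
        for i in range(len(sequence) - k + 1)
        if set(sequence[i:i + k]) <= allowed
    )
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


vector = kmer_vector("ACGT", k=2)
```

Because the vocabulary is fixed, vectors from sequences of different lengths have the same dimensionality (4^k for the four-letter alphabet) and could be passed to a standard classifier without any alignment step.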

Results

We present a comprehensive analysis of sequence vectorization methods across machine learning methods. We report a k-mer-based XGBoost model with a balanced accuracy of 0.84, indicating that it has good overall performance for both common and uncommon HIV-1 subtypes. We also report a Word2Vec-based support vector machine that achieves promising results on precision and balanced accuracy. Our study sheds light on the effect of sequence vectorization methods on HIV-1 subtype classification and suggests that natural language-inspired encoding methods show promise. Our results could help to develop improved HIV-1 subtype classification methods, leading to improved individual patient outcomes and the development of subtype-specific treatments.
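Balanced accuracy is commonly defined as the mean of per-class recall, which is why it is sensitive to performance on rare classes. The sketch below assumes that definition; the abstract does not spell out the exact formula used, and the labels in the usage lines are made up for illustration rather than drawn from the dataset.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (assumed definition; illustration only)."""
    correct = defaultdict(int)
    support = defaultdict(int)
    for truth, prediction in zip(y_true, y_pred):
        support[truth] += 1
        if truth == prediction:
            correct[truth] += 1
    # Each class contributes equally, regardless of how many samples it has
    recalls = [correct[label] / support[label] for label in support]
    return sum(recalls) / len(recalls)


# Hypothetical labels: 8 samples of a common class, 2 of a rare one
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 8 + ["common", "rare"]
score = balanced_accuracy(y_true, y_pred)  # 0.75, while plain accuracy is 0.9
```

In this toy case one of the two rare samples is misclassified, so plain accuracy stays high while balanced accuracy drops, mirroring the gap between overall accuracy and performance on less common subtypes.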

Availability and implementation

Source code is available at https://www.github.com/kwade4/HIV_Subtypes.

SUBMITTER: Wade KE 

PROVIDER: S-EPMC11371153 | biostudies-literature | 2024

REPOSITORIES: biostudies-literature


Publications

Investigating alignment-free machine learning methods for HIV-1 subtype classification.

Wade KE, Chen L, Deng C, Zhou G, Hu P

Bioinformatics Advances, 2024-07-29, issue 1



Similar Datasets

| S-EPMC6288788 | biostudies-literature
| S-EPMC7910972 | biostudies-literature
| S-EPMC10417520 | biostudies-literature
| S-EPMC8472680 | biostudies-literature
| S-EPMC6912926 | biostudies-literature
| S-EPMC4743480 | biostudies-literature
| S-EPMC11518927 | biostudies-literature
| S-EPMC6158771 | biostudies-other
| S-EPMC10929170 | biostudies-literature
| S-EPMC9941804 | biostudies-literature