Dataset Information

Investigating alignment-free machine learning methods for HIV-1 subtype classification.

ABSTRACT:

Motivation

Many viruses are organized into taxonomies of subtypes based on their genetic similarities. For human immunodeficiency virus 1 (HIV-1), subtype classification plays a crucial role in infection management. Sequence alignment-based methods for subtype classification are impractical for large datasets because they are costly and time-consuming. Alignment-free methods involve creating numerical representations for genetic sequences and applying statistical or machine learning methods. Despite their high overall accuracy, existing models perform poorly on less common subtypes. Furthermore, there is limited work investigating the impact of sequence vectorization methods, in particular natural language-inspired embedding methods, on HIV-1 subtype classification.
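As a rough illustration of what "creating numerical representations for genetic sequences" can look like, the sketch below turns a nucleotide string into a fixed-length vector of normalized k-mer frequencies. It is not taken from the authors' repository: the function name `kmer_vector`, the choice of k, the toy sequence, and the decision to ignore k-mers containing ambiguous bases are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def kmer_vector(sequence, k=3, alphabet="ACGT"):
    """Return normalized k-mer frequencies in a fixed lexicographic order.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    sequence = sequence.upper()
    # Fixed vocabulary so every sequence maps to a vector of the same length.
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Assumption: windows with ambiguous bases (e.g. N) are simply ignored.
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


# Toy sequence, not real HIV-1 data.
toy_sequence = "ACGTACGTNACG"
vec = kmer_vector(toy_sequence, k=2)
```

Vectors of this kind have the same length for every sequence (4^k entries for a four-letter alphabet), so they can be passed directly to a standard classifier without aligning the sequences first.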

Results

We present a comprehensive analysis of sequence vectorization methods across machine learning methods. We report a k-mer-based XGBoost model with a balanced accuracy of 0.84, indicating that it has good overall performance for both common and uncommon HIV-1 subtypes. We also report a Word2Vec-based support vector machine that achieves promising results on precision and balanced accuracy. Our study sheds light on the effect of sequence vectorization methods on HIV-1 subtype classification and suggests that natural language-inspired encoding methods show promise. Our results could help to develop improved HIV-1 subtype classification methods, leading to improved individual patient outcomes, and the development of subtype-specific treatments.
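Balanced accuracy is commonly defined as the mean of per-class recall, which is why it reflects performance on uncommon subtypes as well as common ones. The sketch below computes it from scratch under that definition; the function name and the toy labels are hypothetical and are not drawn from the study's data or code.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if not y_true:
        raise ValueError("y_true must not be empty")
    support = defaultdict(int)  # number of true examples per class
    correct = defaultdict(int)  # number of correctly predicted examples per class
    for truth, pred in zip(y_true, y_pred):
        support[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / support[label] for label in support]
    return sum(recalls) / len(recalls)


# Toy labels: a common class "B" and a rare class "G" (illustrative only).
y_true = ["B"] * 8 + ["G"] * 2
y_pred = ["B"] * 8 + ["B", "G"]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
```

In this toy case plain accuracy is 0.9 while balanced accuracy is 0.75, because half of the rare class is misclassified and each class contributes equally to the average.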

Availability and implementation

Source code is available at https://www.github.com/kwade4/HIV_Subtypes.

SUBMITTER: Wade KE

PROVIDER: S-EPMC11371153 | biostudies-literature | 2024

REPOSITORIES: biostudies-literature

Publications

Investigating alignment-free machine learning methods for HIV-1 subtype classification.

Kaitlyn E Wade, Lianghong Chen, Chutong Deng, Gen Zhou, Pingzhao Hu

Bioinformatics Advances, 2024-07-29, 1

PMID: 39228995

Similar Datasets

Project description:

Background: Near-infrared indocyanine green angiography allows experienced surgeons to reliably evaluate parathyroid gland vitality during thyroid and parathyroid operations in order to predict postoperative function. To facilitate equal performance between surgeons, we developed an automatic computational quantification method using computer vision that portrays expert interpretation of visualized parathyroid gland near-infrared indocyanine green angiographic fluorescence signals.

Methods: Near-infrared indocyanine green-parathyroid gland angiography video recordings (Fluobeam® LX, Fluoptics, Grenoble, part of Getinge, Göteborg) from patients undergoing endocrine cervical surgery in a high-volume unit were used for model development. Computation (MATLAB, Mathworks, Ireland) included segmentation-identification of the parathyroid gland (by autofluorescence), image stabilization (by linear translation) and adjusted time-fluorescence intensity profile generation. Relative upslope and maximum intensity ratios then trained a simple logistic regression model based on expert interpretation and outcome (including hypoparathyroidism), with subsequent unseen testing for validation.

Results: The model was trained on 37 patient videos (45 glands, 29 judged well perfused by parathyroid gland angiography experts), achieving feature data separation with 100% accuracy, and tested on 22 unseen videos (27 glands, 15 judged well perfused), including four in real time. Segmentation-guided parathyroid gland detection correctly identified all parathyroid glands during unseen testing along with three additional non-parathyroid gland regions (90% positive predictive value). Subsequent time-fluorescence intensity profile extraction with vitality prediction was shown feasible in all cases within 5 min, with a 96.3% model accuracy (sensitivity and specificity were 93.3% and 100%, respectively) when compared with expert judgement.

Conclusion: Automatic parathyroid gland perfusion quantification using simple machine learning computational methods discriminates parathyroid gland perfusion in concordance with expert surgeon interpretation, providing a means for near-infrared indocyanine green-parathyroid gland signal evaluation.
