Dataset Information

DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data.


ABSTRACT: Bacteriophages (phages) are the most abundant and diverse biological entities on Earth. Because universal gene markers and database representatives are lacking, about 50-90% of phage genes cannot be assigned functions. This makes it challenging to identify phage genomes and annotate the functions of phage genes efficiently by homology search on a large scale, especially for novel phages. Portal (portal protein), TerL (large terminase subunit protein), and TerS (small terminase subunit protein) are three proteins specific to Caudovirales phages. Here, we developed a CNN (convolutional neural network)-based framework, DeephageTP, to identify these three proteins from metagenomic data. The framework takes one-hot encodings of the original protein sequences as input and automatically extracts predictive features during modeling. To reduce false positives, a cutoff-loss-value strategy is introduced based on the distributions of the loss values of protein sequences within the same category. The proposed model with a set of cutoff-loss-values achieves high Precision in identifying TerL and Portal sequences (94% and 90%, respectively) from the mimic metagenomic dataset. Finally, we tested the efficacy of the framework on three real metagenomic datasets; the results showed that, compared to conventional alignment-based methods, our framework had a particular advantage in identifying novel phage-specific Portal and TerL protein sequences with remote homology to their counterparts in the training datasets. In summary, our study is the first to develop a CNN-based framework for identifying phage-specific protein sequences with high complexity and low conservation, and this framework will help find novel phages in metagenomic sequencing data. DeephageTP is available at https://github.com/chuym726/DeephageTP.
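The two mechanisms named in the abstract, one-hot encoding of protein sequences and a per-category loss cutoff, can be illustrated with a minimal sketch. This is not the DeephageTP code: the 20-letter alphabet, the zero-padding scheme, the use of cross-entropy on the predicted class as the "loss value", the function names, and the cutoff numbers are all assumptions made for illustration only.

```python
import math

# Assumption: the 20 standard amino acids; DeephageTP's actual alphabet may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Encode a protein sequence as a max_len x 20 matrix of 0/1 values.

    Sequences longer than max_len are truncated. Padding positions and
    residues outside the alphabet stay as all-zero rows (an assumption).
    """
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, residue in enumerate(sequence[:max_len].upper()):
        idx = AA_INDEX.get(residue)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix


def cross_entropy_loss(probabilities, class_index):
    """Cross-entropy of one prediction with respect to a given class."""
    return -math.log(max(probabilities[class_index], 1e-12))


def classify_with_cutoff(probabilities, labels, cutoffs, fallback="other"):
    """Accept the top-scoring label only if its loss is within that label's cutoff.

    `cutoffs` maps a label to a hypothetical maximum allowed loss; labels
    without an entry are accepted unconditionally.
    """
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    label = labels[best]
    loss = cross_entropy_loss(probabilities, best)
    if label in cutoffs and loss > cutoffs[label]:
        return fallback  # prediction too uncertain: treat as a likely false positive
    return label


if __name__ == "__main__":
    labels = ["TerL", "Portal", "TerS", "other"]
    # Hypothetical cutoffs; in practice these would be derived from the
    # distribution of loss values observed within each category.
    cutoffs = {"TerL": 0.1, "Portal": 0.1, "TerS": 0.1}
    encoded = one_hot_encode("MKTAYIAK", max_len=12)
    print(len(encoded), len(encoded[0]))
    print(classify_with_cutoff([0.95, 0.03, 0.01, 0.01], labels, cutoffs))
    print(classify_with_cutoff([0.60, 0.20, 0.10, 0.10], labels, cutoffs))
```

In this sketch a tighter cutoff trades recall for Precision: fewer sequences are accepted as TerL, Portal, or TerS, but those that are accepted were predicted with lower loss.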

SUBMITTER: Chu Y 

PROVIDER: S-EPMC9188312 | biostudies-literature | 2022

REPOSITORIES: biostudies-literature

Publications

DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data.

Chu Yunmeng, Guo Shun, Cui Dachao, Fu Xiongfei, Ma Yingfei

PeerJ, 2022-06-08


Similar Datasets

| S-EPMC9669054 | biostudies-literature
| S-EPMC7937620 | biostudies-literature
| S-EPMC7432192 | biostudies-literature
| 2443187 | ecrin-mdr-crc
| S-EPMC10830981 | biostudies-literature
| S-EPMC8249591 | biostudies-literature
| S-EPMC9672459 | biostudies-literature
| S-EPMC10973583 | biostudies-literature
| S-EPMC11329666 | biostudies-literature
2021-01-11 | GSE147113 | GEO