ABSTRACT:
SUBMITTER: Zhao H
PROVIDER: S-EPMC9044334 | biostudies-literature | 2022
REPOSITORIES: biostudies-literature
Zhao Hong, Chen Zhiwen, Guo Lan, Han Zeyu
PeerJ Computer Science, 2022-03-16
Global encoding of visual features is important for improving description accuracy in video captioning. In this paper, we propose a video captioning method that combines a Vision Transformer (ViT) with reinforcement learning. First, ResNet-152 and ResNeXt-101 are used to extract features from videos. Second, the encoding block of the ViT network is applied to encode the video features. Third, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a video content ...
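
The idea of "global encoding" in the second step can be illustrated with a minimal sketch of self-attention, the operation at the core of a ViT encoder block. This is not the authors' implementation: it assumes a single attention head with identity projections (no learned weights), omits layer normalization and the feed-forward sublayer, and leaves out the ResNet/ResNeXt feature extraction and the LSTM decoder entirely. The names `self_attention` and `frame_features`, and the toy 4-dimensional features, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Single-head self-attention with identity projections (an assumption
    made for brevity; a real ViT block learns query/key/value projections).

    Each output vector is a weighted average of ALL input frame vectors,
    which is what makes the encoding global rather than per-frame.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    encoded = []
    for query in frames:
        # Scaled dot-product similarity between this frame and every frame.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
        weights = softmax(scores)
        # Mix all frames according to the attention weights.
        mixed = [
            sum(w * frame[i] for w, frame in zip(weights, frames))
            for i in range(dim)
        ]
        encoded.append(mixed)
    return encoded


# Hypothetical toy input: 3 frames with 4-dimensional features
# (real CNN features would have thousands of dimensions).
frame_features = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]
encoded_features = self_attention(frame_features)
```

In this sketch the first encoded frame picks up a nonzero second component even though the first input frame had none there, because every output attends to every input frame.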