k-NN attention-based video vision transformer for action recognition

Weirong Sun, Yujun Ma, Ruili Wang

Research output: Journal Publication, Article, peer-review

9 Citations (Scopus)

Abstract

Action recognition aims to understand human behavior and predict a label for each action. Recently, the Vision Transformer (ViT) has achieved remarkable performance on action recognition by modeling long sequences of tokens across the spatial and temporal dimensions of a video. The fully-connected self-attention layer is the core component of the vanilla Transformer. However, this redundant architecture ignores the locality of video frame patches, so it attends to non-informative tokens and potentially increases computational complexity. To solve this problem, we propose a k-NN attention-based Video Vision Transformer (k-ViViT) network for action recognition. We replace the original self-attention in the Video Vision Transformer (ViViT) with k-NN attention, which can optimize the training process and ignore irrelevant or noisy tokens in the input sequence. We conduct experiments on the UCF101 and HMDB51 datasets to verify the effectiveness of our model. The experimental results show that the proposed k-ViViT achieves superior accuracy compared to several state-of-the-art models on these action recognition datasets.
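The abstract does not spell out how k-NN attention is computed. As a rough illustration only, the sketch below assumes the common formulation in which each query keeps its k highest-scoring keys and all other scores are masked out before the softmax. The function name `knn_attention`, the parameter `top_k`, and the toy shapes are hypothetical and are not taken from the paper.

```python
import numpy as np


def knn_attention(q, k, v, top_k):
    """Single-head attention where each query attends only to its top_k keys.

    q, k, v: arrays of shape (n_tokens, dim).
    Assumption: k-NN selection means keeping the top_k largest scaled
    dot-product scores in each row and masking the rest before softmax.
    """
    n_tokens, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)              # (n_tokens, n_tokens)
    top_k = min(top_k, n_tokens)

    # Per-row threshold: the top_k-th largest score of each query.
    kth = np.partition(scores, -top_k, axis=-1)[:, -top_k][:, None]

    # Scores below the threshold become -inf, so they get zero weight.
    masked = np.where(scores >= kth, scores, -np.inf)

    # Numerically stable softmax over the surviving scores.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Toy input: 8 tokens of dimension 4 (hypothetical sizes).
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4))
k = rng.standard_normal((8, 4))
v = rng.standard_normal((8, 4))
out, weights = knn_attention(q, k, v, top_k=3)
```

With `top_k` equal to the number of tokens, the mask removes nothing and the function reduces to ordinary softmax self-attention, which makes the relationship to the vanilla layer easy to see.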

Original language: English
Article number: 127256
Journal: Neurocomputing
Volume: 574
Publication status: Published - 14 Mar 2024
Externally published: Yes

Keywords

  • Action recognition
  • Attention mechanism
  • Transformer
  • Vision transformer

ASJC Scopus subject areas

  • Computer Science Applications
  • Cognitive Neuroscience
  • Artificial Intelligence
