Spatial-temporal interaction learning based two-stream network for action recognition

Tianyu Liu, Yujun Ma, Wenhan Yang, Wanting Ji, Ruili Wang, Ping Jiang

Research output: Journal Publication › Article › peer-review

44 Citations (Scopus)

Abstract

Two-stream convolutional neural networks have been widely applied to action recognition. However, two-stream networks usually capture spatial information and temporal information separately, which ignores the strong complementarity and correlation between the two in videos. To solve this problem, we propose a Spatial-Temporal Interaction Learning Two-stream network (STILT) for action recognition. The proposed network, which consists of a spatial stream and a temporal stream, has a spatial–temporal interaction learning module that uses an alternating co-attention mechanism between the two streams to learn the correlation between spatial features and temporal features. This module allows the two streams to guide each other and then generates optimized spatial attention features and temporal attention features. The proposed network can thus establish an interactive connection between the two streams and efficiently exploit the attended spatial and temporal features to improve recognition accuracy. Experiments on three widely used datasets (i.e., UCF101, HMDB51 and Kinetics) show that the proposed network outperforms the state-of-the-art models in action recognition.
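
The alternating co-attention idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is a generic illustration, not the STILT module itself: the abstract does not specify the architecture, so the function names (`attend`, `alternating_co_attention`), the feature shapes, the mean-pooled initial guide, and the scaled dot-product scoring are all assumptions made for this example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(features, guide):
    """Weight each row of `features` (N, D) by its similarity to `guide` (D,).

    Returns the attention-weighted sum of rows (D,) and the weights (N,).
    Scaled dot-product scoring is an assumption of this sketch.
    """
    scores = features @ guide / np.sqrt(features.shape[1])
    weights = softmax(scores)
    return weights @ features, weights

def alternating_co_attention(spatial, temporal):
    """Hypothetical alternating co-attention between two feature sets."""
    # Step 1: summarize the spatial stream without guidance (mean pooling).
    guide = spatial.mean(axis=0)
    # Step 2: attend to temporal features, guided by the spatial summary.
    temporal_att, w_t = attend(temporal, guide)
    # Step 3: attend to spatial features, guided by the attended temporal feature.
    spatial_att, w_s = attend(spatial, temporal_att)
    return spatial_att, temporal_att, w_s, w_t

# Toy inputs: 49 spatial locations and 10 temporal steps, 64 channels each
# (shapes are illustrative, not taken from the paper).
rng = np.random.default_rng(0)
spatial = rng.standard_normal((49, 64))
temporal = rng.standard_normal((10, 64))
s_att, t_att, w_s, w_t = alternating_co_attention(spatial, temporal)
```

In this sketch each stream's attention weights depend on the other stream's features, which is one simple way to read "the two streams guide each other"; a trained model would use learned projections rather than raw dot products.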

Original language: English
Pages (from-to): 864-876
Number of pages: 13
Journal: Information Sciences
Volume: 606
DOIs
Publication status: Published - Aug 2022
Externally published: Yes

Keywords

  • Action recognition
  • Spatial-temporal
  • Two-stream CNNs

ASJC Scopus subject areas

  • Software
  • Control and Systems Engineering
  • Theoretical Computer Science
  • Computer Science Applications
  • Information Systems and Management
  • Artificial Intelligence
