Abstract
Two-stream convolutional neural networks have been widely applied to action recognition. However, the two streams are usually used to capture spatial and temporal information separately, ignoring the strong complementarity and correlation between spatial and temporal information in videos. To address this problem, we propose a Spatial-Temporal Interaction Learning Two-stream network (STILT) for action recognition. The proposed two-stream network (i.e., a spatial stream and a temporal stream) contains a spatial-temporal interaction learning module, which uses an alternating co-attention mechanism between the two streams to learn the correlation between spatial and temporal features. This module allows the two streams to guide each other, producing optimized spatial and temporal attention features. The network thus establishes an interactive connection between the two streams and efficiently exploits the attended spatial and temporal features to improve recognition accuracy. Experiments on three widely used datasets (UCF101, HMDB51 and Kinetics) show that the proposed network outperforms state-of-the-art models in action recognition.
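The abstract does not give the module's equations, but the alternating co-attention idea it describes can be sketched as follows: a summary of one stream guides an attention pooling over the other stream's features, and the attended result then guides attention back over the first stream. This is a minimal NumPy sketch under assumed shapes and a hypothetical bilinear compatibility matrix `W`, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(features, guide, W):
    # Score each feature location against the guide vector via a
    # bilinear form, then pool the features with the attention weights.
    scores = features @ W @ guide            # (n,)
    weights = softmax(scores)                # (n,), sums to 1
    return weights @ features, weights       # attended summary (d,), weights

rng = np.random.default_rng(0)
d, n_s, n_t = 8, 49, 10                      # assumed feature dim and location counts
S = rng.standard_normal((n_s, d))            # spatial-stream features (e.g. a 7x7 map)
T = rng.standard_normal((n_t, d))            # temporal-stream features (e.g. 10 snippets)
W = rng.standard_normal((d, d)) * 0.1        # hypothetical compatibility matrix

# Alternating co-attention: a temporal summary guides spatial attention,
# then the attended spatial feature guides temporal attention.
t_summary = T.mean(axis=0)
s_att, s_w = attend(S, t_summary, W)         # spatial features attended under temporal guidance
t_att, t_w = attend(T, s_att, W)             # temporal features attended under spatial guidance
```

The two attended vectors `s_att` and `t_att` would then feed the downstream classifier; in the paper the guidance presumably alternates within a learned module rather than using a fixed mean summary.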
Original language | English |
---|---|
Pages (from-to) | 864-876 |
Number of pages | 13 |
Journal | Information Sciences |
Volume | 606 |
DOIs | |
Publication status | Published - Aug 2022 |
Externally published | Yes |
Keywords
- Action recognition
- Spatial-temporal
- Two-stream CNNs
ASJC Scopus subject areas
- Software
- Control and Systems Engineering
- Theoretical Computer Science
- Computer Science Applications
- Information Systems and Management
- Artificial Intelligence