Abstract
Human action recognition is the task of labeling video sequences with action labels. It is a challenging research topic because video backgrounds are often cluttered, which degrades the performance of traditional human action recognition methods. In this paper, we propose a novel spatiotemporal saliency-based multi-stream ResNet (STS), which combines three streams (i.e., a spatial stream, a temporal stream and a spatiotemporal saliency stream) for human action recognition. Further, we propose a novel spatiotemporal saliency-based multi-stream ResNet with attention-aware long short-term memory (STS-ALSTM) network. The proposed STS-ALSTM model combines deep convolutional neural network (CNN) feature extractors with three attention-aware LSTMs to capture the long-term temporal dependencies between consecutive video frames, optical flow frames and spatiotemporal saliency frames. Experimental results on the UCF-101 and HMDB-51 datasets demonstrate that the proposed STS method and STS-ALSTM model obtain competitive performance compared with state-of-the-art methods.
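The fusion idea described above (three per-frame feature streams, temporal attention weighting, and combination into a single class prediction) can be sketched in plain NumPy. This is a minimal illustration, not the paper's model: the dot-product attention parameterization, the averaging late fusion, and all variable names are assumptions, and the LSTM recurrence of the actual STS-ALSTM is omitted in favor of simple attention pooling over frames.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(features, w):
    # features: (T, D) per-frame CNN features for one stream
    # w: (D,) attention projection (hypothetical parameterization;
    #    the paper uses attention-aware LSTMs instead)
    scores = features @ w              # (T,) relevance score per frame
    alpha = softmax(scores)            # attention weights over time
    return alpha @ features            # (D,) attention-weighted descriptor

def fuse_streams(stream_feats, attn_ws, classifier):
    # stream_feats: list of (T, D) arrays, e.g. RGB, optical flow,
    #               and spatiotemporal saliency features
    # classifier: (C, D) shared linear classifier (assumption; each
    #             stream could also have its own)
    pooled = [attention_pool(f, w) for f, w in zip(stream_feats, attn_ws)]
    logits = [classifier @ p for p in pooled]      # (C,) per stream
    return softmax(np.mean(logits, axis=0))        # late fusion by averaging
```

For example, with T = 8 frames, D = 16 feature dimensions and C = 5 action classes, `fuse_streams` returns a length-5 probability vector over actions.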
Original language | English |
---|---|
Pages (from-to) | 14593-14602 |
Number of pages | 10 |
Journal | Neural Computing and Applications |
Volume | 32 |
Issue number | 18 |
DOIs | |
Publication status | Published - 1 Sept 2020 |
Externally published | Yes |
Keywords
- Action recognition
- Attention-aware
- LSTM
- Multi-stream
- Spatiotemporal saliency
ASJC Scopus subject areas
- Software
- Artificial Intelligence