Abstract
Human action recognition is the task of classifying the actions performed in a video. Recently, the Vision Transformer (ViT) has been applied to action recognition. However, ViT is ill-suited to high-resolution input videos because of its computational cost: it splits frames into fixed-size patch embeddings (i.e., tokens) carrying absolute-position information and models the relationships among these tokens with a pure Transformer encoder. To address this issue, we propose a relative-position-embedding-based spatially and temporally decoupled Transformer (RPE-STDT) for action recognition, which captures spatial–temporal information through stacked self-attention layers. The proposed RPE-STDT model consists of two separate series of Transformer encoders. The first series, the spatial Transformer encoders, models interactions between tokens extracted from the same temporal index. The second series, the temporal Transformer encoders, models interactions across the temporal dimension with a subsampling strategy. Furthermore, we replace the absolute-position embeddings of the Vision Transformer encoders with the proposed relative-position embeddings, which capture the order of the embedded tokens while reducing computational cost. Finally, we conduct thorough ablation studies. RPE-STDT achieves state-of-the-art results on multiple action recognition datasets, exceeding prior convolutional and Transformer-based networks.
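The sketch below is a minimal illustration of the decoupling idea described in the abstract: spatial attention within each frame, followed by temporal attention across a subsampled set of frames. It is not the authors' implementation; the module names, layer counts, dimensions, and the temporal subsampling stride are illustrative assumptions, and the paper's relative-position embeddings are omitted for brevity.

```python
# Minimal sketch (not the authors' code): a spatially and temporally
# decoupled Transformer encoder with temporal subsampling.
# All hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn


class DecoupledSpatioTemporalEncoder(nn.Module):
    def __init__(self, dim=768, heads=12, spatial_layers=4,
                 temporal_layers=4, temporal_stride=2):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        # First series: spatial encoders attend over tokens sharing a frame index.
        self.spatial = nn.TransformerEncoder(make_layer(), num_layers=spatial_layers)
        # Second series: temporal encoders attend across time after subsampling frames.
        self.temporal = nn.TransformerEncoder(make_layer(), num_layers=temporal_layers)
        self.temporal_stride = temporal_stride

    def forward(self, tokens):
        # tokens: (batch, frames, patches, dim) patch embeddings of a clip
        b, t, p, d = tokens.shape
        # Spatial attention: fold time into the batch so attention stays per frame.
        x = self.spatial(tokens.reshape(b * t, p, d)).reshape(b, t, p, d)
        # Subsample the temporal axis to cut the number of tokens the
        # temporal encoders must attend over.
        x = x[:, ::self.temporal_stride]
        # Temporal attention: fold patches into the batch, attend across frames.
        b, t2, p, d = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b * p, t2, d)
        x = self.temporal(x).reshape(b, p, t2, d).permute(0, 2, 1, 3)
        return x  # (batch, subsampled frames, patches, dim)


if __name__ == "__main__":
    clip = torch.randn(2, 8, 196, 768)   # 2 clips, 8 frames, 14x14 patches
    feats = DecoupledSpatioTemporalEncoder()(clip)
    print(feats.shape)                   # torch.Size([2, 4, 196, 768])
```

Factorizing attention this way keeps each self-attention operation over either the patches of one frame or the frames at one patch location, rather than over all frames and patches jointly, which is the source of the computational savings the abstract refers to.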
Original language | English |
---|---|
Article number | 109905 |
Journal | Pattern Recognition |
Volume | 145 |
DOIs | |
Publication status | Published - Jan 2024 |
Externally published | Yes |
Keywords
- Relative-position embedding
- Spatial–temporal features
- Subsampling
- Transformer
ASJC Scopus subject areas
- Software
- Signal Processing
- Computer Vision and Pattern Recognition
- Artificial Intelligence