Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition

Yujun Ma; Benjia Zhou; Ruili Wang; Pichao Wang

doi:10.1145/3581783.3612301

Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition

Yujun Ma, Benjia Zhou, Ruili Wang, Pichao Wang

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

3 Citations (Scopus)

Abstract

RGB-D action and gesture recognition remain an interesting topic in human-centered scene understanding, primarily due to the multiple granularities and large variation in human motion. Although many RGB-D based action and gesture recognition approaches have demonstrated remarkable results by utilizing highly integrated spatio-temporal representations across multiple modalities (i.e., RGB and depth data), they still encounter several challenges. Firstly, vanilla 3D convolution makes it hard to capture fine-grained motion differences between local clips under different modalities. Secondly, the intricate nature of highly integrated spatio-temporal modeling can lead to optimization difficulties. Thirdly, duplicate and unnecessary information can add complexity and complicate entangled spatio-temporal modeling. To address the above issues, we propose an innovative heuristic architecture called Multi-stage Factorized Spatio-Temporal (MFST) for RGB-D action and gesture recognition. The proposed MFST model comprises a 3D Central Difference Convolution Stem (CDC-Stem) module and multiple factorized spatio-temporal stages. The CDC-Stem enriches fine-grained temporal perception, and the multiple hierarchical spatio-temporal stages construct dimension-independent higher-order semantic primitives. Specifically, the CDC-Stem module captures bottom-level spatio-temporal features and passes them successively to the following spatio-temporal factored stages to capture the hierarchical spatial and temporal features through the Multi-Scale Convolution and Transformer (MSC-Trans) hybrid block and Weight-shared Multi-Scale Transformer (WMS-Trans) block. The seamless integration of these innovative designs results in a robust spatio-temporal representation that outperforms state-of-the-art approaches on RGB-D action and gesture recognition datasets.

Original language	English
Title of host publication	MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
Publisher	Association for Computing Machinery, Inc
Pages	3149-3160
Number of pages	12
ISBN (Electronic)	9798400701085
DOIs	https://doi.org/10.1145/3581783.3612301
Publication status	Published - 26 Oct 2023
Externally published	Yes
Event	31st ACM International Conference on Multimedia, MM 2023 - Ottawa, Canada Duration: 29 Oct 2023 → 3 Nov 2023

Publication series

Name	MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia

Conference

Conference	31st ACM International Conference on Multimedia, MM 2023
Country/Territory	Canada
City	Ottawa
Period	29/10/23 → 3/11/23

Keywords

multi-modal representation
rgb-d action and gesture recognition
spatio-temporal

ASJC Scopus subject areas

Artificial Intelligence
Computer Graphics and Computer-Aided Design
Human-Computer Interaction
Software

Access to Document

10.1145/3581783.3612301

Cite this

Ma, Y., Zhou, B., Wang, R., & Wang, P. (2023). Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition. In MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia (pp. 3149-3160). (MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia). Association for Computing Machinery, Inc. https://doi.org/10.1145/3581783.3612301

@inproceedings{8b01e130faff40aaaad72aca01be3393,

title = "Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition",

abstract = "RGB-D action and gesture recognition remain an interesting topic in human-centered scene understanding, primarily due to the multiple granularities and large variation in human motion. Although many RGB-D based action and gesture recognition approaches have demonstrated remarkable results by utilizing highly integrated spatio-temporal representations across multiple modalities (i.e., RGB and depth data), they still encounter several challenges. Firstly, vanilla 3D convolution makes it hard to capture fine-grained motion differences between local clips under different modalities. Secondly, the intricate nature of highly integrated spatio-temporal modeling can lead to optimization difficulties. Thirdly, duplicate and unnecessary information can add complexity and complicate entangled spatio-temporal modeling. To address the above issues, we propose an innovative heuristic architecture called Multi-stage Factorized Spatio-Temporal (MFST) for RGB-D action and gesture recognition. The proposed MFST model comprises a 3D Central Difference Convolution Stem (CDC-Stem) module and multiple factorized spatio-temporal stages. The CDC-Stem enriches fine-grained temporal perception, and the multiple hierarchical spatio-temporal stages construct dimension-independent higher-order semantic primitives. Specifically, the CDC-Stem module captures bottom-level spatio-temporal features and passes them successively to the following spatio-temporal factored stages to capture the hierarchical spatial and temporal features through the Multi-Scale Convolution and Transformer (MSC-Trans) hybrid block and Weight-shared Multi-Scale Transformer (WMS-Trans) block. The seamless integration of these innovative designs results in a robust spatio-temporal representation that outperforms state-of-the-art approaches on RGB-D action and gesture recognition datasets.",

keywords = "multi-modal representation, rgb-d action and gesture recognition, spatio-temporal",

author = "Yujun Ma and Benjia Zhou and Ruili Wang and Pichao Wang",

note = "Publisher Copyright: {\textcopyright} 2023 ACM.; 31st ACM International Conference on Multimedia, MM 2023 ; Conference date: 29-10-2023 Through 03-11-2023",

year = "2023",

month = oct,

day = "26",

doi = "10.1145/3581783.3612301",

language = "English",

series = "MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia",

publisher = "Association for Computing Machinery, Inc",

pages = "3149--3160",

booktitle = "MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia",

}

Ma, Y, Zhou, B, Wang, R & Wang, P 2023, Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition. in MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia. MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia, Association for Computing Machinery, Inc, pp. 3149-3160, 31st ACM International Conference on Multimedia, MM 2023, Ottawa, Canada, 29/10/23. https://doi.org/10.1145/3581783.3612301

Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition. / Ma, Yujun; Zhou, Benjia; Wang, Ruili et al.
MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia. Association for Computing Machinery, Inc, 2023. p. 3149-3160 (MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia).

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

TY - GEN

T1 - Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition

AU - Ma, Yujun

AU - Zhou, Benjia

AU - Wang, Ruili

AU - Wang, Pichao

PY - 2023/10/26

Y1 - 2023/10/26

N2 - RGB-D action and gesture recognition remain an interesting topic in human-centered scene understanding, primarily due to the multiple granularities and large variation in human motion. Although many RGB-D based action and gesture recognition approaches have demonstrated remarkable results by utilizing highly integrated spatio-temporal representations across multiple modalities (i.e., RGB and depth data), they still encounter several challenges. Firstly, vanilla 3D convolution makes it hard to capture fine-grained motion differences between local clips under different modalities. Secondly, the intricate nature of highly integrated spatio-temporal modeling can lead to optimization difficulties. Thirdly, duplicate and unnecessary information can add complexity and complicate entangled spatio-temporal modeling. To address the above issues, we propose an innovative heuristic architecture called Multi-stage Factorized Spatio-Temporal (MFST) for RGB-D action and gesture recognition. The proposed MFST model comprises a 3D Central Difference Convolution Stem (CDC-Stem) module and multiple factorized spatio-temporal stages. The CDC-Stem enriches fine-grained temporal perception, and the multiple hierarchical spatio-temporal stages construct dimension-independent higher-order semantic primitives. Specifically, the CDC-Stem module captures bottom-level spatio-temporal features and passes them successively to the following spatio-temporal factored stages to capture the hierarchical spatial and temporal features through the Multi-Scale Convolution and Transformer (MSC-Trans) hybrid block and Weight-shared Multi-Scale Transformer (WMS-Trans) block. The seamless integration of these innovative designs results in a robust spatio-temporal representation that outperforms state-of-the-art approaches on RGB-D action and gesture recognition datasets.

AB - RGB-D action and gesture recognition remain an interesting topic in human-centered scene understanding, primarily due to the multiple granularities and large variation in human motion. Although many RGB-D based action and gesture recognition approaches have demonstrated remarkable results by utilizing highly integrated spatio-temporal representations across multiple modalities (i.e., RGB and depth data), they still encounter several challenges. Firstly, vanilla 3D convolution makes it hard to capture fine-grained motion differences between local clips under different modalities. Secondly, the intricate nature of highly integrated spatio-temporal modeling can lead to optimization difficulties. Thirdly, duplicate and unnecessary information can add complexity and complicate entangled spatio-temporal modeling. To address the above issues, we propose an innovative heuristic architecture called Multi-stage Factorized Spatio-Temporal (MFST) for RGB-D action and gesture recognition. The proposed MFST model comprises a 3D Central Difference Convolution Stem (CDC-Stem) module and multiple factorized spatio-temporal stages. The CDC-Stem enriches fine-grained temporal perception, and the multiple hierarchical spatio-temporal stages construct dimension-independent higher-order semantic primitives. Specifically, the CDC-Stem module captures bottom-level spatio-temporal features and passes them successively to the following spatio-temporal factored stages to capture the hierarchical spatial and temporal features through the Multi-Scale Convolution and Transformer (MSC-Trans) hybrid block and Weight-shared Multi-Scale Transformer (WMS-Trans) block. The seamless integration of these innovative designs results in a robust spatio-temporal representation that outperforms state-of-the-art approaches on RGB-D action and gesture recognition datasets.

KW - multi-modal representation

KW - rgb-d action and gesture recognition

KW - spatio-temporal

UR - http://www.scopus.com/inward/record.url?scp=85179546507&partnerID=8YFLogxK

U2 - 10.1145/3581783.3612301

DO - 10.1145/3581783.3612301

M3 - Conference contribution

AN - SCOPUS:85179546507

T3 - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia

SP - 3149

EP - 3160

BT - MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia

PB - Association for Computing Machinery, Inc

T2 - 31st ACM International Conference on Multimedia, MM 2023

Y2 - 29 October 2023 through 3 November 2023

ER -

Ma Y, Zhou B, Wang R, Wang P. Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition. In MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia. Association for Computing Machinery, Inc. 2023. p. 3149-3160. (MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia). doi: 10.1145/3581783.3612301

Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition

Abstract

Publication series

Conference

Keywords

ASJC Scopus subject areas

Access to Document

Other files and links

Fingerprint

Cite this