TY - GEN
T1 - Self-supervised learning of dynamic representations for static images
AU - Song, Siyang
AU - Sanchez, Enrique
AU - Shen, Linlin
AU - Valstar, Michel
N1 - Publisher Copyright:
© 2020 IEEE
PY - 2020
Y1 - 2020
AB - Facial actions are spatio-temporal signals by nature, and their modeling therefore depends crucially on the availability of temporal information. In this paper, we focus on inferring the temporal dynamics of facial actions when no explicit temporal information is available, i.e., from still images. We present a novel self-supervised learning approach that captures multiple scales of temporal dynamics, with applications to facial Action Unit (AU) intensity estimation and dimensional affect estimation. In particular: 1. We propose a framework that infers a dynamic representation (DR) from a still image, capturing the bi-directional flow of time within a short time window centered on the input image; 2. We show that the proposed rank loss can exploit facial temporal evolution to self-supervise the training process without requiring target representations, allowing the network to represent dynamics more broadly; 3. We propose a multiple-temporal-scale approach that infers DRs for different window lengths (MDR) from a still image. We empirically validate the value of our approach on the task of frame ranking, and show that the proposed MDR attains state-of-the-art results on BP4D for AU intensity estimation and on SEMAINE for dimensional affect estimation, using only still images at test time.
UR - http://www.scopus.com/inward/record.url?scp=85110442260&partnerID=8YFLogxK
DO - 10.1109/ICPR48806.2021.9412942
M3 - Conference contribution
AN - SCOPUS:85110442260
T3 - Proceedings - International Conference on Pattern Recognition
SP - 1619
EP - 1626
BT - Proceedings of ICPR 2020 - 25th International Conference on Pattern Recognition
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 25th International Conference on Pattern Recognition, ICPR 2020
Y2 - 10 January 2021 through 15 January 2021
ER -