Orthogonalization-Guided Feature Fusion Network for Multimodal 2D+3D Facial Expression Recognition

Shisong Lin; Mengchao Bai; Feng Liu; Linlin Shen; Yicong Zhou

doi:10.1109/TMM.2020.3001497

Research output: Journal Publication › Article › peer-review

18 Citations (Scopus)

Abstract

As 2D and 3D data present different views of the same face, the features extracted from them can be both complementary and redundant. In this paper, we present a novel and efficient orthogonalization-guided feature fusion network, namely OGF^2Net, to fuse the features extracted from 2D and 3D faces for facial expression recognition. While 2D texture maps are fed into a 2D feature extraction pipeline (FE2DNet), the attribute maps generated from 3D data are concatenated as input of the 3D feature extraction pipeline (FE3DNet). The two networks are separately trained at the first stage and frozen in the second stage for late feature fusion, which can well address the unavailability of a large number of 3D+2D face pairs. To reduce the redundancies among features extracted from 2D and 3D streams, we design an orthogonal loss-guided feature fusion network to orthogonalize the features before fusing them. Experimental results show that the proposed method significantly outperforms the state-of-the-art algorithms on both the BU-3DFE and Bosphorus databases. While accuracies as high as 89.05% (P1 protocol) and 89.07% (P2 protocol) are achieved on the BU-3DFE database, an accuracy of 89.28% is achieved on the Bosphorus database. The complexity analysis also suggests that our approach achieves a higher processing speed while simultaneously requiring lower memory costs.
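The abstract does not give the form of the orthogonal loss, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes the penalty is the mean squared cosine similarity between paired 2D and 3D feature vectors, and that fusion is a plain concatenation; the names `orthogonality_penalty` and `fuse` are hypothetical.

```python
import numpy as np


def orthogonality_penalty(f2d, f3d, eps=1e-12):
    """Mean squared cosine similarity between paired feature vectors.

    ASSUMPTION: this is one common way to encourage orthogonality between
    two feature streams; the paper's actual loss may differ.
    f2d, f3d: arrays of shape (batch, dim) with the same shape.
    Returns 0.0 when every pair is orthogonal, 1.0 when every pair is parallel.
    """
    f2d = np.asarray(f2d, dtype=float)
    f3d = np.asarray(f3d, dtype=float)
    dot = np.sum(f2d * f3d, axis=1)
    norms = np.linalg.norm(f2d, axis=1) * np.linalg.norm(f3d, axis=1)
    cos = dot / np.maximum(norms, eps)  # guard against zero-length vectors
    return float(np.mean(cos ** 2))


def fuse(f2d, f3d):
    """Late fusion by concatenation along the feature axis (an assumption)."""
    return np.concatenate([np.asarray(f2d), np.asarray(f3d)], axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=(4, 8))  # stand-in for frozen 2D-stream features
    b = rng.normal(size=(4, 8))  # stand-in for frozen 3D-stream features
    print("penalty:", orthogonality_penalty(a, b))
    print("fused shape:", fuse(a, b).shape)
```

In a trainable setting, a penalty of this kind would typically be added to the classification loss so that the fused features carry less redundant information; how the two terms are weighted is not stated in the abstract.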

Original language	English
Article number	9115253
Pages (from-to)	1581-1591
Number of pages	11
Journal	IEEE Transactions on Multimedia
Volume	23
DOIs	https://doi.org/10.1109/TMM.2020.3001497
Publication status	Published - 2021
Externally published	Yes

Keywords

Multimodal facial expression recognition
feature fusion

ASJC Scopus subject areas

Signal Processing
Media Technology
Computer Science Applications
Electrical and Electronic Engineering

Cite this

@article{afe2552df9e24ac683119bffe1673a2f,

title = "Orthogonalization-Guided Feature Fusion Network for Multimodal 2D+3D Facial Expression Recognition",

abstract = "As 2D and 3D data present different views of the same face, the features extracted from them can be both complementary and redundant. In this paper, we present a novel and efficient orthogonalization-guided feature fusion network, namely OGF^2Net, to fuse the features extracted from 2D and 3D faces for facial expression recognition. While 2D texture maps are fed into a 2D feature extraction pipeline (FE2DNet), the attribute maps generated from 3D data are concatenated as input of the 3D feature extraction pipeline (FE3DNet). The two networks are separately trained at the first stage and frozen in the second stage for late feature fusion, which can well address the unavailability of a large number of 3D+2D face pairs. To reduce the redundancies among features extracted from 2D and 3D streams, we design an orthogonal loss-guided feature fusion network to orthogonalize the features before fusing them. Experimental results show that the proposed method significantly outperforms the state-of-the-art algorithms on both the BU-3DFE and Bosphorus databases. While accuracies as high as 89.05% (P1 protocol) and 89.07% (P2 protocol) are achieved on the BU-3DFE database, an accuracy of 89.28% is achieved on the Bosphorus database. The complexity analysis also suggests that our approach achieves a higher processing speed while simultaneously requiring lower memory costs.",

keywords = "Multimodal facial expression recognition, feature fusion",

author = "Shisong Lin and Mengchao Bai and Feng Liu and Linlin Shen and Yicong Zhou",

note = "Publisher Copyright: {\textcopyright} 1999-2012 IEEE.",

year = "2021",

doi = "10.1109/TMM.2020.3001497",

language = "English",

volume = "23",

pages = "1581--1591",

journal = "IEEE Transactions on Multimedia",

issn = "1520-9210",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Orthogonalization-Guided Feature Fusion Network for Multimodal 2D+3D Facial Expression Recognition

AU - Lin, Shisong

AU - Bai, Mengchao

AU - Liu, Feng

AU - Shen, Linlin

AU - Zhou, Yicong

PY - 2021

Y1 - 2021

N2 - As 2D and 3D data present different views of the same face, the features extracted from them can be both complementary and redundant. In this paper, we present a novel and efficient orthogonalization-guided feature fusion network, namely OGF^2Net, to fuse the features extracted from 2D and 3D faces for facial expression recognition. While 2D texture maps are fed into a 2D feature extraction pipeline (FE2DNet), the attribute maps generated from 3D data are concatenated as input of the 3D feature extraction pipeline (FE3DNet). The two networks are separately trained at the first stage and frozen in the second stage for late feature fusion, which can well address the unavailability of a large number of 3D+2D face pairs. To reduce the redundancies among features extracted from 2D and 3D streams, we design an orthogonal loss-guided feature fusion network to orthogonalize the features before fusing them. Experimental results show that the proposed method significantly outperforms the state-of-the-art algorithms on both the BU-3DFE and Bosphorus databases. While accuracies as high as 89.05% (P1 protocol) and 89.07% (P2 protocol) are achieved on the BU-3DFE database, an accuracy of 89.28% is achieved on the Bosphorus database. The complexity analysis also suggests that our approach achieves a higher processing speed while simultaneously requiring lower memory costs.

AB - As 2D and 3D data present different views of the same face, the features extracted from them can be both complementary and redundant. In this paper, we present a novel and efficient orthogonalization-guided feature fusion network, namely OGF^2Net, to fuse the features extracted from 2D and 3D faces for facial expression recognition. While 2D texture maps are fed into a 2D feature extraction pipeline (FE2DNet), the attribute maps generated from 3D data are concatenated as input of the 3D feature extraction pipeline (FE3DNet). The two networks are separately trained at the first stage and frozen in the second stage for late feature fusion, which can well address the unavailability of a large number of 3D+2D face pairs. To reduce the redundancies among features extracted from 2D and 3D streams, we design an orthogonal loss-guided feature fusion network to orthogonalize the features before fusing them. Experimental results show that the proposed method significantly outperforms the state-of-the-art algorithms on both the BU-3DFE and Bosphorus databases. While accuracies as high as 89.05% (P1 protocol) and 89.07% (P2 protocol) are achieved on the BU-3DFE database, an accuracy of 89.28% is achieved on the Bosphorus database. The complexity analysis also suggests that our approach achieves a higher processing speed while simultaneously requiring lower memory costs.

KW - Multimodal facial expression recognition

KW - feature fusion

UR - http://www.scopus.com/inward/record.url?scp=85104947893&partnerID=8YFLogxK

U2 - 10.1109/TMM.2020.3001497

DO - 10.1109/TMM.2020.3001497

M3 - Article

AN - SCOPUS:85104947893

SN - 1520-9210

VL - 23

SP - 1581

EP - 1591

JO - IEEE Transactions on Multimedia

JF - IEEE Transactions on Multimedia

M1 - 9115253

ER -
