End-To-End Audiovisual Feature Fusion for Active Speaker Detection

Fiseha B. Tesema; Zheyuan Lin; Shiqiang Zhu; Wei Song; Jason Gu; Hong Wu

doi:10.1117/12.2643881

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

2 Citations (Scopus)

Abstract

Active speaker detection plays a vital role in human-machine interaction. Recently, a few end-to-end audiovisual frameworks have emerged. However, the inference time of these models has not been explored, and their complexity and large input size make them unsuitable for real-time applications. In addition, they share a similar feature extraction strategy that applies a ConvNet to both the audio and visual inputs. This work presents a novel two-stream end-to-end framework that fuses features extracted from images via VGG-M with raw Mel-frequency cepstral coefficient (MFCC) features extracted from the audio waveform. The network has two BiGRU layers attached to each stream to handle each stream's temporal dynamics before fusion. After fusion, one BiGRU layer is attached to model the joint temporal dynamics. Experimental results on the AVA-ActiveSpeaker dataset indicate that our new feature extraction strategy is more robust to noisy signals and has a lower inference time than models that employ a ConvNet on both modalities. The proposed model produces a prediction within 44.41 ms, which is fast enough for real-time applications. Our best-performing model attained 88.929% accuracy, nearly the same detection result as state-of-the-art work.
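The data flow described in the abstract can be illustrated with a small shape-level sketch. It is not the authors' implementation: the weights are random and untrained, the per-frame visual vectors merely stand in for VGG-M features, and several details are assumptions that the abstract does not state, namely fusion by per-frame concatenation, audio and visual sequences aligned to the same length, a 13-dimensional MFCC vector, a 16-dimensional visual vector, a hidden size of 4, and a sigmoid scoring head. It also reads "two BiGRU layers attached to each stream" as two stacked layers per stream. All class and parameter names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class GRU:
    """One-directional GRU layer with random, untrained weights (biases omitted)."""

    def __init__(self, in_dim, hid, rng):
        self.hid = hid
        # One (input weights, recurrent weights) pair per gate:
        # update gate z, reset gate r, candidate state n.
        self.w = {g: (rand_matrix(hid, in_dim, rng), rand_matrix(hid, hid, rng))
                  for g in "zrn"}

    def step(self, x, h):
        (wz, uz), (wr, ur), (wn, un) = self.w["z"], self.w["r"], self.w["n"]
        z = [sigmoid(a + b) for a, b in zip(matvec(wz, x), matvec(uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(wr, x), matvec(ur, h))]
        gated_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(wn, x), matvec(un, gated_h))]
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def run(self, seq):
        h, out = [0.0] * self.hid, []
        for x in seq:
            h = self.step(x, h)
            out.append(h)
        return out


class BiGRU:
    """Runs one GRU forward and one backward in time, then joins their outputs."""

    def __init__(self, in_dim, hid, rng):
        self.fwd, self.bwd = GRU(in_dim, hid, rng), GRU(in_dim, hid, rng)

    def run(self, seq):
        f = self.fwd.run(seq)
        b = self.bwd.run(seq[::-1])[::-1]
        return [fi + bi for fi, bi in zip(f, b)]  # per-step list concatenation


class TwoStreamASD:
    """Hypothetical two-stream model: 2 BiGRU layers per stream, fuse, 1 joint BiGRU."""

    def __init__(self, visual_dim, audio_dim, hid, seed=0):
        rng = random.Random(seed)
        self.visual = [BiGRU(visual_dim, hid, rng), BiGRU(2 * hid, hid, rng)]
        self.audio = [BiGRU(audio_dim, hid, rng), BiGRU(2 * hid, hid, rng)]
        self.joint = BiGRU(4 * hid, hid, rng)  # input is the concatenated streams
        self.head = [rng.uniform(-0.1, 0.1) for _ in range(2 * hid)]

    def forward(self, visual_seq, audio_seq):
        for layer in self.visual:
            visual_seq = layer.run(visual_seq)
        for layer in self.audio:
            audio_seq = layer.run(audio_seq)
        # Assumed fusion: concatenate the two streams frame by frame.
        fused = [v + a for v, a in zip(visual_seq, audio_seq)]
        joint = self.joint.run(fused)
        # Assumed head: one speaking score in (0, 1) per frame.
        return [sigmoid(sum(w * x for w, x in zip(self.head, h))) for h in joint]


T = 5  # number of aligned frames in the toy clip
data_rng = random.Random(1)
visual = [[data_rng.random() for _ in range(16)] for _ in range(T)]  # stand-in for VGG-M features
audio = [[data_rng.random() for _ in range(13)] for _ in range(T)]   # stand-in for MFCC vectors
model = TwoStreamASD(visual_dim=16, audio_dim=13, hid=4)
scores = model.forward(visual, audio)
print(len(scores))  # one score per frame
```

Because each stream passes through its own recurrent layers before the concatenation, the joint BiGRU only has to model how the two already-summarised sequences relate over time, which is the separation the abstract describes.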

Original language	English
Title of host publication	Fourteenth International Conference on Digital Image Processing, ICDIP 2022
Editors	Xudong Jiang, Wenbing Tao, Deze Zeng, Yi Xie
Publisher	SPIE
ISBN (Electronic)	9781510657564
DOIs	https://doi.org/10.1117/12.2643881
Publication status	Published - 2022
Externally published	Yes
Event	14th International Conference on Digital Image Processing, ICDIP 2022 - Wuhan, China Duration: 20 May 2022 → 23 May 2022

Publication series

Name	Proceedings of SPIE - The International Society for Optical Engineering
Volume	12342
ISSN (Print)	0277-786X
ISSN (Electronic)	1996-756X

Conference

Conference	14th International Conference on Digital Image Processing, ICDIP 2022
Country/Territory	China
City	Wuhan
Period	20/05/22 → 23/05/22

Keywords

Audiovisual active speaker detection
Audiovisual fusion
BiGRU
MFCC
VGG-M

ASJC Scopus subject areas

Electronic, Optical and Magnetic Materials
Condensed Matter Physics
Computer Science Applications
Applied Mathematics
Electrical and Electronic Engineering

Access to Document

10.1117/12.2643881

Cite this

Tesema, F. B., Lin, Z., Zhu, S., Song, W., Gu, J., & Wu, H. (2022). End-To-End Audiovisual Feature Fusion for Active Speaker Detection. In X. Jiang, W. Tao, D. Zeng, & Y. Xie (Eds.), Fourteenth International Conference on Digital Image Processing, ICDIP 2022 Article 123422A (Proceedings of SPIE - The International Society for Optical Engineering; Vol. 12342). SPIE. https://doi.org/10.1117/12.2643881

@inproceedings{52f0f5ddc9ec4913a0bd3cfd33cda8f0,
title = "End-To-End Audiovisual Feature Fusion for Active Speaker Detection",
abstract = "Active speaker detection plays a vital role in human-machine interaction. Recently, a few end-to-end audiovisual frameworks have emerged. However, the inference time of these models has not been explored, and their complexity and large input size make them unsuitable for real-time applications. In addition, they share a similar feature extraction strategy that applies a ConvNet to both the audio and visual inputs. This work presents a novel two-stream end-to-end framework that fuses features extracted from images via VGG-M with raw Mel-frequency cepstral coefficient (MFCC) features extracted from the audio waveform. The network has two BiGRU layers attached to each stream to handle each stream's temporal dynamics before fusion. After fusion, one BiGRU layer is attached to model the joint temporal dynamics. Experimental results on the AVA-ActiveSpeaker dataset indicate that our new feature extraction strategy is more robust to noisy signals and has a lower inference time than models that employ a ConvNet on both modalities. The proposed model produces a prediction within 44.41 ms, which is fast enough for real-time applications. Our best-performing model attained 88.929% accuracy, nearly the same detection result as state-of-the-art work.",
keywords = "Audiovisual active speaker detection, Audiovisual fusion, BiGRU, MFCC, VGG-M",
author = "Tesema, {Fiseha B.} and Zheyuan Lin and Shiqiang Zhu and Wei Song and Jason Gu and Hong Wu",
note = "Publisher Copyright: {\textcopyright} 2022 SPIE.; 14th International Conference on Digital Image Processing, ICDIP 2022 ; Conference date: 20-05-2022 Through 23-05-2022",
year = "2022",
doi = "10.1117/12.2643881",
language = "English",
series = "Proceedings of SPIE - The International Society for Optical Engineering",
publisher = "SPIE",
editor = "Xudong Jiang and Wenbing Tao and Deze Zeng and Yi Xie",
booktitle = "Fourteenth International Conference on Digital Image Processing, ICDIP 2022",
address = "United States",
}

Tesema, FB, Lin, Z, Zhu, S, Song, W, Gu, J & Wu, H 2022, End-To-End Audiovisual Feature Fusion for Active Speaker Detection. in X Jiang, W Tao, D Zeng & Y Xie (eds), Fourteenth International Conference on Digital Image Processing, ICDIP 2022., 123422A, Proceedings of SPIE - The International Society for Optical Engineering, vol. 12342, SPIE, 14th International Conference on Digital Image Processing, ICDIP 2022, Wuhan, China, 20/05/22. https://doi.org/10.1117/12.2643881

End-To-End Audiovisual Feature Fusion for Active Speaker Detection. / Tesema, Fiseha B.; Lin, Zheyuan; Zhu, Shiqiang et al.
Fourteenth International Conference on Digital Image Processing, ICDIP 2022. ed. / Xudong Jiang; Wenbing Tao; Deze Zeng; Yi Xie. SPIE, 2022. 123422A (Proceedings of SPIE - The International Society for Optical Engineering; Vol. 12342).

TY - GEN
T1 - End-To-End Audiovisual Feature Fusion for Active Speaker Detection
AU - Tesema, Fiseha B.
AU - Lin, Zheyuan
AU - Zhu, Shiqiang
AU - Song, Wei
AU - Gu, Jason
AU - Wu, Hong
PY - 2022
Y1 - 2022
AB - Active speaker detection plays a vital role in human-machine interaction. Recently, a few end-to-end audiovisual frameworks have emerged. However, the inference time of these models has not been explored, and their complexity and large input size make them unsuitable for real-time applications. In addition, they share a similar feature extraction strategy that applies a ConvNet to both the audio and visual inputs. This work presents a novel two-stream end-to-end framework that fuses features extracted from images via VGG-M with raw Mel-frequency cepstral coefficient (MFCC) features extracted from the audio waveform. The network has two BiGRU layers attached to each stream to handle each stream's temporal dynamics before fusion. After fusion, one BiGRU layer is attached to model the joint temporal dynamics. Experimental results on the AVA-ActiveSpeaker dataset indicate that our new feature extraction strategy is more robust to noisy signals and has a lower inference time than models that employ a ConvNet on both modalities. The proposed model produces a prediction within 44.41 ms, which is fast enough for real-time applications. Our best-performing model attained 88.929% accuracy, nearly the same detection result as state-of-the-art work.
KW - Audiovisual active speaker detection
KW - Audiovisual fusion
KW - BiGRU
KW - MFCC
KW - VGG-M
UR - http://www.scopus.com/inward/record.url?scp=85141887120&partnerID=8YFLogxK
U2 - 10.1117/12.2643881
DO - 10.1117/12.2643881
M3 - Conference contribution
AN - SCOPUS:85141887120
T3 - Proceedings of SPIE - The International Society for Optical Engineering
BT - Fourteenth International Conference on Digital Image Processing, ICDIP 2022
A2 - Jiang, Xudong
A2 - Tao, Wenbing
A2 - Zeng, Deze
A2 - Xie, Yi
PB - SPIE
T2 - 14th International Conference on Digital Image Processing, ICDIP 2022
Y2 - 20 May 2022 through 23 May 2022
ER -
