Real-time Architecture for Audio-Visual Active Speaker Detection

Min Huang; Wen Wang; Zheyuan Lin; Fiseha B. Tesema; Shanshan Ji; Jason Gu; Minhong Wan; Wei Song; Te Li; Shiqiang Zhu

doi:10.1109/ROBIO55434.2022.10011692

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

Abstract

Continuously measuring the speaking state of users interacting with a robot in a human-robot interaction (HRI) system improves interaction-quality metrics. However, mainstream active speaker detection (ASD) algorithms emphasize achieving high frame-level AUC on the AVA-Active Speaker dataset and pay less attention to real-time performance in robotic systems. In this paper, we propose a model named FSDNet that maintains a high AUC score on the AVA-Active Speaker dataset while reducing time cost: our model increases the AUC score by 0.1% compared with the state of the art and needs only 75% of the running time. Furthermore, we put forward an architecture with a time-related prediction function to make our algorithm more effective and generative in interactive robotic systems. The code is released at https://github.com/huangmin9966/FSDNet-RealTimeArch.

Original language	English
Title of host publication	2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	1377-1382
Number of pages	6
ISBN (Electronic)	9781665481090
DOIs	https://doi.org/10.1109/ROBIO55434.2022.10011692
Publication status	Published - 2022
Externally published	Yes
Event	2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022 - Jinghong, China; Duration: 5 Dec 2022 → 9 Dec 2022

Publication series

Name	2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022

Conference

Conference	2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022
Country/Territory	China
City	Jinghong
Period	5/12/22 → 9/12/22

ASJC Scopus subject areas

Artificial Intelligence
Aerospace Engineering
Automotive Engineering
Control and Optimization
Modelling and Simulation

Access to Document

10.1109/ROBIO55434.2022.10011692

Cite this

Huang, M., Wang, W., Lin, Z., Tesema, F. B., Ji, S., Gu, J., Wan, M., Song, W., Li, T., & Zhu, S. (2022). Real-time Architecture for Audio-Visual Active Speaker Detection. In 2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022 (pp. 1377-1382). (2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ROBIO55434.2022.10011692

@inproceedings{28fe4826f3a240cab8af2f155df5d6c1,

title = "Real-time Architecture for Audio-Visual Active Speaker Detection",

abstract = "Continuously measuring the speaking state of users interacting with a robot in a human-robot interaction (HRI) system improves interaction-quality metrics. However, mainstream active speaker detection (ASD) algorithms emphasize achieving high frame-level AUC on the AVA-Active Speaker dataset and pay less attention to real-time performance in robotic systems. In this paper, we propose a model named FSDNet that maintains a high AUC score on the AVA-Active Speaker dataset while reducing time cost: our model increases the AUC score by 0.1% compared with the state of the art and needs only 75% of the running time. Furthermore, we put forward an architecture with a time-related prediction function to make our algorithm more effective and generative in interactive robotic systems. The code is released at https://github.com/huangmin9966/FSDNet-RealTimeArch.",

author = "Min Huang and Wen Wang and Zheyuan Lin and Tesema, {Fiseha B.} and Shanshan Ji and Jason Gu and Minhong Wan and Wei Song and Te Li and Shiqiang Zhu",

note = "Publisher Copyright: {\textcopyright} 2022 IEEE.; 2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022 ; Conference date: 05-12-2022 Through 09-12-2022",

year = "2022",

doi = "10.1109/ROBIO55434.2022.10011692",

language = "English",

series = "2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "1377--1382",

booktitle = "2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022",

address = "United States",

}

Huang, M, Wang, W, Lin, Z, Tesema, FB, Ji, S, Gu, J, Wan, M, Song, W, Li, T & Zhu, S 2022, Real-time Architecture for Audio-Visual Active Speaker Detection. in 2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022. 2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022, Institute of Electrical and Electronics Engineers Inc., pp. 1377-1382, 2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022, Jinghong, China, 5/12/22. https://doi.org/10.1109/ROBIO55434.2022.10011692

Real-time Architecture for Audio-Visual Active Speaker Detection. / Huang, Min; Wang, Wen; Lin, Zheyuan et al.
2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022. Institute of Electrical and Electronics Engineers Inc., 2022. p. 1377-1382 (2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022).

TY - GEN

T1 - Real-time Architecture for Audio-Visual Active Speaker Detection

AU - Huang, Min

AU - Wang, Wen

AU - Lin, Zheyuan

AU - Tesema, Fiseha B.

AU - Ji, Shanshan

AU - Gu, Jason

AU - Wan, Minhong

AU - Song, Wei

AU - Li, Te

AU - Zhu, Shiqiang

PY - 2022

Y1 - 2022

AB - Continuously measuring the speaking state of users interacting with a robot in a human-robot interaction (HRI) system improves interaction-quality metrics. However, mainstream active speaker detection (ASD) algorithms emphasize achieving high frame-level AUC on the AVA-Active Speaker dataset and pay less attention to real-time performance in robotic systems. In this paper, we propose a model named FSDNet that maintains a high AUC score on the AVA-Active Speaker dataset while reducing time cost: our model increases the AUC score by 0.1% compared with the state of the art and needs only 75% of the running time. Furthermore, we put forward an architecture with a time-related prediction function to make our algorithm more effective and generative in interactive robotic systems. The code is released at https://github.com/huangmin9966/FSDNet-RealTimeArch.

UR - http://www.scopus.com/inward/record.url?scp=85147334252&partnerID=8YFLogxK

U2 - 10.1109/ROBIO55434.2022.10011692

DO - 10.1109/ROBIO55434.2022.10011692

M3 - Conference contribution

AN - SCOPUS:85147334252

T3 - 2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022

SP - 1377

EP - 1382

BT - 2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022

Y2 - 5 December 2022 through 9 December 2022

ER -

Huang M, Wang W, Lin Z, Tesema FB, Ji S, Gu J et al. Real-time Architecture for Audio-Visual Active Speaker Detection. In 2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022. Institute of Electrical and Electronics Engineers Inc. 2022. p. 1377-1382. (2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022). doi: 10.1109/ROBIO55434.2022.10011692
