Real-time Architecture for Audio-Visual Active Speaker Detection

Min Huang, Wen Wang, Zheyuan Lin, Fiseha B. Tesema, Shanshan Ji, Jason Gu, Minhong Wan, Wei Song, Te Li, Shiqiang Zhu

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

Abstract

Continuously measuring the speaking state of users interacting with a robot in a human-robot interaction (HRI) system improves metrics of interaction quality. Meanwhile, mainstream active speaker detection (ASD) algorithms emphasize achieving high frame-level AUC on the AVA-Active Speaker dataset and pay less attention to real-time performance in robotic systems. In this paper, we propose a model named FSDNet that keeps a high AUC score on the AVA-Active Speaker dataset while reducing time cost: our model increases the AUC score by 0.1% compared with the state of the art and needs only 75% of the running time. Furthermore, we put forward an architecture with a time-related prediction function to make our algorithm more effective and generative in interactive robotic systems. The code is released at https://github.com/huangmin9966/FSDNet-RealTimeArch.

Original language: English
Title of host publication: 2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1377-1382
Number of pages: 6
ISBN (Electronic): 9781665481090
Publication status: Published - 2022
Externally published: Yes
Event: 2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022 - Jinghong, China
Duration: 5 Dec 2022 → 9 Dec 2022

Publication series

Name: 2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022

Conference

Conference: 2022 IEEE International Conference on Robotics and Biomimetics, ROBIO 2022
Country/Territory: China
City: Jinghong
Period: 5/12/22 → 9/12/22

ASJC Scopus subject areas

  • Artificial Intelligence
  • Aerospace Engineering
  • Automotive Engineering
  • Control and Optimization
  • Modelling and Simulation
