Addressee detection using facial and audio features in mixed human–human and human–robot settings: a deep learning framework

Fiseha Berhanu Tesema; Jason Gu; Wei Song; Hong Wu; Shiqiang Zhu; Zheyuan Lin; Min Huang; Wen Wang; Rajesh Kumar

doi:10.1109/MSMC.2022.3224843

School of Computer Science

Research output: Journal Publication › Article › peer-review

Abstract

Addressee detection (AD) enables a robot to interact smoothly with humans by determining whether it is being addressed. However, the problem has not been widely explored. The few studies in this area focused on human-to-human or human-to-robot conversations confined to a meeting room, using gaze and utterance. These works used statistical and rule-based approaches, which tend to depend on specific settings. Further, they did not fully leverage the available audio and visual information or the short-term and long-term segments, and they did not explore combining important conversation cues, namely facial and audio features. In addition, no audiovisual, spatiotemporally annotated dataset captured in mixed human-to-human and human-to-robot settings is available to support exploring the area with new approaches.

Original language	English
Article number	22959594
Pages (from-to)	25-38
Journal	IEEE Systems, Man, and Cybernetics Magazine
Volume	9
Issue number	2
DOIs	https://doi.org/10.1109/MSMC.2022.3224843
Publication status	Published - 18 Apr 2023

Keywords

Deep learning
Visualization
Annotations
Input variables
Human-robot interaction
Oral communication
Predictive models

Cite this

@article{348735d4bd2e458e856cad1beb457d83,

title = "Addressee detection using facial and audio features in mixed human–human and human–robot settings: a deep learning framework",

abstract = "Addressee detection (AD) enables robots to interact smoothly with a human by distinguishing whether it is being addressed. However, this has not been widely explored. The few studies that have explored this area focused on a human-to-human or human-to-robot conversation confined inside a meeting room using gaze and utterance. These works used statistical and rule-based approaches, which tend to depend on specific settings. Further, they did not fully leverage the available audio and visual information or the short-term and long-term segments, and they have not explored combining important conversation cues—the facial and audio features. In addition, no audiovisual spatiotemporal annotated dataset captured in mixed human-to-human and human-to-robot settings is available to support exploring the area using new approaches.",

keywords = "Deep learning, Visualization, Annotations, Input variables, Human-robot interaction, Oral communication, Predictive models",

author = "Tesema, {Fiseha Berhanu} and Jason Gu and Wei Song and Hong Wu and Shiqiang Zhu and Zheyuan Lin and Min Huang and Wen Wang and Rajesh Kumar",

year = "2023",

month = apr,

day = "18",

doi = "10.1109/MSMC.2022.3224843",

language = "English",

volume = "9",

pages = "25--38",

journal = "IEEE Systems, Man, and Cybernetics Magazine",

issn = "2333-942X",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "2",

}

TY - JOUR

T1 - Addressee detection using facial and audio features in mixed human–human and human–robot settings: a deep learning framework

AU - Tesema, Fiseha Berhanu

AU - Gu, Jason

AU - Song, Wei

AU - Wu, Hong

AU - Zhu, Shiqiang

AU - Lin, Zheyuan

AU - Huang, Min

AU - Wang, Wen

AU - Kumar, Rajesh

PY - 2023/4/18

Y1 - 2023/4/18

N2 - Addressee detection (AD) enables robots to interact smoothly with a human by distinguishing whether it is being addressed. However, this has not been widely explored. The few studies that have explored this area focused on a human-to-human or human-to-robot conversation confined inside a meeting room using gaze and utterance. These works used statistical and rule-based approaches, which tend to depend on specific settings. Further, they did not fully leverage the available audio and visual information or the short-term and long-term segments, and they have not explored combining important conversation cues—the facial and audio features. In addition, no audiovisual spatiotemporal annotated dataset captured in mixed human-to-human and human-to-robot settings is available to support exploring the area using new approaches.

AB - Addressee detection (AD) enables robots to interact smoothly with a human by distinguishing whether it is being addressed. However, this has not been widely explored. The few studies that have explored this area focused on a human-to-human or human-to-robot conversation confined inside a meeting room using gaze and utterance. These works used statistical and rule-based approaches, which tend to depend on specific settings. Further, they did not fully leverage the available audio and visual information or the short-term and long-term segments, and they have not explored combining important conversation cues—the facial and audio features. In addition, no audiovisual spatiotemporal annotated dataset captured in mixed human-to-human and human-to-robot settings is available to support exploring the area using new approaches.

KW - Deep learning

KW - Visualization

KW - Annotations

KW - Input variables

KW - Human-robot interaction

KW - Oral communication

KW - Predictive models

U2 - 10.1109/MSMC.2022.3224843

DO - 10.1109/MSMC.2022.3224843

M3 - Article

SN - 2333-942X

VL - 9

SP - 25

EP - 38

JO - IEEE Systems, Man, and Cybernetics Magazine

JF - IEEE Systems, Man, and Cybernetics Magazine

IS - 2

M1 - 22959594

ER -
