Abstract
Audio-visual event (AVE) localization aims to detect whether an event exists in each video segment and to predict its category. An event is recognized as an AVE only when it is both audible and visible. However, the information carried by the auditory and visual modalities is sometimes asymmetrical within a video sequence, which leads to incorrect predictions. To address this challenge, we introduce a dynamic interactive learning network that dynamically explores intra- and inter-modal relationships, conditioning each modality on the other for better AVE localization. Specifically, our approach includes a dynamic intra- and inter-modal fusion attention module, which enables the auditory and visual modalities to attend more to regions the other modality deems informative and less to regions the other modality regards as noise. In addition, we introduce an audio-visual difference loss that reduces the distance between the auditory and visual representations. Extensive experiments on the AVE dataset demonstrate the superior performance of the proposed method. The source code will be available at https://github.com/hanliang/DILN.
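The abstract names two components: a cross-modal attention module in which each modality re-weights its features under the guidance of the other, and a difference loss that pulls the audio and visual representations together. The paper's exact formulation is not given here, so the following is only a minimal PyTorch-style sketch under simple assumptions: dot-product cross-attention for the guidance step and a mean-squared distance on normalized features for the difference loss. All names (`DynamicFusionAttention`, `av_difference_loss`, the feature dimension, etc.) are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicFusionAttention(nn.Module):
    """Illustrative cross-modal attention: each modality's segment features are
    re-weighted using attention scores computed from the other modality."""

    def __init__(self, dim=256):
        super().__init__()
        self.audio_query = nn.Linear(dim, dim)
        self.visual_query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, audio, visual):
        # audio, visual: (batch, segments, dim)
        scale = audio.size(-1) ** 0.5
        # Visual-guided attention over audio segments.
        a_scores = torch.softmax(
            self.visual_query(visual) @ self.key(audio).transpose(1, 2) / scale, dim=-1
        )
        audio_attended = a_scores @ audio
        # Audio-guided attention over visual segments.
        v_scores = torch.softmax(
            self.audio_query(audio) @ self.key(visual).transpose(1, 2) / scale, dim=-1
        )
        visual_attended = v_scores @ visual
        return audio_attended, visual_attended


def av_difference_loss(audio, visual):
    """Toy audio-visual difference loss: mean squared distance between
    L2-normalized per-segment audio and visual representations."""
    return F.mse_loss(F.normalize(audio, dim=-1), F.normalize(visual, dim=-1))


# Usage sketch: 4 videos, 10 one-second segments, 256-d features per segment.
fusion = DynamicFusionAttention(dim=256)
audio = torch.randn(4, 10, 256)
visual = torch.randn(4, 10, 256)
a_out, v_out = fusion(audio, visual)
loss = av_difference_loss(a_out, v_out)
```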
Original language | English |
---|---|
Pages (from-to) | 30431-30442 |
Number of pages | 12 |
Journal | Applied Intelligence |
Volume | 53 |
Issue number | 24 |
DOIs | |
Publication status | Published - Dec 2023 |
Externally published | Yes |
Keywords
- Attention mechanism
- Audio-visual event localization
- Difference loss
- Dynamic fusion
ASJC Scopus subject areas
- Artificial Intelligence