Abstract
Audio-visual event (AVE) localization aims to detect whether an event exists in each video segment and to predict its category. An event is recognized as an AVE only when it is both audible and visible. However, the information carried by the auditory and visual modalities is sometimes asymmetrical within a video sequence, which leads to incorrect predictions. To address this challenge, we introduce a dynamic interactive learning network that dynamically explores intra- and inter-modal relationships, conditioning each modality on the other for better AVE localization. Specifically, our approach includes a dynamic intra- and inter-modal fusion attention module, which enables the auditory and visual modalities to attend more to regions the other modality deems informative and less to regions the other modality regards as noise. In addition, we introduce an audio-visual difference loss that reduces the distance between the auditory and visual representations. Extensive experiments on the AVE dataset demonstrate the superior performance of the proposed method. The source code will be available at https://github.com/hanliang/DILN.
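The abstract names two components: a cross-modal attention module in which each modality re-weights its features under the guidance of the other, and a difference loss that pulls the audio and visual representations together. The paper's exact formulation is not given here, so the following is only a minimal PyTorch-style sketch under simple assumptions: dot-product cross-attention for the guidance step and a mean-squared distance on normalized features for the difference loss. All names (`DynamicFusionAttention`, `av_difference_loss`, the feature dimension, etc.) are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicFusionAttention(nn.Module):
    """Illustrative cross-modal attention: each modality's segment features are
    re-weighted using attention scores computed from the other modality."""

    def __init__(self, dim=256):
        super().__init__()
        self.audio_query = nn.Linear(dim, dim)
        self.visual_query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, audio, visual):
        # audio, visual: (batch, segments, dim)
        scale = audio.size(-1) ** 0.5
        # Visual-guided attention over audio segments.
        a_scores = torch.softmax(
            self.visual_query(visual) @ self.key(audio).transpose(1, 2) / scale, dim=-1
        )
        audio_attended = a_scores @ audio
        # Audio-guided attention over visual segments.
        v_scores = torch.softmax(
            self.audio_query(audio) @ self.key(visual).transpose(1, 2) / scale, dim=-1
        )
        visual_attended = v_scores @ visual
        return audio_attended, visual_attended


def av_difference_loss(audio, visual):
    """Toy audio-visual difference loss: mean squared distance between
    L2-normalized per-segment audio and visual representations."""
    return F.mse_loss(F.normalize(audio, dim=-1), F.normalize(visual, dim=-1))


# Usage sketch: 4 videos, 10 one-second segments, 256-d features per segment.
fusion = DynamicFusionAttention(dim=256)
audio = torch.randn(4, 10, 256)
visual = torch.randn(4, 10, 256)
a_out, v_out = fusion(audio, visual)
loss = av_difference_loss(a_out, v_out)
```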
Original language | English |
---|---|
Pages (from-to) | 30431-30442 |
Number of pages | 12 |
Journal | Applied Intelligence |
Volume | 53 |
Issue number | 24 |
DOIs | |
Publication status | Published - Dec 2023 |
Externally published | Yes |
Keywords
- Attention mechanism
- Audio-visual event localization
- Difference loss
- Dynamic fusion
ASJC Scopus subject areas
- Artificial Intelligence