TY - CONF
T1 - Polyphonic Sound Event Detection Using Capsule Neural Network on Multi-Type-Multi-Scale Time-Frequency Representation
AU - Jin, Wangkai
AU - Liu, Junyu
AU - Feng, Meili
AU - Ren, Jianfeng
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - The challenges of polyphonic sound event detection (PSED) stem from the detection of multiple overlapping events in a time series. Recent efforts exploit Deep Neural Networks (DNNs) on Time-Frequency Representations (TFRs) of audio clips as model inputs to address this challenge. However, existing solutions often rely on a single type of TFR, leading to under-utilization of the input features. To this end, we propose a novel PSED framework that incorporates Multi-Type-Multi-Scale TFRs. Our key insight is that TFRs of different types and at different scales reveal acoustic patterns in a complementary manner, so that overlapping events are best extracted by combining different TFRs. Moreover, our framework applies a novel approach to adaptively fuse different models and TFRs, significantly improving overall performance. We quantitatively examine the benefits of our framework using Capsule Neural Networks, a state-of-the-art approach for PSED. The experimental results show that our method achieves a 7% reduction in error rate compared with state-of-the-art solutions on the TUT-SED 2016 dataset.
AB - The challenges of polyphonic sound event detection (PSED) stem from the detection of multiple overlapping events in a time series. Recent efforts exploit Deep Neural Networks (DNNs) on Time-Frequency Representations (TFRs) of audio clips as model inputs to address this challenge. However, existing solutions often rely on a single type of TFR, leading to under-utilization of the input features. To this end, we propose a novel PSED framework that incorporates Multi-Type-Multi-Scale TFRs. Our key insight is that TFRs of different types and at different scales reveal acoustic patterns in a complementary manner, so that overlapping events are best extracted by combining different TFRs. Moreover, our framework applies a novel approach to adaptively fuse different models and TFRs, significantly improving overall performance. We quantitatively examine the benefits of our framework using Capsule Neural Networks, a state-of-the-art approach for PSED. The experimental results show that our method achieves a 7% reduction in error rate compared with state-of-the-art solutions on the TUT-SED 2016 dataset.
KW - capsule neural network
KW - polyphonic sound event detection
KW - time-frequency representation
UR - http://www.scopus.com/inward/record.url?scp=85136329114&partnerID=8YFLogxK
U2 - 10.1109/SEAI55746.2022.9832286
DO - 10.1109/SEAI55746.2022.9832286
M3 - Conference contribution
AN - SCOPUS:85136329114
T3 - 2022 2nd IEEE International Conference on Software Engineering and Artificial Intelligence, SEAI 2022
SP - 146
EP - 150
BT - 2022 2nd IEEE International Conference on Software Engineering and Artificial Intelligence, SEAI 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2nd IEEE International Conference on Software Engineering and Artificial Intelligence, SEAI 2022
Y2 - 10 June 2022 through 12 June 2022
ER -