Traditional deep learning models struggle to capture long-range contextual correlations in input feature maps, as well as key feature information in the channel and spatial dimensions, resulting in high error rates and unsatisfactory performance in sound event localization and detection (SELD). Building on SELDnet, the baseline model of the acoustic scene classification and sound event detection challenge, this paper proposes a feature-enhanced sound event localization and detection network (FE-SELDnet). First, to address the failure of the activation function to backpropagate gradients, which leads to dying neurons, group normalization and the SiLU activation function are adopted. Second, the convolutional block attention module (CBAM) is introduced to capture salient features in both the channel and spatial dimensions of the acoustic features, suppressing superfluous features, improving the network's sensitivity and accuracy with respect to feature information, and improving information flow. Third, a Transformer module is introduced to capture longer-range speech context correlations and combine them with local features, improving the accuracy and robustness of the model on sound event detection and localization tasks. Experimental results on the TUT Sound Events dataset show that FE-SELDnet significantly outperforms the original baseline network: the error rate decreased from 0.45 to 0.326, the SED and DOA scores decreased from 0.45 and 0.32 to 0.26 and 0.25, respectively, and the F1 score increased to 79.4%. These results demonstrate the superiority of the proposed algorithm.
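To make the attention mechanism named above concrete, the following is a minimal NumPy sketch of CBAM-style channel and spatial attention with a SiLU activation. It is an illustrative toy under stated assumptions, not the authors' implementation: the weight shapes, the reduction ratio, and the use of SiLU in the bottleneck MLP are hypothetical choices here, and the learned 7x7 convolution of real CBAM is replaced by a fixed sigmoid over pooled maps.

```python
import numpy as np

def silu(x):
    # SiLU (swish): x * sigmoid(x); smooth, avoids hard-zeroed "dead" neurons
    return x / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    # x: (C, H, W). Pool over spatial dims, pass both pooled vectors
    # through a shared two-layer bottleneck MLP, then gate with a sigmoid.
    avg = x.mean(axis=(1, 2))                    # (C,)
    mx = x.max(axis=(1, 2))                     # (C,)
    def mlp(v):
        return w2 @ silu(w1 @ v)                 # SiLU bottleneck (assumption)
    att = 1.0 / (1.0 + np.exp(-(mlp(avg) + mlp(mx))))  # (C,), in (0, 1)
    return x * att[:, None, None]

def spatial_attention(x):
    # Channel-wise mean and max maps; a fixed sigmoid of their average
    # stands in for the learned 7x7 convolution of the real CBAM.
    avg = x.mean(axis=0)                         # (H, W)
    mx = x.max(axis=0)                          # (H, W)
    att = 1.0 / (1.0 + np.exp(-(avg + mx) / 2.0))  # (H, W), in (0, 1)
    return x * att[None, :, :]

def cbam(x, w1, w2):
    # CBAM order: channel attention first, then spatial attention.
    return spatial_attention(channel_attention(x, w1, w2))

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2                          # toy sizes, reduction ratio r
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1      # reduction layer weights
w2 = rng.standard_normal((C, C // r)) * 0.1      # expansion layer weights
y = cbam(x, w1, w2)
print(y.shape)                                   # (8, 4, 4): attention only rescales
```

Because both gates lie in (0, 1), the output keeps the input's shape and every feature is attenuated rather than amplified, which is the "suppress superfluous features" behavior the module is introduced for.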