A Speech Emotion Recognition Model Based on Gated Recurrent Units (GRU) and Multi-Head Attention Mechanism

DOI: 10.12677/airr.2024.132038, PP. 363-374

Keywords: Speech Emotion Recognition (SER), Gated Recurrent Units (GRU), Multi-Head Attention Mechanism, Bi-GRU, Deep Learning


Abstract:

This study proposes a speech emotion recognition model based on Gated Recurrent Units (GRU) and a multi-head attention mechanism. With the advancement of artificial intelligence and affective computing, the model is designed to analyze the emotional information carried in speech signals and identify a speaker's emotional state, covering expressions such as joy, anger, and sadness. The technology has broad application prospects in affective computing, intelligent customer service, and human-computer interaction. By combining the GRU's capacity for modeling temporal information with the multi-head attention mechanism's focus on salient features, an effective and accurate speech emotion recognition model is constructed. Experimental results show that the model achieves unweighted accuracies of 81.04% on the IEMOCAP dataset and 94.93% on the Emo-DB dataset, a significant improvement over existing models. The model also exhibits good generalization and scalability, providing reliable technical support for intelligent speech interaction, affective computing, and related fields.
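
To make the described architecture concrete, the following is a minimal sketch in PyTorch of a Bi-GRU combined with multi-head self-attention for utterance-level emotion classification. The input feature type (40-dimensional MFCC frames), layer sizes, number of attention heads, and four emotion classes are illustrative assumptions; the paper's exact feature pipeline and hyperparameters are not reproduced here.

import torch
import torch.nn as nn

class GRUAttentionSER(nn.Module):
    """Bi-GRU + multi-head self-attention speech emotion classifier (illustrative sketch)."""
    def __init__(self, n_features=40, hidden=128, n_heads=4, n_classes=4):
        super().__init__()
        # Bidirectional GRU models temporal context in both directions.
        self.bigru = nn.GRU(n_features, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        # Multi-head self-attention re-weights GRU outputs so that
        # emotionally salient frames contribute more to the utterance representation.
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden, num_heads=n_heads,
                                          batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                  # x: (batch, frames, n_features)
        seq, _ = self.bigru(x)             # (batch, frames, 2*hidden)
        ctx, _ = self.attn(seq, seq, seq)  # self-attention over the time axis
        utt = ctx.mean(dim=1)              # average-pool to an utterance-level vector
        return self.classifier(utt)        # emotion logits

# Example: a batch of 8 utterances, each 300 frames of 40-dimensional features.
model = GRUAttentionSER()
logits = model(torch.randn(8, 300, 40))
print(logits.shape)  # torch.Size([8, 4])

In practice, variable-length utterances would require padding masks for both the GRU and the attention layer, and attention-weighted pooling could replace the simple mean; those details are omitted from this sketch.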

