Speech Emotion Recognition Based on Constrained Bi-channel Model

doi:10.12068/j.issn.1005-3026.2023.11.003

Abstract: To address the problem of insufficient speech features in speech emotion recognition, a constrained bi-channel model is proposed that exploits the emotional information contained in speech features from both global and local perspectives, thereby improving the emotion recognition rate. In channel 1, the gated recurrent unit (GRU) is introduced and improved to capture the global information of speech features, and a bidirectional attention gated recurrent unit (BAGRU) model is constructed to strengthen the correlation between speech features. In channel 2, a convolutional neural network is employed to capture the local information of speech features, and adversarial training is added to avoid mutual interference among local information. The bi-channel fusion model automatically assigns different weights to the channel features according to their importance, and an orthogonal constraint is introduced to reduce feature redundancy in the bi-channel fusion. Experimental results show that the proposed model achieves recognition accuracies of 62.83% and 82.19% on two common emotional corpora, IEMOCAP and EMO-DB, respectively. The constrained bi-channel model thus delivers improved performance on speech emotion recognition tasks.
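To make the idea of an orthogonal constraint concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state the exact form of the constraint, so the sketch assumes a common formulation: the squared Frobenius norm of the cross-correlation matrix between the two channels' feature batches. The function name `orthogonality_penalty` and the toy inputs are hypothetical.

```python
def orthogonality_penalty(h_global, h_local):
    """Squared Frobenius norm of h_global^T @ h_local (assumed form).

    Both inputs are batch-major lists of feature vectors, one row per
    utterance. A value of 0 means the two channels' features are
    uncorrelated across the batch; larger values mean more redundancy.
    """
    if len(h_global) != len(h_local):
        raise ValueError("both channels must cover the same batch")
    d_g = len(h_global[0])
    d_l = len(h_local[0])
    total = 0.0
    for i in range(d_g):
        for j in range(d_l):
            # Entry (i, j) of the cross-correlation matrix between channels.
            entry = sum(g[i] * l[j] for g, l in zip(h_global, h_local))
            total += entry * entry
    return total


# Hypothetical toy batches: two utterances, two features per channel.
orthogonal = orthogonality_penalty(
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
)
redundant = orthogonality_penalty(
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
)
```

Under this assumption, the penalty would be added to the classification loss with a weighting coefficient, so that training pushes the global and local channels toward complementary rather than duplicated features.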

Key words: speech emotion recognition; gated recurrent unit (GRU); convolutional neural network; orthogonal constraint

CLC Number: TN912

SUN Ying, LI Ze, ZHANG Xue-ying. Speech Emotion Recognition Based on Constrained Bi-channel Model[J]. Journal of Northeastern University (Natural Science), 2023, 44(11): 1537-1542.
