Journal of Northeastern University (Natural Science) ›› 2023, Vol. 44 ›› Issue (11): 1537-1542. DOI: 10.12068/j.issn.1005-3026.2023.11.003

• Information & Control •

  • Corresponding author: SUN Ying
  • About the authors: SUN Ying (b. 1981), female, from Taiyuan, Shanxi, is an associate professor at Taiyuan University of Technology; ZHANG Xue-ying (b. 1964), female, from Shijiazhuang, Hebei, is a professor at Taiyuan University of Technology.
  • Supported by:
    Natural Science Foundation of Shanxi Province (201901D111096); Graduate Education Innovation Project of Shanxi Province (2021Y300).

Speech Emotion Recognition Based on Constrained Bi-channel Model

SUN Ying, LI Ze, ZHANG Xue-ying   

  1. College of Information and Computer, Taiyuan University of Technology, Taiyuan 030024, China.
  • Published: 2023-12-05
  • Contact: ZHANG Xue-ying



Abstract: To address the problem of insufficient feature exploitation in speech emotion recognition, a constrained bi-channel model is proposed to fully mine the emotional information contained in speech features from both global and local perspectives, thereby improving the emotion recognition rate. Channel 1 targets the global information of speech features: the gated recurrent unit (GRU) is improved into a BAGRU (bidirectional attention gated recurrent unit) model, which strengthens the correlation between speech features. Channel 2 targets the local information of speech features: a convolutional neural network is combined with adversarial training to avoid mutual interference among local features. A bi-channel fusion model then generates different weights according to the importance of each channel's features, and an orthogonal constraint is introduced to resolve the feature redundancy that arises during fusion. Experimental results show that the proposed model achieves recognition accuracies of 62.83% and 82.19% on the IEMOCAP and EMO-DB emotion corpora, respectively, demonstrating good performance on speech emotion recognition tasks.
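The two fusion ingredients named in the abstract, importance-weighted channel fusion and the orthogonal constraint, can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the softmax-over-norm weighting and the squared-Frobenius form of the penalty are assumptions chosen for clarity.

```python
import numpy as np

def fuse_channels(h_global, h_local):
    # Stand-in for learned importance weighting: score each channel by its
    # feature-vector norm, softmax the two scores into weights, and
    # concatenate the weighted channel features.
    scores = np.array([np.linalg.norm(h_global), np.linalg.norm(h_local)])
    weights = np.exp(scores) / np.exp(scores).sum()
    return np.concatenate([weights[0] * h_global, weights[1] * h_local])

def orthogonal_penalty(H_global, H_local):
    # Squared Frobenius norm of H_g^T H_l, where rows index frames and
    # columns index feature dimensions. It is zero exactly when every
    # global feature dimension is orthogonal to every local one, so adding
    # this term to the training loss discourages redundant features.
    return float(np.linalg.norm(H_global.T @ H_local, ord="fro") ** 2)

# Two frames of 2-D features per channel, deliberately orthogonal:
H_g = np.array([[1.0, 2.0],
                [0.0, 0.0]])  # global channel lives in frame-direction 0
H_l = np.array([[0.0, 0.0],
                [3.0, 1.0]])  # local channel lives in frame-direction 1
print(orthogonal_penalty(H_g, H_l))           # 0.0 for orthogonal channels
print(fuse_channels(np.ones(3), np.ones(3)))  # equal norms -> all 0.5
```

With equal-norm inputs the softmax assigns each channel weight 0.5, and the penalty vanishes only when the channels carry non-overlapping feature directions, which is the redundancy-suppression effect the abstract attributes to the orthogonal constraint.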

Key words: speech emotion recognition; gated recurrent unit (GRU); convolutional neural network; orthogonal constraint
