Speech Emotion Recognition Based on Constrained Bi-channel Model

doi:10.12068/j.issn.1005-3026.2023.11.003

Journal of Northeastern University (Natural Science) ›› 2023, Vol. 44 ›› Issue (11): 1537-1542. DOI: 10.12068/j.issn.1005-3026.2023.11.003

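SUN Ying, LI Ze, ZHANG Xue-ying

(College of Information and Computer, Taiyuan University of Technology, Taiyuan 030024, Shanxi, China)

Published: 2023-12-05
Corresponding author: SUN Ying
About the authors: SUN Ying (born 1981), female, from Taiyuan, Shanxi, associate professor at Taiyuan University of Technology; ZHANG Xue-ying (born 1964), female, from Shijiazhuang, Hebei, professor at Taiyuan University of Technology.
Supported by:
Natural Science Foundation of Shanxi Province (201901D111096); Shanxi Province Graduate Education Innovation Project (2021Y300).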

Speech Emotion Recognition Based on Constrained Bi-channel Model

SUN Ying, LI Ze, ZHANG Xue-ying

College of Information and Computer, Taiyuan University of Technology, Taiyuan 030024, China.

Published: 2023-12-05
Contact: ZHANG Xue-ying

Abstract

To address the problem of insufficient features in speech emotion recognition, a constrained bi-channel model is proposed that mines the emotional information contained in speech features from both global and local perspectives, thereby improving the recognition rate. Channel 1 targets the global information of speech features: the gated recurrent unit (GRU) is improved to build a BAGRU (bidirectional attention gate recurrent unit) model, which strengthens the correlation between speech features. Channel 2 targets the local information of speech features: a convolutional neural network is combined with adversarial training so that pieces of local information do not interfere with one another. The bi-channel fusion model generates different weights according to the importance of each channel's features, and an orthogonal constraint is introduced to address the feature redundancy that arises during fusion. Experimental results show that the proposed model achieves recognition accuracies of 62.83% and 82.19% on the IEMOCAP and EMO-DB emotional corpora, respectively, indicating good performance on speech emotion recognition tasks.

Key words: speech emotion recognition; gated recurrent unit (GRU); convolutional neural network; orthogonal constraint
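The abstract does not give the fusion formulas, so the following is only a minimal NumPy sketch of what importance-weighted fusion of two channels with an orthogonality penalty might look like. Everything in it is an assumption made for illustration: the function names, the scoring vector `w_score`, the use of a softmax to turn scores into channel weights, and the choice of the squared Frobenius norm of the cross-product between the two feature matrices as the penalty (a form used by difference losses such as the one in domain separation networks). It is not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)


def orthogonality_penalty(h_global, h_local):
    """Squared Frobenius norm of H_global^T @ H_local.

    Both inputs have shape (batch, dim). The value is zero when every
    feature column of one channel is orthogonal, over the batch, to every
    feature column of the other, and grows as the channels overlap.
    """
    cross = h_global.T @ h_local  # shape (dim, dim)
    return float(np.sum(cross ** 2))


def weighted_fusion(h_global, h_local, w_score):
    """Fuse two channels with per-sample importance weights.

    w_score is a hypothetical scoring vector of shape (dim,) that maps
    each channel's features to a scalar importance score; in a trained
    model it would be a learned parameter.
    """
    scores = np.stack([h_global @ w_score, h_local @ w_score], axis=1)  # (batch, 2)
    weights = softmax(scores, axis=1)  # each row sums to 1
    fused = weights[:, :1] * h_global + weights[:, 1:] * h_local
    return fused, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h_g = rng.normal(size=(4, 8))  # stand-in for global-channel features
    h_l = rng.normal(size=(4, 8))  # stand-in for local-channel features
    w = rng.normal(size=8)         # stand-in for a learned scoring vector
    fused, weights = weighted_fusion(h_g, h_l, w)
    print("fused shape:", fused.shape)
    print("channel weights per sample:\n", weights)
    print("orthogonality penalty:", orthogonality_penalty(h_g, h_l))
```

In a training loop, a penalty of this kind would typically be added to the classification loss with a small coefficient, so that the two channels are pushed toward carrying complementary rather than redundant information.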

CLC number: TN912

SUN Ying, LI Ze, ZHANG Xue-ying. Speech Emotion Recognition Based on Constrained Bi-channel Model[J]. Journal of Northeastern University (Natural Science), 2023, 44(11): 1537-1542.

