Journal of Northeastern University (Natural Science) ›› 2026, Vol. 47 ›› Issue (1): 42-51. DOI: 10.12068/j.issn.1005-3026.2026.20250067

• Smart Healthcare Column •

  • About the author: QI Lin (b. 1981), male, from Changchun, Jilin; associate professor at Northeastern University.
  • Funding: Key Research and Development Program of Liaoning Province (2024JH2/102500076)

Non-contact Estimation Method of Blood Oxygen Saturation Based on Facial Videos

Lin QI1,2,3, Qi-he GAO1, Shu-yue GUAN1, Yong-chun LI4()   

  1. College of Medicine and Biological Information Engineering, Northeastern University, Shenyang 110169, China
    2. Key Laboratory of Medical Image Computing, Ministry of Education, Northeastern University, Shenyang 110169, China
    3. Engineering Research Center of Medical Imaging and Intelligent Analysis, Ministry of Education, Northeastern University, Shenyang 110169, China
    4. Shenyang Contain Electronic Technology Co., Ltd., Shenyang 110167, China
  • Received: 2025-06-12; Online: 2026-01-15; Published: 2026-03-17
  • Contact: Yong-chun LI


Abstract:

To address the challenges of inadequate spatio-temporal feature modeling and poor robustness in complex scenarios for non-contact blood oxygen saturation (SpO2) measurement using remote photoplethysmography (rPPG), a trend-aware spatio-temporal fusion network (TAST-Net) was proposed. The network adopts an innovative dual-branch fusion architecture that synergistically fuses local physiological features extracted by a 3D convolutional neural network (3D CNN) branch with global spatio-temporal dependencies captured by a video vision transformer (ViViT) branch. To enhance the model's sensitivity to signal dynamics, a weighted composite loss function combining mean squared error (MSE) and Pearson correlation loss was designed. Experimental results on two public datasets demonstrate the superior performance of TAST-Net. On the pulse rate estimation (PURE) dataset, it achieves a root mean squared error (eRMS) of 0.53%, a mean absolute error (eMA) of 0.37%, and a Pearson correlation coefficient (R) of 0.96. On the more challenging visual information processing and learning-heart rate (VIPL-HR) dataset, the eRMS, eMA, and R reach 0.84%, 0.57%, and 0.82, respectively, outperforming other compared methods. These findings indicate that TAST-Net provides an effective solution for accurate and robust SpO2 estimation from facial videos and validates the advantage of integrating local and global features in rPPG signal processing.

Key words: remote photoplethysmography, deep learning, non-contact, blood oxygen saturation estimation, facial video
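The abstract names the composite training objective only at a high level: a weighted combination of mean squared error and a Pearson correlation loss, where the MSE term penalizes absolute SpO2 error and the correlation term penalizes mismatch in the predicted signal's trend. A minimal NumPy sketch of such an objective is shown below; the convex weight `alpha`, the epsilon stabilizer, and the exact form `1 − r` for the correlation term are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def pearson_corr(pred, target):
    """Pearson correlation coefficient between two 1-D signals."""
    pred = pred - pred.mean()
    target = target - target.mean()
    denom = np.sqrt((pred ** 2).sum() * (target ** 2).sum())
    # Small epsilon guards against division by zero for constant signals.
    return (pred * target).sum() / (denom + 1e-8)

def composite_loss(pred, target, alpha=0.5):
    """Weighted sum of MSE and (1 - Pearson r).

    MSE penalizes absolute SpO2 error; the (1 - r) term penalizes
    disagreement in the signal's temporal trend. `alpha` is a
    hypothetical weighting hyperparameter.
    """
    mse = ((pred - target) ** 2).mean()
    corr_loss = 1.0 - pearson_corr(pred, target)
    return alpha * mse + (1.0 - alpha) * corr_loss
```

With this form, a prediction that tracks the trend perfectly but carries a constant offset is penalized only through the MSE term, which is the kind of trend sensitivity the loss is meant to encourage.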

CLC number: