Journal of Northeastern University (Natural Science) ›› 2026, Vol. 47 ›› Issue (1): 75-81. DOI: 10.12068/j.issn.1005-3026.2026.20250040

• Information & Control •

Video-Text Retrieval Method Based on Cross-Modal Attention Mechanism

Chuang DONG1, Wei LI1,2(), Cong BA1, Wen-jun TAN1,2

  1. School of Computer Science & Engineering, Northeastern University, Shenyang 110819, China
  2. National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang 110819, China
  • Received: 2025-04-22  Online: 2026-01-15  Published: 2026-03-17
  • Corresponding author: Wei LI
  • About the author: DONG Chuang (b. 1994), male, from Benxi, Liaoning; doctoral candidate, Northeastern University.
  • Supported by: the Program of Introducing Talents of Discipline to Universities (B16009)


Abstract:

Existing video-text retrieval methods fail to jointly model temporal information and relevance information in a unified manner. To address this issue, a video-text retrieval method based on a cross-modal attention mechanism was proposed. Firstly, embeddings of video frames and text were extracted with a large-scale pre-trained image-text model, and knowledge transfer was leveraged to alleviate the heterogeneity between the two modalities. Then, a joint text-frame cross-modal attention module was introduced to simultaneously encode temporal information among video frames and relevance information between video frames and text, capturing more competitive video representations. Finally, a cross-entropy loss function was used to constrain model training. Comparative experiments demonstrate that the proposed method effectively captures the temporal and relevance information of video frames, achieving competitive performance on the Microsoft Research Video to Text (MSR-VTT) and Large Scale Movie Description Challenge (LSMDC) datasets.
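The joint text-frame attention module described above can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name, the single-head scaled dot-product form, and the use of frame self-attention followed by text-conditioned pooling are all assumptions made for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_text_frame_attention(frame_emb, text_emb):
    """Sketch of a joint text-frame attention pool (hypothetical).

    frame_emb: (T, d) per-frame embeddings from a pretrained image-text model
    text_emb:  (d,)   sentence embedding from the same model

    Frames first attend to each other (temporal information), then the text
    query re-weights the contextualized frames (frame-text relevance),
    yielding a single video representation vector.
    """
    d = frame_emb.shape[1]
    # self-attention over frames captures temporal context
    scores = frame_emb @ frame_emb.T / np.sqrt(d)          # (T, T)
    frame_ctx = softmax(scores, axis=-1) @ frame_emb       # (T, d)
    # text-conditioned weights capture frame-text relevance
    w = softmax(frame_ctx @ text_emb / np.sqrt(d))         # (T,)
    return w @ frame_ctx                                   # (d,) video vector
```

In an actual model these projections would carry learned query/key/value weights and multiple heads; the sketch keeps only the two attention stages the abstract names.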

Key words: video-text retrieval, cross-modal, attention mechanism, knowledge transfer, video representation
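The abstract only names a cross-entropy loss for training. A common symmetric formulation over a batch of matched video-text pairs is sketched below; the symmetric two-direction form and the temperature parameter are assumptions, since the abstract does not specify them.

```python
import numpy as np

def _logsumexp(x, axis):
    # stable log-sum-exp, keeping the reduced axis for broadcasting
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def retrieval_cross_entropy(video_emb, text_emb, temperature=0.05):
    """Symmetric cross-entropy over a batch of matched video/text pairs.

    video_emb, text_emb: (B, d) L2-normalized embeddings; pair i matches i.
    Rows give video-to-text retrieval, columns text-to-video; the loss
    averages the negative log-probability of the matched (diagonal) pairs.
    """
    sims = video_emb @ text_emb.T / temperature            # (B, B) similarities
    log_p_v2t = sims - _logsumexp(sims, axis=1)            # softmax over texts
    log_p_t2v = sims - _logsumexp(sims, axis=0)            # softmax over videos
    diag = np.arange(sims.shape[0])
    return -0.5 * (log_p_v2t[diag, diag].mean() + log_p_t2v[diag, diag].mean())
```

Perfectly aligned pairs drive the loss toward zero; mismatched batches yield a strictly larger value, which is what pushes matched video and text embeddings together during training.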

CLC number: