Journal of Northeastern University (Natural Science) ›› 2026, Vol. 47 ›› Issue (1): 75-81. DOI: 10.12068/j.issn.1005-3026.2026.20250040

• Information & Control •

Video-Text Retrieval Method Based on Cross-Modal Attention Mechanism

Chuang DONG1, Wei LI1,2(), Cong BA1, Wen-jun TAN1,2

  1. School of Computer Science & Engineering, Northeastern University, Shenyang 110819, China
  2. National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang 110819, China
  • Received: 2025-04-22  Online: 2026-01-15  Published: 2026-03-17
  • Corresponding author: Wei LI
  • About the author: DONG Chuang (b. 1994), male, from Benxi, Liaoning; doctoral candidate, Northeastern University.
  • Supported by: the Program of Introducing Talents of Discipline to Universities (B16009)


Abstract:

Existing video-text retrieval methods fail to jointly model temporal information and relevance information in a unified manner. To address this issue, a video-text retrieval method based on a cross-modal attention mechanism was proposed. Firstly, embeddings of video frames and text were extracted with a large-scale pre-trained image-text model, and knowledge transfer was leveraged to alleviate the heterogeneity between the two modalities. Then, a joint text-frame cross-modal attention module was introduced to simultaneously encode temporal information among video frames and relevance information between video frames and text, capturing more competitive video representations. Finally, a cross-entropy loss function was used to constrain model training. Comparative experiments demonstrate that the proposed method effectively captures the temporal and relevance information of video frames, achieving competitive performance on the Microsoft Research Video to Text (MSR-VTT) and Large Scale Movie Description Challenge (LSMDC) datasets.
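The joint text-frame attention module described above can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name, the single-head scaled dot-product form, and the use of frame self-attention followed by text-conditioned pooling are all assumptions made for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_text_frame_attention(frame_emb, text_emb):
    """Sketch of a joint text-frame attention pool (hypothetical).

    frame_emb: (T, d) per-frame embeddings from a pretrained image-text model
    text_emb:  (d,)   sentence embedding from the same model

    Frames first attend to each other (temporal information), then the text
    query re-weights the contextualized frames (frame-text relevance),
    yielding a single video representation vector.
    """
    d = frame_emb.shape[1]
    # self-attention over frames captures temporal context
    scores = frame_emb @ frame_emb.T / np.sqrt(d)          # (T, T)
    frame_ctx = softmax(scores, axis=-1) @ frame_emb       # (T, d)
    # text-conditioned weights capture frame-text relevance
    w = softmax(frame_ctx @ text_emb / np.sqrt(d))         # (T,)
    return w @ frame_ctx                                   # (d,) video vector
```

In an actual model these projections would carry learned query/key/value weights and multiple heads; the sketch keeps only the two attention stages the abstract names.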

Key words: video-text retrieval, cross-modal, attention mechanism, knowledge transfer, video representation
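The abstract only names a cross-entropy loss for training. A common symmetric formulation over a batch of matched video-text pairs is sketched below; the symmetric two-direction form and the temperature parameter are assumptions, since the abstract does not specify them.

```python
import numpy as np

def _logsumexp(x, axis):
    # stable log-sum-exp, keeping the reduced axis for broadcasting
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def retrieval_cross_entropy(video_emb, text_emb, temperature=0.05):
    """Symmetric cross-entropy over a batch of matched video/text pairs.

    video_emb, text_emb: (B, d) L2-normalized embeddings; pair i matches i.
    Rows give video-to-text retrieval, columns text-to-video; the loss
    averages the negative log-probability of the matched (diagonal) pairs.
    """
    sims = video_emb @ text_emb.T / temperature            # (B, B) similarities
    log_p_v2t = sims - _logsumexp(sims, axis=1)            # softmax over texts
    log_p_t2v = sims - _logsumexp(sims, axis=0)            # softmax over videos
    diag = np.arange(sims.shape[0])
    return -0.5 * (log_p_v2t[diag, diag].mean() + log_p_t2v[diag, diag].mean())
```

Perfectly aligned pairs drive the loss toward zero; mismatched batches yield a strictly larger value, which is what pushes matched video and text embeddings together during training.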

CLC number: