Journal of Northeastern University(Natural Science) ›› 2026, Vol. 47 ›› Issue (1): 75-81.DOI: 10.12068/j.issn.1005-3026.2026.20250040

• Information & Control •

Video-Text Retrieval Method Based on Cross-Modal Attention Mechanism

Chuang DONG1, Wei LI1,2, Cong BA1, Wen-jun TAN1,2

  1. School of Computer Science & Engineering, Northeastern University, Shenyang 110819, China
    2. National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang 110819, China
  • Received:2025-04-22 Online:2026-01-15 Published:2026-03-17
  • Contact: Wei LI

Abstract:

Existing video-text retrieval methods fail to model temporal information and relevance information in a unified manner. To address this issue, a video-text retrieval method based on a cross-modal attention mechanism was proposed. First, embeddings of video frames and text were extracted with a large-scale pre-trained image-text model, leveraging knowledge transfer to alleviate the heterogeneity between modalities. Then, a joint text-frame cross-modal attention module was introduced to simultaneously encode temporal information among video frames and relevance information between video frames and the text, yielding more discriminative video representations. Finally, a cross-entropy loss function was used to constrain model training. Comparative experiments demonstrate that the proposed method effectively captures the temporal and relevance information of video frames, achieving competitive performance on the Microsoft Research Video to Text (MSR-VTT) and Large Scale Movie Description Challenge (LSMDC) datasets.
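The joint text-frame attention described above can be sketched as follows. This is a minimal, illustrative single-head attention pass in NumPy, not the authors' implementation: the text embedding is prepended to the frame sequence as an extra token so that one self-attention step captures both frame-to-frame (temporal) and frame-to-text (relevance) interactions before pooling into a video representation. All names (`joint_text_frame_attention`, the projection matrices `Wq`, `Wk`, `Wv`) and the mean-pooling choice are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_text_frame_attention(frame_emb, text_emb, Wq, Wk, Wv):
    """Fuse T frame embeddings (T, d) with one text embedding (d,).

    Prepending the text embedding as an extra token lets a single
    self-attention pass model both temporal (frame-frame) and
    relevance (frame-text) information, as the abstract describes.
    """
    tokens = np.vstack([text_emb[None, :], frame_emb])   # (1 + T, d)
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))                 # (1 + T, 1 + T)
    out = attn @ V                                       # attended tokens
    # Mean-pool the attended frame tokens into one video representation.
    return out[1:].mean(axis=0)

# Toy usage with random embeddings (dimension and frame count are arbitrary).
rng = np.random.default_rng(0)
d = 8
frames = rng.standard_normal((12, d))                    # 12 frame embeddings
text = rng.standard_normal(d)                            # 1 text embedding
W = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]
video_repr = joint_text_frame_attention(frames, text, *W)
print(video_repr.shape)  # (8,)
```

In practice the resulting video representation would be compared with text embeddings via a similarity score, and training would be constrained with the cross-entropy loss mentioned in the abstract.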

Key words: video-text retrieval, cross-modal, attention mechanism, knowledge transfer, video representation
