An overview of video captioning methods based on deep learning
Chang Zhi (常志); Zhao Dexin (赵德新)
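Abstract:
With breakthroughs in deep learning for computer vision and natural language processing, cross-modal research on image captioning and video captioning has grown steadily. Because video carries temporal structure and its content is diverse and complex, video captioning is more challenging than image captioning. Video captioning methods fall into two categories: template-based methods and encoder-decoder methods. This paper focuses on encoder-decoder methods that use deep learning. It first analyzes and compares the evolution of model architectures, then categorizes and summarizes existing methods. It next introduces several influential datasets and evaluation metrics, and finally summarizes the key open problems and research challenges.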
Keywords: deep learning; video captioning; encoder-decoder
Foundation: National Natural Science Foundation of China (61202169)
Author: Chang Zhi (常志); Zhao Dexin (赵德新)