
End-to-End Video Gaze Target Detection with Spatial-Temporal Transformers

DOI: 10.12677/jisp.2024.132017, PP. 190-209

Keywords: Gaze Target Detection, Transformer, Deformable Attention, Temporal Variation Modeling


Abstract:

Gaze target detection aims to localize the target that a person is looking at. HGTTR introduced the Transformer architecture to this task, removing the additional head detector required by convolutional neural networks, enabling end-to-end joint detection of head locations and gaze targets, and outperforming traditional convolutional approaches. However, current methods still leave considerable room for improvement on video datasets. The reason is that they focus on learning a person's gaze target within a single video frame and do not model temporal variation across frames, so they cannot handle dynamic gaze, out-of-focus frames, or motion blur. When a person's gaze target keeps changing, the lack of temporal modeling can cause the localized gaze target to deviate from the person's true gaze target; likewise, without modeling along the temporal dimension, the model cannot recover the features lost to defocus and motion blur. In this work, we propose an end-to-end video gaze target detection model based on spatial-temporal Transformers. First, we propose an inter-frame local deformable attention mechanism to handle missing features. Second, building on deformable attention, we propose an inter-frame deformable attention mechanism that exploits the temporal differences between adjacent video frames to dynamically select sampling points, thereby modeling dynamic gaze. Finally, we propose a temporal Transformer that aggregates the gaze relation queries and gaze relation features of the current frame and the reference frames. Our temporal Transformer consists of three parts: a temporal gaze relation feature encoder that encodes multi-frame spatial information, a temporal gaze relation query encoder that fuses gaze relation queries, and a temporal gaze relation decoder that produces the detection results for the current frame. By modeling space within a single frame, relations between adjacent frames, and the frame sequence as a whole, the method effectively addresses the dynamic gaze, defocus, and motion blur that are common in video data. Extensive experiments show that our method achieves strong performance on both the VideoAttentionTarget and VideoCoAtt datasets.
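The core idea available from this page is the inter-frame deformable attention: sampling points in a reference frame are chosen from the temporal difference between current-frame queries and reference-frame content. The sketch below is a minimal, single-head illustration of that idea in PyTorch, not the authors' implementation; the module name, offset scale, and tensor shapes are assumptions for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class InterFrameDeformableAttention(nn.Module):
    """Single-head sketch: sampling offsets and weights are predicted from the
    temporal difference between current-frame queries and reference-frame content."""

    def __init__(self, dim=256, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offset_head = nn.Linear(dim, num_points * 2)   # 2-D offset per sampling point
        self.weight_head = nn.Linear(dim, num_points)       # attention weight per point
        self.value_proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query, ref_points, ref_feat):
        # query:      (B, Nq, C)   current-frame gaze relation queries
        # ref_points: (B, Nq, 2)   normalized reference points in [0, 1]
        # ref_feat:   (B, C, H, W) adjacent (reference) frame feature map
        B, Nq, C = query.shape
        grid = ref_points.view(B, Nq, 1, 2) * 2 - 1                      # to [-1, 1] for grid_sample
        ref_ctx = F.grid_sample(ref_feat, grid, align_corners=False)     # (B, C, Nq, 1)
        ref_ctx = ref_ctx.squeeze(-1).permute(0, 2, 1)                   # (B, Nq, C)

        # The temporal difference drives where to sample in the reference frame.
        diff = query - ref_ctx
        offsets = self.offset_head(diff).view(B, Nq, self.num_points, 2)
        weights = self.weight_head(diff).softmax(dim=-1)                 # (B, Nq, P)

        # Gather projected values at the dynamically selected points.
        loc = (ref_points.unsqueeze(2) + 0.05 * offsets).clamp(0, 1)     # 0.05: assumed offset scale
        vals = F.grid_sample(self.value_proj(ref_feat), loc * 2 - 1,
                             align_corners=False)                        # (B, C, Nq, P)
        out = (vals * weights.unsqueeze(1)).sum(dim=-1)                  # (B, C, Nq)
        return self.out_proj(out.permute(0, 2, 1))                       # (B, Nq, C)

Per the abstract, this inter-frame attention is only one component: the full model further aggregates queries and features across the current and reference frames with a temporal Transformer made of a feature encoder, a query encoder, and a decoder.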

References

[1]  Judd, T., Ehinger, K., Durand, F., et al. (2009) Learning to Predict Where Humans Look. 2009 IEEE 12th International Conference on Computer Vision, Kyoto, 29 September-02 October 2009, 2106-2113.
https://doi.org/10.1109/ICCV.2009.5459462
[2]  Recasens, A., Khosla, A., Vondrick, C., et al. (2015) Where Are They Looking? Advances in Neural Information Processing Systems, 28, 199-207.
[3]  Chong, E., Ruiz, N., Wang, Y., et al. (2018) Connecting Gaze, Scene, and Attention: Generalized Attention Estimation via Joint Modeling of Gaze and Scene Saliency. In: Ferrari, V., Hebert, M., Sminchisescu, C. and Weiss, Y., Eds., Computer Vision - ECCV 2018, Lecture Notes in Computer Science, Vol. 11209, Springer, Cham, 383-398.
https://doi.org/10.1007/978-3-030-01228-1_24
[4]  Bao, J., Liu, B. and Yu, J. (2022) ESCNet: Gaze Target Detection with the Understanding of 3D Scenes. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, 18-24 June 2022, 14126-14135.
https://doi.org/10.1109/CVPR52688.2022.01373
[5]  Chong, E., Wang, Y., Ruiz, N., et al. (2020) Detecting Attended Visual Targets in Video. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, 13-19 June 2020, 5396-5406.
https://doi.org/10.1109/CVPR42600.2020.00544
[6]  Lian, D., Yu, Z. and Gao, S. (2018) Believe It or Not, We Know What You Are Looking at! In: Jawahar, C., Li, H., Mori, G. and Schindler, K., Eds., Computer Vision - ACCV 2018, Lecture Notes in Computer Science, Vol. 11363, Springer, Cham, 35-50.
https://doi.org/10.1007/978-3-030-20893-6_3
[7]  Recasens, A., Vondrick, C., Khosla, A., et al. (2017) Following Gaze in Video. Proceedings of the IEEE International Conference on Computer Vision, Venice, 22-29 October 2017, 1435-1443.
https://doi.org/10.1109/ICCV.2017.160
[8]  Fan, L., Chen, Y., Wei, P., et al. (2018) Inferring Shared Attention in Social Scene Videos. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 18-23 June 2018, 6460-6468.
https://doi.org/10.1109/CVPR.2018.00676
[9]  Zhou, Q., Li, X., He, L., et al. (2022) TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 7853-7869.
https://doi.org/10.1109/TPAMI.2022.3223955
[10]  Dai, J., Qi, H., Xiong, Y., et al. (2017) Deformable Convolutional Networks. Proceedings of the IEEE International Conference on Computer Vision, Venice, 22-29 October 2017, 764-773.
https://doi.org/10.1109/iccv.2017.89
[11]  Miao, Q., Hoai, M. and Samaras, D. (2023) Patch-Level Gaze Distribution Prediction for Gaze Following. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, 2-7 January 2023, 880-889.
https://doi.org/10.1109/WACV56688.2023.00094
[12]  Fang, Y., Tang, J., Shen, W., et al. (2021) Dual Attention Guided Gaze Target Detection in the Wild. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, 20-25 June 2021, 11390-11399.
https://doi.org/10.1109/CVPR46437.2021.01123
[13]  Jin, T., Yu, Q., Zhu, S., et al. (2022) Depth-Aware Gaze-Following via Auxiliary Networks for Robotics. Engineering Applications of Artificial Intelligence, 113, Article 104924.
https://doi.org/10.1016/j.engappai.2022.104924
[14]  Tu, D., Min, X., Duan, H., et al. (2022) End-to-End Human-Gaze-Target Detection with Transformers. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, 18-24 June 2022, 2192-2200.
https://doi.org/10.1109/CVPR52688.2022.00224
[15]  Tonini, F., Dall’Asen, N., Beyan, C., et al. (2023) Object-Aware Gaze Target Detection. Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, 1-6 October 2023, 21860-21869.
https://doi.org/10.1109/ICCV51070.2023.01998
[16]  Tonini, F., Beyan, C. and Ricci, E. (2022) Multimodal across Domains Gaze Target Detection. Proceedings of the 2022 International Conference on Multimodal Interaction, Bengaluru, 7-11 November 2022, 420-431.
https://doi.org/10.1145/3536221.3556624
[17]  Long, F., Qiu, Z., Pan, Y., et al. (2022) Stand-Alone Inter-Frame Attention in Video Models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, 18-24 June 2022, 3192-3201.
https://doi.org/10.1109/CVPR52688.2022.00319
[18]  Zhu, X., Su, W., Lu, L., et al. (2020) Deformable DETR: Deformable Transformers for End-to-End Object Detection.
[19]  Saran, A., Majumdar, S., Short, E.S., et al. (2018) Human Gaze Following for Human-Robot Interaction. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, 1-5 October 2018, 8615-8621.
https://doi.org/10.1109/IROS.2018.8593580
[20]  Vaswani, A., Shazeer, N., Parmar, N., et al. (2017) Attention Is All You Need. Advances in Neural Information Processing Systems, 30, 5998-6008.
[21]  Tian, Y., Wang, Y., Wang, J., et al. (2022) Key Issues in Vision Transformer Research: Current Status and Prospects. Acta Automatica Sinica, 48, 957-979.
[22]  Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020) An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.
[23]  Carion, N., Massa, F., Synnaeve, G., et al. (2020) End-to-End Object Detection with Transformers. In: Vedaldi, A., Bischof, H., Brox, T. and Frahm, J.M., Eds., Computer Vision - ECCV 2020, Lecture Notes in Computer Science, Vol. 12346, Springer, Cham, 213-229.
https://doi.org/10.1007/978-3-030-58452-8_13
[24]  Cheng, Y. and Lu, F. (2022) Gaze Estimation Using Transformer. 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, 21-25 August 2022, 3341-3347.
https://doi.org/10.1109/ICPR56361.2022.9956687
[25]  He, K., Zhang, X., Ren, S., et al. (2016) Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 27-30 June 2016, 770-778.
https://doi.org/10.1109/CVPR.2016.90
[26]  Glorot, X. and Bengio, Y. (2010) Understanding the Difficulty of Training Deep Feedforward Neural Networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 9, 249-256.
[27]  Pan, J., Sayrol, E., Giro-i-Nieto, X., et al. (2016) Shallow and Deep Convolutional Networks for Saliency Prediction. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 27-30 June 2016, 598-606.
https://doi.org/10.1109/CVPR.2016.71
