%0 Journal Article
%T Video Action Recognition Based on Spatiotemporal Sampling
%A 王冠
%A 彭梦昊
%A 陶应诚
%A 徐浩
%A 景圣恩
%J Artificial Intelligence and Robotics Research
%P 300-312
%@ 2326-3423
%D 2024
%I Hans Publishing
%R 10.12677/airr.2024.132032
%X Video features contain temporally and spatially redundant information captured while an action is performed. This information is unrelated to the action category; it interferes with recognition and leads to misclassification. This paper proposes a video action recognition model based on spatiotemporal sampling. The model consists of key frame sampling and a video Transformer with token sampling. Key frame sampling quantifies the pixel differences between adjacent frames to identify key frames that contain significant changes, accumulates the update probabilities of multiple consecutive frames to handle potentially long intervals between two key frames, and introduces a trainable sampling probability threshold to binarize the update probability, which strengthens the modeling of key frames. This process ensures that the key information in the video is captured. Because different tokens differ in their importance to the recognition task, the spatiotemporal Transformer blocks adopt a data-dependent token sampling strategy that reduces the number of tokens layer by layer, which effectively reduces spatially redundant information and also lowers the computational cost of the model. Finally, a fully connected layer performs the action classification. Experiments are conducted on the ActivityNet-v1.3 and Mini-Kinetics datasets. The results show that the proposed method matches the accuracy of existing action recognition methods while requiring less computation.
%K Video Action Recognition
%K Spatio-Temporal Sampling
%K Video Transformer
%U http://www.hanspub.org/journal/PaperInformation.aspx?PaperID=87430