数据与计算发展前沿 ›› 2024, Vol. 6 ›› Issue (5): 66-79.

CSTR: 32002.14.jfdc.CN10-1649/TP.2024.05.007

doi: 10.11871/jfdc.issn.2096-742X.2024.05.007


基于主动感知机制的视频行为识别方法研究

晏直誉1(),茹一伟2,孙福鹏3,孙哲南2,*()   

  1. 北京理工大学,计算机学院,北京 102488
    2.中国科学院自动化研究所,多模态人工智能系统全国重点实验室,北京 100190
    3.北京理工大学,数学与统计学院,北京 102488
  • 收稿日期:2023-08-09 出版日期:2024-10-20 发布日期:2024-10-21
  • 通讯作者: * 孙哲南(E-mail: znsun@nlpr.ia.ac.cn)
  • 作者简介:晏直誉,北京理工大学计算机学院,硕士研究生,主要研究方向为行为识别。
    本文中负责处理实验和论文撰写。
    YAN Zhiyu, master's student at the School of Computer Science, Beijing Institute of Technology. His main research interests include video behavior recognition.
    In this paper, he is responsible for conducting experiments and writing the paper.
    E-mail: yanzhiyu@bit.edu.cn
    孙哲南,中国科学院自动化研究所,研究员,主要研究方向为生物特征识别、模式识别、计算机视觉。
    本文中负责总体思路指导与论文修改。
    SUN Zhenan, corresponding author, is a researcher at the Institute of Automation, Chinese Academy of Sciences. His main research interests include biometric recognition, pattern recognition, and computer vision.
    In this paper, he is responsible for the guidance and revision of this paper.
    E-mail: znsun@nlpr.ia.ac.cn
  • 基金资助:
    国家自然科学基金面上项目“人脸识别深度特征模型的可解释性研究与应用”(62276263)

Research on Video Behavior Recognition Method with Active Perception Mechanism

YAN Zhiyu1(),RU Yiwei2,SUN Fupeng3,SUN Zhenan2,*()   

  1. School of Computer Science, Beijing Institute of Technology, Beijing 102488, China
    2. State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
    3. School of Mathematics and Statistics, Beijing Institute of Technology, Beijing 102488, China
  • Received:2023-08-09 Online:2024-10-20 Published:2024-10-21

摘要:

【目的】在视频行为识别领域,如何有效关注视频帧中的重要区域并充分利用时空信息是一个重要的研究课题。【方法】本文提出了一种主动感知机制(APM),能够主动感知视频中的关键区域。该方法采用了一种基于时空多尺度注意机制的新型网络模型,建立了一个“审视-浏览”网络。审视分支和浏览分支各自嵌入了多尺度视觉Transformer结构,使模型在感知重要区域时具备自注意力主动性,并在数据处理的每个阶段具备时空多尺度主动性。为了在保持帧间信息一致性的同时进行数据增广以提高鲁棒性,进一步引入了多重双随机数据增强方法来实现样本扩增和数据增强。【结果】在Kinetics-400和Kinetics-600大规模人体行为识别基准数据集上,本文设计的方法取得了有竞争力的结果。

关键词: 行为识别, 自注意力机制, 深度学习, 视频, Transformer
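
The abstract above mentions a multiple dual-random augmentation that expands the training data while keeping inter-frame information consistent. The full method is not detailed on this page, so the following is only a minimal sketch of the underlying idea, assuming the key constraint is that random crop and flip parameters are drawn once per clip and then applied identically to every frame. The function name and the 112-pixel crop size are illustrative, not from the paper.

```python
import numpy as np

def consistent_clip_augment(clip, rng, crop=112):
    """Crop and (possibly) flip a clip with parameters drawn once per clip.

    Sampling the random offsets a single time and reusing them for every
    frame preserves inter-frame consistency while still diversifying data.
    clip: float array of shape (T, H, W, C).
    """
    T, H, W, C = clip.shape
    top = int(rng.integers(0, H - crop + 1))   # one offset shared by all frames
    left = int(rng.integers(0, W - crop + 1))
    out = clip[:, top:top + crop, left:left + crop, :]
    if rng.random() < 0.5:                     # one coin flip per clip
        out = out[:, :, ::-1, :]               # horizontal flip, clip-wide
    return out

rng = np.random.default_rng(0)
clip = rng.random((8, 128, 128, 3))            # 8 frames of 128x128 RGB
aug = consistent_clip_augment(clip, rng)
print(aug.shape)  # (8, 112, 112, 3)
```

Drawing the parameters per clip rather than per frame is what distinguishes this from ordinary image augmentation: every frame sees the same spatial transform, so motion cues across frames remain aligned.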

Abstract:

[Purpose] In the field of video behavior recognition, how to effectively focus on important regions in video frames and make full use of spatiotemporal information is a significant research issue. [Methods] This paper proposes an Active Perception Mechanism (APM) that actively perceives crucial regions in videos. Specifically, the method employs a novel network model based on a spatiotemporal multi-scale attention mechanism, establishing a “scrutinizing-browsing” network. The scrutinizing and browsing branches each embed a Multiscale Vision Transformer structure, giving the model self-attention initiative when perceiving important regions and spatiotemporal multi-scale initiative at each stage of data processing. To maintain inter-frame consistency while augmenting the data to improve robustness, we introduce a multiple dual-random data augmentation method for sample expansion and data enhancement. [Results] On the large-scale human behavior recognition benchmarks Kinetics-400 and Kinetics-600, the proposed method achieves competitive results.

Key words: action recognition, self-attention mechanism, deep learning, video, Transformer
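
The “scrutinizing-browsing” design pairs a fine-grained attention branch with a coarser one, each built on multiscale self-attention. The sketch below is not the paper's model; it only illustrates, under stated assumptions, how two branches can attend over the same patch tokens at different scales, with the browsing branch operating on average-pooled tokens, and how their outputs can be fused into one descriptor. All dimensions, the shared weights, and the single-head formulation are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over token rows."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def pool_tokens(tokens, stride):
    """Average-pool the token sequence to a coarser scale, loosely
    mimicking the pooling used in Multiscale Vision Transformers."""
    n = (tokens.shape[0] // stride) * stride
    return tokens[:n].reshape(-1, stride, tokens.shape[1]).mean(axis=1)

rng = np.random.default_rng(0)
d = 16
tokens = rng.standard_normal((32, d))            # 32 spatiotemporal patch tokens
Wq, Wk, Wv = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]

fine = self_attention(tokens, Wq, Wk, Wv)        # "scrutinizing": all tokens
coarse = self_attention(pool_tokens(tokens, 4),  # "browsing": pooled tokens
                        Wq, Wk, Wv)
descriptor = np.concatenate([fine.mean(axis=0), coarse.mean(axis=0)])
print(descriptor.shape)  # (32,)
```

The attention weights themselves decide which tokens matter, which is the sense in which such a model perceives important regions "actively"; running the same attention at two token granularities is one plausible reading of the spatiotemporal multi-scale design described in the abstract.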