Frontiers of Data and Computing ›› 2024, Vol. 6 ›› Issue (5): 66-79.
CSTR: 32002.14.jfdc.CN10-1649/TP.2024.05.007
doi: 10.11871/jfdc.issn.2096-742X.2024.05.007
YAN Zhiyu¹, RU Yiwei², SUN Fupeng³, SUN Zhenan²,*
Received: 2023-08-09
Online: 2024-10-20
Published: 2024-10-21
YAN Zhiyu, RU Yiwei, SUN Fupeng, SUN Zhenan. Research on Video Behavior Recognition Method with Active Perception Mechanism[J]. Frontiers of Data and Computing, 2024, 6(5): 66-79, https://cstr.cn/32002.14.jfdc.CN10-1649/TP.2024.05.007.
Table 2. Comparison of performance with existing algorithms on the Kinetics-400 dataset

| Model | Pre-training | Top-1 | Top-5 | GFLOPs×views | Params |
|---|---|---|---|---|---|
| I3D[21] | ImageNet | 72.1 | 90.3 | 108×N/A | 12M |
| Two-Stream I3D[21] | ImageNet | 75.7 | 92.0 | 216×N/A | 25M |
| MF-Net[50] | ImageNet | 72.8 | 90.4 | 11.1×50 | 8.0M |
| TSM R50[51] | ImageNet | 74.7 | N/A | 65×10 | 24.3M |
| Nonlocal R50[52] | ImageNet | 76.5 | 92.6 | 282×30 | 35.3M |
| Nonlocal R101[52] | ImageNet | 77.7 | 93.3 | 359×30 | 54.3M |
| AGPN[18] | ImageNet | 77.7 | 93.4 | - | 27.6M |
| Two-Stream I3D[21] | - | 71.6 | 90.0 | 216×N/A | 25M |
| R(2+1)D[22] | - | 72.0 | 90.0 | 152×115 | 63.6M |
| Two-Stream R(2+1)D[22] | - | 73.9 | 90.9 | 304×115 | 127.2M |
| Oct-I3D+NL[53] | - | 75.7 | N/A | 28.9×30 | 33.6M |
| ip-CSN-152[54] | - | 77.8 | 92.8 | 109×30 | 32.8M |
| X3D-M[55] | - | 76.0 | 92.3 | 6.2×30 | 3.8M |
| X3D-XL[55] | - | 79.1 | 93.9 | 48.4×30 | 11.0M |
| SlowFast 4×16, R50[4] | - | 75.6 | 92.1 | 36.1×30 | 34.4M |
| SlowFast 8×8, R101[4] | - | 77.9 | 93.2 | 106×30 | 53.7M |
| SlowFast 8×8, R101+NL[4] | - | 78.7 | 93.5 | 116×30 | 59.9M |
| SlowFast 16×8, R101+NL[4] | - | 79.8 | 93.9 | 234×30 | 59.9M |
| MViTv1, 16×4[48] | - | 78.4 | 93.5 | 70.3×5 | 36.6M |
| MViTv1, 32×3[48] | - | 80.2 | 94.4 | 170×5 | 36.6M |
| MViTv2-S, 16×4[11] | - | 81.0 | 94.6 | 64×5 | 34.5M |
| MViTv2-B, 32×3[11] | - | 82.9 | 95.7 | 225×5 | 51.2M |
| APM-1-K400 (Ours) | - | 82.5 | 94.3 | 57.6×5 | 31.1M |
| APM-2-K400 (Ours) | - | 83.1 | 95.1 | 177×5 | 46.2M |
| APM-3-K400 (Ours) | - | 83.2 | 95.8 | 215×5 | 58.6M |
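A note on the cost column, following the usual convention in this literature (our reading, not spelled out in the table): "GFLOPs×views" reports the per-view inference cost times the number of spatiotemporal views (temporal clips × spatial crops) averaged at test time, so the end-to-end cost of classifying one video is their product. For the APM-3-K400 row:

$$215\ \text{GFLOPs/view} \times 5\ \text{views} = 1075\ \text{GFLOPs per video}$$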
Table 3. Comparison of performance with existing algorithms on the Kinetics-600 dataset

| Model | Pre-training | Top-1 | Top-5 | GFLOPs×views | Params |
|---|---|---|---|---|---|
| I3D[21] | - | 71.9 | 90.1 | 108×N/A | 12M |
| SlowFast 16×8, R101+NL[4] | - | 81.8 | 95.1 | 234×30 | 59.9M |
| X3D-M[55] | - | 78.8 | 94.5 | 6.2×30 | 3.8M |
| X3D-XL[55] | - | 81.9 | 95.5 | 48.4×30 | 11.0M |
| MoViNet-A6[58] | - | 84.8 | 96.5 | 386×1 | 31.4M |
| ViT-B-TimeSformer[6] | ImageNet | 82.4 | 96.0 | 1703×3 | 121.4M |
| ViT-L-ViViT[5] | ImageNet | 83.0 | 95.7 | 3992×12 | 310.8M |
| Swin-B[59] | ImageNet | 84.0 | 96.5 | 282×12 | 88.1M |
| Swin-L[59] | ImageNet | 86.1 | 97.3 | 2107×50 | 200.0M |
| MViTv1, 16×4[48] | - | 82.1 | 95.5 | 70.5×5 | 36.8M |
| MViTv1, 32×3[48] | - | 83.4 | 96.3 | 170×5 | 36.8M |
| MViTv2-B, 32×3[11] | - | 85.5 | 97.2 | 206×5 | 51.4M |
| APM-1-K600 (Ours) | - | 82.4 | 95.6 | 58.3×5 | 32.4M |
| APM-2-K600 (Ours) | - | 83.3 | 96.6 | 178×5 | 46.8M |
| APM-3-K600 (Ours) | - | 85.8 | 97.5 | 215.5×5 | 59.4M |
Table 4. Top-1 accuracy using different fusion strategies on the Kinetics-400 dataset

| α \ β | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 77.61 | 76.44 | 75.28 | 74.12 | 72.95 | 71.79 | 70.62 | 69.46 | 68.30 |
| 0.2 | 78.41 | 77.56 | 76.72 | 75.87 | 75.02 | 74.17 | 73.33 | 72.48 | 71.63 |
| 0.3 | 79.05 | 78.53 | 78.00 | 77.47 | 76.94 | 76.42 | 75.89 | 75.36 | 74.84 |
| 0.4 | 80.49 | 80.02 | 79.54 | 79.07 | 78.59 | 78.11 | 77.64 | 77.16 | 76.69 |
| 0.5 | 79.83 | 79.62 | 79.42 | 79.21 | 79.01 | 78.80 | 78.60 | 78.40 | 78.19 |
| 0.6 | 80.11 | 80.05 | 79.99 | 79.92 | 79.86 | 79.80 | 79.74 | 79.68 | 79.62 |
| 0.7 | 80.09 | 80.15 | 80.22 | 80.29 | 80.35 | 80.42 | 80.49 | 80.55 | 80.62 |
| 0.8 | 80.32 | 80.48 | 80.64 | 80.80 | 80.96 | 81.12 | 81.28 | 81.44 | 81.60 |
| 0.9 | 80.47 | 80.72 | 80.98 | 81.23 | 81.48 | 81.73 | 81.99 | 82.24 | 82.49 |
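Table 4's sweep indicates the two branches are combined by late fusion with weights α and β, and Table 5 evaluates which pairs of branch structures fuse best. Below is a minimal sketch of such a fusion, assuming a weighted sum of the two branches' per-view softmax scores; the paper's exact rule is not reproduced here, and `fuse_scores` plus all variable names are illustrative:

```python
import numpy as np

def fuse_scores(scores_a: np.ndarray, scores_b: np.ndarray,
                alpha: float, beta: float) -> int:
    """Weighted late fusion of two branches' class scores.

    scores_a, scores_b: (num_views, num_classes) softmax outputs of the
    two branches; alpha, beta: the fusion weights swept in Table 4.
    Illustrative sketch only -- the paper's exact fusion rule may differ.
    """
    fused = alpha * scores_a + beta * scores_b  # per-view fused scores
    return int(fused.mean(axis=0).argmax())     # average views, pick class

# Hypothetical usage: 5 test views over the 400 Kinetics-400 classes
rng = np.random.default_rng(0)
scores_a = rng.dirichlet(np.ones(400), size=5)  # stand-in branch-1 softmax
scores_b = rng.dirichlet(np.ones(400), size=5)  # stand-in branch-2 softmax
print(fuse_scores(scores_a, scores_b, alpha=0.9, beta=0.9))
```

One caveat on this reading: if the prediction were a pure argmax of this weighted sum, only the ratio α:β would matter, yet Table 4 varies along proportional (α, β) pairs, so the weights presumably enter before a non-linearity or interact with other parts of the model.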
Table 5. Results of the two branches using different structures on the Kinetics-400 dataset

| Model (branch pair) |  | Top-1 | Top-5 |
|---|---|---|---|
| MViTv2 32×3, MViTv2 16×4 | 0.3 | 83.11 | 95.07 |
| MViTv2 32×3, MViTv1 32×3 | 0.21 | 83.14 | 95.73 |
| MViTv2 32×3, SlowFast 4×16, R50 | 0.02 | 82.98 | 95.64 |
| MViTv2 32×3, SlowFast 8×8, R50 | 0.07 | 83.11 | 95.52 |
| MViTv2 16×4, SlowFast 4×16, R50 | 0.06 | 81.21 | 94.73 |
| MViTv2 16×4, SlowFast 8×8, R50 | 0.09 | 81.47 | 94.75 |
| SlowFast 4×16, R50, SlowFast 8×8, R50 | 0.49 | 77.16 | 92.39 |
[1] | SUN Z, KE Q, RAHMANI H, et al. Human action recognition from various data modalities: A review[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45(3): 3200-3225. |
[2] | LIU T, MA Y, YANG W, et al. Spatial-temporal interaction learning based two-stream network for action recognition[J]. Information Sciences, 2022, 606: 864-876. |
[3] | HUANG Y, GUO Y, GAO C. Efficient parallel inflated 3D convolution architecture for action recognition[J]. IEEE Access, 2020, 8: 45753-45765. |
[4] | FEICHTENHOFER C, FAN H, MALIK J, et al. SlowFast networks for video recognition[C]// Proceedings of the IEEE/CVF international conference on computer vision, 2019: 6202-6211. |
[5] | ARNAB A, DEHGHANI M, HEIGOLD G, et al. ViViT: A video vision transformer[C]// Proceedings of the IEEE/CVF international conference on computer vision, 2021: 6836-6846. |
[6] | BERTASIUS G, WANG H, TORRESANI L. Is space-time attention all you need for video understanding?[C]// ICML, 2021, 2(3): 4. |
[7] | ZHENG Z, AN G, WU D, et al. Global and local knowledge-aware attention network for action recognition[J]. IEEE Transactions on Neural Networks and Learning Systems, 2020, 32(1): 334-347. |
[8] | CHEN J, HO C M. MM-ViT: Multi-modal video transformer for compressed video action recognition[C]// Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2022: 1910-1921. |
[9] | ZHU Y, LAN Z, NEWSAM S, et al. Hidden two-stream convolutional networks for action recognition[C]// Computer Vision-ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2-6, 2018, Revised Selected Papers, Part III 14. Springer International Publishing, 2019: 363-378. |
[10] | WEI D, TIAN Y, WEI L, et al. Efficient dual attention slowfast networks for video action recognition[J]. Computer Vision and Image Understanding, 2022, 222: 103484. |
[11] | LI Y, WU C Y, FAN H, et al. MViTv2: Improved Multiscale Vision Transformers for Classification and Detection[C]// Computer Vision and Pattern Recognition, 2022: 4794-4804. |
[12] | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[J]. arXiv preprint arXiv:2010.11929, 2020. |
[13] | SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[J]. Advances in neural information processing systems, 2014, 1: 568-576. |
[14] | WANG L, XIONG Y, WANG Z, et al. Temporal segment networks: Towards good practices for deep action recognition[C]// European conference on computer vision, Springer, Cham, 2016: 20-36. |
[15] | GIRDHAR R, RAMANAN D, GUPTA A, et al. Actionvlad: Learning spatio-temporal aggregation for action classification[C]// Proceedings of the IEEE conference on computer vision and pattern recognition, 2017: 971-980. |
[16] | DIBA A, SHARMA V, VAN GOOL L. Deep temporal linear encoding networks[C]// Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2017: 2329-2338. |
[17] | FEICHTENHOFER C, PINZ A, WILDES R P. Spatiotemporal multiplier networks for video action recognition[C]// Proceedings of the IEEE conference on computer vision and pattern recognition, 2017: 4768-4777. |
[18] | CHEN Y, GE H, LIU Y, et al. AGPN: Action Granularity Pyramid Network for Video Action Recognition[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023. |
[19] | BACCOUCHE M, MAMALET F, WOLF C, et al. Sequential deep learning for human action recognition[C]// Human Behavior Understanding: Second International Workshop, HBU 2011, Amsterdam, The Netherlands, November 16, 2011. Proceedings 2. Springer Berlin Heidelberg, 2011: 29-39. |
[20] | TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3d convolutional networks[C]// Proceedings of the IEEE international conference on computer vision, 2015: 4489-4497. |
[21] | CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? a new model and the kinetics dataset[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 6299-6308. |
[22] | TRAN D, WANG H, TORRESANI L, et al. A closer look at spatiotemporal convolutions for action recognition[C]// Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018: 6450-6459. |
[23] | ZHOU Y, SUN X, ZHA Z J, et al. Mict: Mixed 3d/2d convolutional tube for human action recognition[C]// Proceedings of the IEEE conference on computer vision and pattern recognition, 2018: 449-458. |
[24] | DU W, WANG Y, QIAO Y. Rpan: An end-to-end recurrent pose-attention network for action recognition in videos[C]// Proceedings of the IEEE international conference on computer vision, 2017: 3725-3734. |
[25] | SUN L, JIA K, CHEN K, et al. Lattice long short-term memory for human action recognition[C]// Proceedings of the IEEE international conference on computer vision, 2017: 2147-2156. |
[26] | PERRETT T, DAMEN D. DDLSTM: dual-domain LSTM for cross-dataset action recognition[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 7852-7861. |
[27] | MENG Y, LIN C C, PANDA R, et al. Ar-net: Adaptive frame resolution for efficient action recognition[C]//Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part VII 16. Springer International Publishing, 2020: 86-104. |
[28] | DONAHUE J, ANNE HENDRICKS L, GUADARRAMA S, et al. Long-term recurrent convolutional networks for visual recognition and description[C]// Proceedings of the IEEE conference on computer vision and pattern recognition, 2015: 2625-2634. |
[29] | SHI Y, TIAN Y, WANG Y, et al. Learning long-term dependencies for action recognition with a biologically-inspired deep network[C]// Proceedings of the IEEE International Conference on Computer Vision, 2017: 716-725. |
[30] | DWIBEDI D, SERMANET P, TOMPSON J. Temporal reasoning in videos using convolutional gated recurrent units[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018: 1111-1116. |
[31] | KIM P S, LEE D G, LEE S W. Discriminative context learning with gated recurrent unit for group activity recognition[J]. Pattern Recognition, 2018, 76: 149-161. |
[32] | CHO K, VAN MERRIËNBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[J]. arXiv preprint arXiv:1406.1078, 2014. |
[33] | ZHANG M, YANG Y, JI Y, et al. Recurrent attention network using spatial-temporal relations for action recognition[J]. Signal Processing, 2018, 145: 137-145. |
[34] | ZHANG J, HU H, LU X. Moving foreground-aware visual attention and key volume mining for human action recognition[J]. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2019, 15(3): 1-16. |
[35] | SHARMA S, KIROS R, SALAKHUTDINOV R. Action recognition using visual attention[J]. arXiv preprint arXiv:1511.04119, 2015. |
[36] | GIRDHAR R, RAMANAN D. Attentional pooling for action recognition[J]. Advances in neural information processing systems, 2017, 30. |
[37] | LI Z, GAVRILYUK K, GAVVES E, et al. Videolstm convolves, attends and flows for action recognition[J]. Computer Vision and Image Understanding, 2018, 166: 41-50. |
[38] | PENG Y, ZHAO Y, ZHANG J. Two-stream collaborative learning with spatial-temporal attention for video classification[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2018, 29(3): 773-786. |
[39] | FU J, GAO J, XU C. Learning semantic-aware spatial-temporal attention for interpretable action recognition[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 32(8): 5213-5224. |
[40] | NEIMARK D, BAR O, ZOHAR M, et al. Video transformer network[C]// Proceedings of the IEEE/CVF international conference on computer vision, 2021: 3163-3172. |
[41] | BULAT A, PEREZ RUA J M, SUDHAKARAN S, et al. Space-time mixing attention for video transformer[J]. Advances in neural information processing systems, 2021, 34: 19594-19607. |
[42] | ZHA X, ZHU W, XUN L, et al. Shifted chunk transformer for spatio-temporal representational learning[J]. Advances in Neural Information Processing Systems, 2021, 34: 11384-11396. |
[43] | GUO H, WANG H, JI Q. Uncertainty-guided probabilistic transformer for complex action recognition[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 20052-20061. |
[44] | LIU C, LI X, CHEN H, et al. Selective feature compression for efficient activity recognition inference[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 13628-13637. |
[45] | YANG J, DONG X, LIU L, et al. Recurring the transformer for video action recognition[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022: 14063-14073. |
[46] | PAN B, PANDA R, JIANG Y, et al. IA-RED²: Interpretability-Aware Redundancy Reduction for Vision Transformers[J]. Advances in Neural Information Processing Systems, 2021, 34: 24898-24911. |
[47] | LIU Q, CHE X, BIE M. R-STAN: Residual spatial-temporal attention network for action recognition[J]. IEEE Access, 2019, 7: 82246-82255. |
[48] | FAN H, XIONG B, MANGALAM K, et al. Multiscale vision transformers[C]// Proceedings of the IEEE/CVF international conference on computer vision, 2021: 6824-6835. |
[49] | ZHANG H, CISSE M, DAUPHIN Y N, et al. mixup: Beyond empirical risk minimization[J]. arXiv preprint arXiv:1710.09412, 2017. |
[50] | CHEN Y, KALANTIDIS Y, LI J, et al. Multi-fiber networks for video recognition[C]// Proceedings of the european conference on computer vision (ECCV), 2018: 352-367. |
[51] | LIN J, GAN C, HAN S. TSM: Temporal shift module for efficient video understanding[C]// Proceedings of the IEEE/CVF international conference on computer vision, 2019: 7083-7093. |
[52] | WANG X, GIRSHICK R, GUPTA A, et al. Non-local neural networks[C]// Proceedings of the IEEE conference on computer vision and pattern recognition, 2018: 7794-7803. |
[53] | CHEN Y, FAN H, XU B, et al. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution[C]// Proceedings of the IEEE/CVF international conference on computer vision, 2019: 3435-3444. |
[54] | TRAN D, WANG H, TORRESANI L, et al. Video classification with channel-separated convolutional networks[C]// Proceedings of the IEEE/CVF international conference on computer vision, 2019: 5552-5561. |
[55] | FEICHTENHOFER C. X3D: Expanding architectures for efficient video recognition[C]// Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020: 203-213. |
[56] | KAY W, CARREIRA J, SIMONYAN K, et al. The Kinetics human action video dataset[J]. arXiv preprint arXiv:1705.06950, 2017. |
[57] | CARREIRA J, NOLAND E, BANKI-HORVATH A, et al. A short note about Kinetics-600[J]. arXiv preprint arXiv:1808.01340, 2018. |
[58] | KONDRATYUK D, YUAN L, LI Y, et al. MoViNets: Mobile video networks for efficient video recognition[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 16020-16030. |
[59] | LIU Z, NING J, CAO Y, et al. Video swin transformer[C]// Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022: 3202-3211. |