Action recognition paper notes - ARTNet - Appearance-and-Relation Networks for Video Classification

Wang, Limin, et al. “Appearance-and-relation networks for video classification.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

Motivation

Three kinds of architectures for video classification: (1) two-stream CNNs (time-consuming, since optical flow must be computed in advance); (2) 3D CNNs (less accurate than two-stream); and (3) 2D CNNs with temporal models on top, such as LSTM, temporal convolution, sparse sampling and aggregation, and attention modeling (weaker at local spatiotemporal representation).

Multiplicative interactions are used to model the relation between different views: gated Boltzmann machines, energy models, and Independent Subspace Analysis (ISA, similar to an energy model but with weights trained from data). They have been applied to optical flow estimation and person re-identification.

Some energy models:

Derpanis, Konstantinos G., et al. “Efficient action spotting based on a spacetime oriented structure representation.” 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2010.

Wang, LiMin, Yu Qiao, and Xiaoou Tang. “Motionlets: Mid-level 3D parts for human motion recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013. (Earlier work by the same first author.)

Solutions

Multiple stacked SMART blocks

Two branches in each SMART block: (1) an appearance branch for spatial modeling and (2) a relation branch for temporal modeling

Flexible implementations:

Saves computation compared with 3D CNNs while preserving accuracy

Strengthens local representations for long-term models

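Relation branch: a square-pooling architecture (similar to ISA) that learns appearance-independent relations between frames
- Square function: a hidden unit Z_k computed from two patches x and y taken from consecutive frames (the forward and backward passes of this module have to be implemented by hand, something to study in the code later; the stride of 2 in the experiments is presumably there to model the relation between two patches and express local features)
- Cross-channel pooling: 1 x 1 x 1 conv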
Appearance branch: 2D CNN
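
As a rough illustration of the square function and cross-channel pooling listed above, here is a minimal pure-Python sketch of a single hidden unit. It assumes the unit has the form z_k = sum_f p_f * (w_x^f . x + w_y^f . y)^2; the function names, the patch size, and the number of filters are all hypothetical and are not taken from the ARTNet code.

```python
import random


def dot(w, v):
    """Inner product of two equal-length vectors."""
    return sum(wi * vi for wi, vi in zip(w, v))


def square_pooling_unit(x, y, filters_x, filters_y, pool_weights):
    """Assumed form: z_k = sum_f pool_weights[f] * (w_x^f . x + w_y^f . y)^2.

    x, y: flattened patches from two consecutive frames.
    filters_x, filters_y: one filter per factor f, applied to x and y.
    pool_weights: cross-channel pooling weights (the role of the 1 x 1 x 1 conv).
    """
    z = 0.0
    for wx, wy, p in zip(filters_x, filters_y, pool_weights):
        response = dot(wx, x) + dot(wy, y)  # linear filter over both patches
        z += p * response ** 2              # square, then pool across channels
    return z


random.seed(0)
dim, n_filters = 9, 4  # hypothetical sizes: a 3 x 3 patch and 4 filters
x = [random.gauss(0, 1) for _ in range(dim)]
y = [random.gauss(0, 1) for _ in range(dim)]
fx = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_filters)]
fy = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_filters)]
pool = [1.0] * n_filters

z = square_pooling_unit(x, y, fx, fy, pool)
# Negating both patches flips every linear response but leaves the squares unchanged.
z_flipped = square_pooling_unit([-v for v in x], [-v for v in y], fx, fy, pool)
```

In this sketch the unit is an even function of the pair (x, y): flipping the sign of both patches gives the same z, which a purely linear filter would not do.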

Experiments

Train on Kinetics; test on UCF101 and HMDB51

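The SMART block increases both the parameter count and the computational cost. This may partly be because GPUs are not specifically optimized for it, but the formulas point the same way: a 3D CNN computes z = x + y, whereas the square function computes z = (x + y)^2, which takes more time.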
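
A back-of-the-envelope count can make the parameter growth noted above concrete. The sketch below only adds up the components named in these notes (a 2D conv for appearance, a 3D conv followed by a 1 x 1 x 1 cross-channel pooling for relation); the channel widths and kernel sizes are hypothetical, biases are ignored, and the actual ARTNet layer configuration may differ.

```python
def conv_params(c_in, c_out, kernel):
    """Number of weights in a convolution without bias."""
    n = c_in * c_out
    for k in kernel:
        n *= k
    return n


c_in, c_out = 64, 64  # hypothetical channel widths

# A plain 3D convolution with an assumed 3 x 3 x 3 kernel.
plain_3d = conv_params(c_in, c_out, (3, 3, 3))

# Components of a SMART-like block as listed in these notes (assumed kernel sizes).
smart_like = (
    conv_params(c_in, c_out, (1, 3, 3))      # appearance branch: 2D conv
    + conv_params(c_in, c_out, (3, 3, 3))    # relation branch: 3D conv before squaring
    + conv_params(c_out, c_out, (1, 1, 1))   # cross-channel pooling: 1 x 1 x 1 conv
)

print(plain_3d, smart_like)
```

Under these assumptions the SMART-like block carries every weight of the plain 3D convolution plus the extra appearance and pooling weights, so its count is necessarily larger.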

x + y keeps the linear part; x * y provides the independence.
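
One way to read the remark above, assuming the hidden unit squares a sum of two linear filter responses (the symbols w_x and w_y are illustrative names, not taken from these notes): expanding the square exposes a cross term that multiplies the two patches.

$$
(w_x^\top x + w_y^\top y)^2 = (w_x^\top x)^2 + 2\,(w_x^\top x)(w_y^\top y) + (w_y^\top y)^2
$$

The middle term is the multiplicative interaction between the two frames; a plain linear filter over both patches, w_x^T x + w_y^T y, has no such term.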

English Expression

Assuming the independence between appearance and relation, it is reasonable to decouple these two kinds of information when designing learning modules.

Advantages and Drawbacks

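The authors devise their own form of energy model, built on patches from consecutive frames (a local feature representation between 2 frames; whether multiple frames or frame skipping are supported is unknown and requires checking the code).
For temporal modeling, it is not necessarily more efficient or lighter in parameters than a 3D CNN. For interpretability, it is hard to argue that the hand-designed Z_k formula is more interpretable than a 3D CNN. A 3D CNN certainly supports multi-frame temporal modeling; does the square-pooling here only support modeling between 2 consecutive frames? This needs a code check. If anyone has checked, please let me know.
It pays more attention to local features than long-term models do, which is good.
ARTNet's stacked blocks do not use a residual structure, so the deeper the network, the weaker the temporal information.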
