(Note: to avoid misunderstandings caused by an inaccurate translation, the original sentences from the paper are included.)

Paper: Wang L, Xiong Y, Wang Z, et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. 2016.

Link: https://arxiv.org/abs/1608.00859

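The paper was published at ECCV 2016. It proposes TSN (temporal segment network) for action recognition in video. TSN can be seen as an improvement on the two-stream method (for the two-stream method, see my earlier blog post).
The authors first point out two major difficulties in using ConvNets for video action recognition: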

First, long-range temporal structure plays an important role in understanding the dynamics in action videos. However, mainstream ConvNet frameworks usually focus on appearances and short-term motions, thus lacking the capacity to incorporate long-range temporal structure. Recently there are a few attempts to deal with this problem. These methods mostly rely on dense temporal sampling with a pre-defined sampling interval. This approach would incur excessive computational cost when applied to long video sequences, which limits its application in real-world practice and poses a risk of missing important information for videos longer than the maximal sequence length.

Second, in practice, training deep ConvNets requires a large volume of training samples to achieve optimal performance. However, due to the difficulty in data collection and annotation, publicly available action recognition datasets (e.g. UCF101, HMDB51) remain limited, in both size and diversity. Consequently, very deep ConvNets, which have attained remarkable success in image classification, are confronted with high risk of over-fitting.

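In short:
1. Long-range temporal structure is very important for understanding the dynamics in a video, but many existing networks cannot capture it. Some of them rely on dense temporal sampling with a pre-defined interval, which is computationally expensive on long videos. Because such methods are limited to a maximal sequence length, they can only cover part of a longer video and risk missing important information.
2. Training data was scarce (as of 2015), so very deep networks overfit easily.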

To address these two difficulties, the authors propose the following solutions, which are the contributions of the paper:

Our first contribution is temporal segment network (TSN), a novel framework for video-based action recognition. which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. (From the Abstract; the third-to-last paragraph of the Introduction says the same thing.)

to overcome the aforementioned difficulties caused by the limited number of training samples, including 1) cross-modality pre-training; 2) regularization; 3) enhanced data augmentation. Meanwhile, to fully utilize visual content from videos, we empirically study four types of input modalities to two-stream ConvNets, namely a single RGB image, stacked RGB difference, stacked optical flow field, and stacked warped optical flow field. (From the second-to-last paragraph of the Introduction.)

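In short:
1. The paper proposes the TSN network, which uses sparse sampling to keep computation efficient while still covering the whole video, unlike the two-stream method, which only uses a single image and one stack of consecutive optical flow (redundant information).
2. Several training tricks: 1) cross-modality pre-training; 2) regularization; 3) data augmentation. To make full use of the visual content in videos, four input modalities are studied: RGB, RGB difference, optical flow, and warped optical flow.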
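
Of the four inputs, RGB difference is the easiest to picture: it is the pixel-wise difference between consecutive frames. The snippet below is only a minimal sketch of that idea, not the authors' code. The function name `stacked_rgb_difference` and the representation of a frame as a flat list of pixel values are assumptions made for illustration.

```python
def stacked_rgb_difference(frames):
    """Return the differences between consecutive frames.

    frames: list of T frames, each a flat list of pixel values of equal length
    (a hypothetical, simplified stand-in for real image arrays).
    Returns a list of T - 1 difference frames.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        # Pixel-wise change from the previous frame to the current one.
        diffs.append([c - p for p, c in zip(prev, curr)])
    return diffs


# Three tiny "frames" with three pixel values each.
frames = [[10, 20, 30], [12, 18, 30], [15, 18, 25]]
diffs = stacked_rgb_difference(frames)
print(diffs)  # [[2, -2, 0], [3, 0, -5]]
```

Static pixels give zeros, so the stacked differences mostly carry appearance change, which is why this input can serve as a cheap proxy for motion.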

Some of the concepts in these contributions are explained below.

The TSN network

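Like the two-stream method, the network consists of a spatial stream ConvNet (the green blocks in the paper's architecture figure) and a temporal stream ConvNet (the blue blocks in that figure). The differences are that it uses the deeper BN-Inception network and fuses the recognition results of several sampled snippets at the end.
During training, samples are drawn randomly: the video is divided evenly into K segments (3 in the figure), and from each segment one RGB frame is randomly picked as input to the spatial ConvNet, together with a certain number of optical flow fields (I could not find the exact number in the paper) as input to the temporal ConvNet. Each segment yields a class score, so in the figure the spatial and temporal streams each produce 3 scores. These scores are then fused by a consensus method (the options are evenly averaging, maximum, and weighted averaging) to give each stream's class scores for the whole video. The spatial and temporal networks are trained separately, so each can be used on its own; when both are used together for prediction, their scores are combined with a weight.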
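
The sampling and fusion steps described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, scores are plain lists rather than network outputs, and the default stream weights in `fuse_streams` are placeholder values.

```python
import random


def sample_snippet_indices(num_frames, num_segments=3, rng=random):
    """Pick one random frame index from each of num_segments equal segments."""
    if num_frames < num_segments:
        raise ValueError("video is shorter than the number of segments")
    indices = []
    for k in range(num_segments):
        # Integer arithmetic keeps the segments contiguous and non-empty.
        start = k * num_frames // num_segments
        end = (k + 1) * num_frames // num_segments  # exclusive
        indices.append(rng.randrange(start, end))
    return indices


def consensus(snippet_scores, mode="avg", weights=None):
    """Fuse per-snippet class scores into one score vector for the video."""
    num_classes = len(snippet_scores[0])
    columns = [[s[c] for s in snippet_scores] for c in range(num_classes)]
    if mode == "avg":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "weighted":
        if weights is None:
            raise ValueError("weighted consensus needs weights")
        total = sum(weights)
        return [sum(w * v for w, v in zip(weights, col)) / total
                for col in columns]
    raise ValueError("unknown mode: " + mode)


def fuse_streams(spatial, temporal, w_spatial=1.0, w_temporal=1.5):
    """Weighted sum of the two streams' scores (weights are placeholders)."""
    return [w_spatial * s + w_temporal * t for s, t in zip(spatial, temporal)]


rng = random.Random(0)
indices = sample_snippet_indices(90, num_segments=3, rng=rng)

# Hypothetical scores for 3 snippets and 2 classes.
scores = [[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]]
video_avg = consensus(scores, mode="avg")
video_max = consensus(scores, mode="max")
```

With a 90-frame video and K = 3, the three indices fall in frames 0-29, 30-59 and 60-89 respectively, so the snippets always span the whole video no matter how long it is, while the cost stays fixed at K forward passes per stream.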

Training tricks

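1. The spatial ConvNet is initialized from a model pre-trained on ImageNet and then fine-tuned.
2. For the temporal ConvNet, the authors propose cross-modality pre-training: the weights of the RGB network are used to initialize the networks of the other modalities. The weights cannot simply be copied, because the input distributions differ. Instead, the data of the other modality is first mapped linearly into 0~255, the range of RGB data. The change of input also affects the first convolutional layer directly, so the authors modify its weights by hand: the weights are averaged across the RGB input channels, and the average is replicated to every input channel of the new network.
3. Modified BN layers: a BN layer normalizes each batch toward a standard Gaussian distribution, which speeds up convergence but carries a risk of overfitting. The paper therefore freezes the mean and variance of all BN layers except the first one instead of updating them; the first layer is re-estimated because its input distribution differs from that of RGB images. This modification is called partial BN. (I do not fully understand this part.)
4. Data augmentation.
5. Dropout.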
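
The cross-modality initialization in item 2 can be made concrete with a small sketch. It is not the authors' code: the function names, the nested-list weight layout `[out][in][kh][kw]`, the clipping bound of 20 for flow values, and the choice of 10 input channels are all assumptions made for illustration.

```python
def discretize_flow(values, bound=20.0):
    """Linearly map flow values, clipped to [-bound, bound], onto 0..255.

    The bound is an assumed value; the post only says the mapping is linear.
    """
    out = []
    for v in values:
        v = max(-bound, min(bound, v))
        out.append(int(round((v + bound) * 255.0 / (2.0 * bound))))
    return out


def cross_modality_init(rgb_weights, num_in_channels):
    """Build first-layer weights for a new modality from RGB weights.

    rgb_weights: nested lists shaped [out][3][kh][kw].
    Returns nested lists shaped [out][num_in_channels][kh][kw], where every
    input channel holds the average of the RGB channels of that filter.
    """
    new_weights = []
    for filt in rgb_weights:
        n_rgb = len(filt)
        kh, kw = len(filt[0]), len(filt[0][0])
        # Average the kernel over the RGB input channels.
        mean = [[sum(filt[c][i][j] for c in range(n_rgb)) / n_rgb
                 for j in range(kw)] for i in range(kh)]
        # Replicate the averaged kernel to every input channel.
        new_weights.append([[row[:] for row in mean]
                            for _ in range(num_in_channels)])
    return new_weights


# One hypothetical 1x1 filter with RGB weights 1.0, 2.0 and 3.0.
rgb = [[[[1.0]], [[2.0]], [[3.0]]]]
flow_weights = cross_modality_init(rgb, num_in_channels=10)
print(len(flow_weights[0]), flow_weights[0][0])  # 10 [[2.0]]
```

Averaging before replicating keeps the response of each filter on roughly the same scale as in the RGB model, which is what makes the pre-trained weights a usable starting point for an input with a different number of channels.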

Personal impressions

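I mainly wanted to study the network structure. TSN is an enhanced version of two-stream, and on the network side its main idea is the sparse sampling strategy. It uses several kinds of input (RGB, RGB difference, optical flow, warped optical flow), which I personally find somewhat complicated. In my view a good algorithm would take only raw RGB input (simple), still reach high recognition accuracy, and be cheap to compute. The training uses many tricks, and these are well worth learning from.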

