Paper: Self-supervised Video Representation Learning with Space-Time Cubic Puzzles (AAAI 2019)

Authors: Dahun Kim, Donghyeon Cho, In So Kweon

Paper link: https://ojs.aaai.org/index.php/AAAI/article/view/4873


Contributions

In this paper, we introduce a new self-supervised task, called Space-Time Cubic Puzzles, to train 3D CNNs (3D ResNet-18) on large-scale video data. Given randomly permuted 3D spatio-temporal crops extracted from each video clip, the network is trained to predict their original spatio-temporal arrangement. By solving Space-Time Cubic Puzzles, the network learns a self-supervised video representation from unlabeled videos. In experiments, we demonstrate that the learned 3D representation transfers well to action recognition and outperforms state-of-the-art 2D CNN-based competitors on the UCF101 and HMDB51 datasets.


Method

1、Pretext Task: Space-Time Cubic Puzzles

To generate the puzzle pieces, we consider a spatio-temporal cuboid of 2 × 2 × 4 grid cells for each video. Rather than using all 16 cells, we sample only 4 crops, along either the spatial or the temporal dimension. More specifically, the 3D crops are extracted either from a 2 × 2 × 1 slice of the grid (the four spatial cells at one temporal position) or from a 1 × 1 × 4 column (the four temporal cells at one spatial position), as sketched below.
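The following is a minimal sketch of this sampling step, assuming the clip is a tensor of shape (C, T, H, W) and the grid is 2 × 2 spatial × 4 temporal; the function and constant names are illustrative, not taken from the authors' code, and the spatio-temporal jittering described later is omitted here for brevity.

```python
import itertools
import random
import torch

# All 4! = 24 possible orderings of the four crops; the class label is the
# index of the permutation that was applied to shuffle them.
PERMUTATIONS = list(itertools.permutations(range(4)))

def sample_cubic_puzzle(clip: torch.Tensor):
    """clip: (C, T, H, W). Returns 4 shuffled crops and the permutation label."""
    C, T, H, W = clip.shape
    cells = []
    if random.random() < 0.5:
        # Spatial 2 x 2 x 1 case: fix one of the 4 temporal cells and
        # take the four spatial cells at that temporal position.
        t0 = random.randrange(4) * (T // 4)
        for i in range(2):
            for j in range(2):
                cells.append(clip[:, t0:t0 + T // 4,
                                  i * (H // 2):(i + 1) * (H // 2),
                                  j * (W // 2):(j + 1) * (W // 2)])
    else:
        # Temporal 1 x 1 x 4 case: fix one of the 4 spatial cells and
        # take the four temporal cells at that spatial position.
        i, j = random.randrange(2), random.randrange(2)
        for k in range(4):
            cells.append(clip[:, k * (T // 4):(k + 1) * (T // 4),
                              i * (H // 2):(i + 1) * (H // 2),
                              j * (W // 2):(j + 1) * (W // 2)])
    perm = random.choice(PERMUTATIONS)
    crops = [cells[p] for p in perm]      # shuffled puzzle pieces
    label = PERMUTATIONS.index(perm)      # target: which permutation was used
    return crops, label
```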

2、Network

We use a late-fusion architecture: a 4-tower siamese network in which the towers share parameters and follow the 3D ResNet architecture. Each 3D crop is processed separately until the fully-connected layers, and each tower is agnostic to whether the input crops were sampled along the spatial or the temporal dimension. Similar to the 2D jigsaw puzzle problem, we formulate the rearrangement problem as a multi-class classification task over the possible orderings. In practice, for each tuple of four crops, we additionally flip all the frames upside-down with 50% probability, doubling the number of classes to 48 (that is, 2 × 4!), which further boosts performance. A sketch of the fusion head follows.
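Below is a minimal sketch of the late-fusion classifier, assuming `backbone` is a shared 3D ResNet-18 feature extractor that returns a pooled feature vector per crop; the feature dimension (512) and the hidden-layer size are assumptions, not values confirmed by the paper.

```python
import torch
import torch.nn as nn

class CubicPuzzleNet(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int = 512, num_classes: int = 48):
        super().__init__()
        self.backbone = backbone          # shared weights: the same tower processes every crop
        self.fc = nn.Sequential(
            nn.Linear(4 * feat_dim, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, num_classes),  # 48 = 2 (flip) x 4! (orderings)
        )

    def forward(self, crops):             # crops: list of 4 tensors, each (B, C, T, H, W)
        feats = [self.backbone(c) for c in crops]    # each tower sees one crop separately
        return self.fc(torch.cat(feats, dim=1))      # late fusion by concatenation
```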

3、Avoiding Trivial Learning

When designing a pretext task, it is crucial to ensure that the task forces the network to learn the desired semantic structure, rather than bypassing that understanding by exploiting low-level clues that reveal the location of a video crop. To suppress color-based shortcuts such as chromatic aberration, we apply channel replication as data preprocessing, i.e. one color channel is selected and replicated to all three channels. Another often-cited worry in context-based works is trivial low-level boundary pattern matching; we therefore apply spatio-temporal jittering when extracting each video crop from its grid cell, so that neighboring crops never share exact boundaries. Both measures are sketched below.
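A minimal sketch of these two anti-shortcut transforms, written as standalone tensor operations; the function names, and the crop/jitter sizes borrowed from the implementation details below, are my own framing rather than the authors' code.

```python
import random
import torch

def channel_replicate(clip: torch.Tensor) -> torch.Tensor:
    """Pick one color channel at random and copy it to all three channels,
    removing color cues such as chromatic aberration. clip: (3, T, H, W)."""
    c = random.randrange(3)
    return clip[c:c + 1].repeat(3, 1, 1, 1)

def jittered_crop(cell: torch.Tensor, out_t: int = 16, out_h: int = 80, out_w: int = 80) -> torch.Tensor:
    """Randomly place an out_h x out_w x out_t crop inside a grid cell (C, T, H, W),
    so adjacent crops never share boundaries that could be matched at low level."""
    C, T, H, W = cell.shape
    t0 = random.randint(0, T - out_t)
    y0 = random.randint(0, H - out_h)
    x0 = random.randint(0, W - out_w)
    return cell[:, t0:t0 + out_t, y0:y0 + out_h, x0:x0 + out_w]
```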

4、Implementation Details

We use video clips with 224 × 224 pixel frames and convert every video file into PNG images for our experiments. For pre-training, we sample 128 consecutive frames from each clip and split them into the 2 × 2 × 4 grid; that is, one grid cell consists of 112 × 112 × 32 pixels, and from each cell we sample an 80 × 80 × 16 crop with random jittering to generate a 3D video crop. During fine-tuning and testing, we randomly sample 16 consecutive frames per clip and spatially resize the frames to 112 × 112 pixels. At test time, we adopt a sliding-window scheme to generate input clips, splitting each video into non-overlapping 16-frame clips; the clip class scores are averaged over all clips of the video, as sketched below.
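A minimal sketch of the test-time scoring just described; the model interface and frame loading are placeholders, and averaging softmax scores is an assumption about how the clip scores are combined.

```python
import torch

@torch.no_grad()
def video_score(model: torch.nn.Module, frames: torch.Tensor, clip_len: int = 16) -> torch.Tensor:
    """frames: (C, T, 112, 112) tensor holding the whole resized video.
    Splits it into non-overlapping 16-frame clips and averages their class scores."""
    model.eval()
    T = frames.shape[1]
    clip_scores = []
    for start in range(0, T - clip_len + 1, clip_len):         # non-overlapping windows
        clip = frames[:, start:start + clip_len].unsqueeze(0)  # (1, C, 16, 112, 112)
        clip_scores.append(model(clip).softmax(dim=1))
    return torch.stack(clip_scores).mean(dim=0).squeeze(0)     # averaged class scores
```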


Results
