Paper: Self-supervised Video Representation Learning with Space-Time Cubic Puzzles (AAAI 2019)

Authors: Dahun Kim, Donghyeon Cho, In So Kweon

Paper link: https://ojs.aaai.org/index.php/AAAI/article/view/4873


Contributions

In this paper, we introduce a new self-supervised task, called Space-Time Cubic Puzzles, to train 3D CNNs (3D ResNet-18) on large-scale video data. Given randomly permuted 3D spatio-temporal crops extracted from each video clip, the network is trained to predict their original spatio-temporal arrangement. By solving Space-Time Cubic Puzzles, the network learns a self-supervised video representation from unlabeled videos. In experiments, we demonstrate that the learned 3D representation transfers well to action recognition and outperforms state-of-the-art 2D CNN-based competitors on the UCF101 and HMDB51 datasets.


Method

1、Pretext Task: Space-Time Cubic Puzzles

To generate the puzzle pieces, we consider a spatio-temporal cuboid of 2 × 2 × 4 grid cells for each video. Rather than using all 16 cells, we sample only 4 crops, along either the spatial or the temporal dimension. More specifically, the 3D crops are extracted either from a 2 × 2 × 1 slice of the grid (the four spatial cells at one temporal position) or from a 1 × 1 × 4 column (the four temporal cells at one spatial position), as sketched below.
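The following is a minimal sketch of this sampling step, assuming the clip is a tensor of shape (C, T, H, W) and the grid is 2 × 2 spatial × 4 temporal; the function and constant names are illustrative, not taken from the authors' code, and the spatio-temporal jittering described later is omitted here for brevity.

```python
import itertools
import random
import torch

# All 4! = 24 possible orderings of the four crops; the class label is the
# index of the permutation that was applied to shuffle them.
PERMUTATIONS = list(itertools.permutations(range(4)))

def sample_cubic_puzzle(clip: torch.Tensor):
    """clip: (C, T, H, W). Returns 4 shuffled crops and the permutation label."""
    C, T, H, W = clip.shape
    cells = []
    if random.random() < 0.5:
        # Spatial 2 x 2 x 1 case: fix one of the 4 temporal cells and
        # take the four spatial cells at that temporal position.
        t0 = random.randrange(4) * (T // 4)
        for i in range(2):
            for j in range(2):
                cells.append(clip[:, t0:t0 + T // 4,
                                  i * (H // 2):(i + 1) * (H // 2),
                                  j * (W // 2):(j + 1) * (W // 2)])
    else:
        # Temporal 1 x 1 x 4 case: fix one of the 4 spatial cells and
        # take the four temporal cells at that spatial position.
        i, j = random.randrange(2), random.randrange(2)
        for k in range(4):
            cells.append(clip[:, k * (T // 4):(k + 1) * (T // 4),
                              i * (H // 2):(i + 1) * (H // 2),
                              j * (W // 2):(j + 1) * (W // 2)])
    perm = random.choice(PERMUTATIONS)
    crops = [cells[p] for p in perm]      # shuffled puzzle pieces
    label = PERMUTATIONS.index(perm)      # target: which permutation was used
    return crops, label
```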

2、Network

We use a late-fusion architecture: a 4-tower siamese network in which the towers share parameters and follow the 3D ResNet architecture. Each 3D crop is processed separately until the fully-connected layers, and each tower is agnostic to whether the input crops were sampled along the spatial or the temporal dimension. Similar to the 2D jigsaw puzzle problem, we formulate the rearrangement problem as a multi-class classification task over the possible orderings. In practice, for each tuple of four crops, we additionally flip all the frames upside-down with 50% probability, doubling the number of classes to 48 (that is, 2 × 4!), which further boosts performance. A sketch of the fusion head follows.
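Below is a minimal sketch of the late-fusion classifier, assuming `backbone` is a shared 3D ResNet-18 feature extractor that returns a pooled feature vector per crop; the feature dimension (512) and the hidden-layer size are assumptions, not values confirmed by the paper.

```python
import torch
import torch.nn as nn

class CubicPuzzleNet(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int = 512, num_classes: int = 48):
        super().__init__()
        self.backbone = backbone          # shared weights: the same tower processes every crop
        self.fc = nn.Sequential(
            nn.Linear(4 * feat_dim, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, num_classes),  # 48 = 2 (flip) x 4! (orderings)
        )

    def forward(self, crops):             # crops: list of 4 tensors, each (B, C, T, H, W)
        feats = [self.backbone(c) for c in crops]    # each tower sees one crop separately
        return self.fc(torch.cat(feats, dim=1))      # late fusion by concatenation
```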

3、Avoiding Trivial Learning

When designing a pretext task, it is crucial to ensure that the task forces the network to learn the desired semantic structure, rather than bypassing that understanding by exploiting low-level clues that reveal the location of a video crop. To suppress color-based shortcuts such as chromatic aberration, we apply channel replication as data preprocessing, i.e. one color channel is selected and replicated to all three channels. Another often-cited worry in context-based works is trivial low-level boundary pattern matching; we therefore apply spatio-temporal jittering when extracting each video crop from its grid cell, so that neighboring crops never share exact boundaries. Both measures are sketched below.
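A minimal sketch of these two anti-shortcut transforms, written as standalone tensor operations; the function names, and the crop/jitter sizes borrowed from the implementation details below, are my own framing rather than the authors' code.

```python
import random
import torch

def channel_replicate(clip: torch.Tensor) -> torch.Tensor:
    """Pick one color channel at random and copy it to all three channels,
    removing color cues such as chromatic aberration. clip: (3, T, H, W)."""
    c = random.randrange(3)
    return clip[c:c + 1].repeat(3, 1, 1, 1)

def jittered_crop(cell: torch.Tensor, out_t: int = 16, out_h: int = 80, out_w: int = 80) -> torch.Tensor:
    """Randomly place an out_h x out_w x out_t crop inside a grid cell (C, T, H, W),
    so adjacent crops never share boundaries that could be matched at low level."""
    C, T, H, W = cell.shape
    t0 = random.randint(0, T - out_t)
    y0 = random.randint(0, H - out_h)
    x0 = random.randint(0, W - out_w)
    return cell[:, t0:t0 + out_t, y0:y0 + out_h, x0:x0 + out_w]
```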

4、Implementation Details

We use video clips with 224 × 224 pixel frames and convert every video file into PNG images for our experiments. For pre-training, we sample 128 consecutive frames from each clip and split them into the 2 × 2 × 4 grid; that is, one grid cell consists of 112 × 112 × 32 pixels, and from each cell we sample an 80 × 80 × 16 crop with random jittering to generate a 3D video crop. During fine-tuning and testing, we randomly sample 16 consecutive frames per clip and spatially resize the frames to 112 × 112 pixels. At test time, we adopt a sliding-window scheme to generate input clips, splitting each video into non-overlapping 16-frame clips; the clip class scores are averaged over all clips of the video, as sketched below.
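A minimal sketch of the test-time scoring just described; the model interface and frame loading are placeholders, and averaging softmax scores is an assumption about how the clip scores are combined.

```python
import torch

@torch.no_grad()
def video_score(model: torch.nn.Module, frames: torch.Tensor, clip_len: int = 16) -> torch.Tensor:
    """frames: (C, T, 112, 112) tensor holding the whole resized video.
    Splits it into non-overlapping 16-frame clips and averages their class scores."""
    model.eval()
    T = frames.shape[1]
    clip_scores = []
    for start in range(0, T - clip_len + 1, clip_len):         # non-overlapping windows
        clip = frames[:, start:start + clip_len].unsqueeze(0)  # (1, C, 16, 112, 112)
        clip_scores.append(model(clip).softmax(dim=1))
    return torch.stack(clip_scores).mean(dim=0).squeeze(0)     # averaged class scores
```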


Results
