Table of Contents

  • 1. Paper overview
  • 2. Another approach to CNN-based optical flow estimation
  • 3. Where the idea of stacked refinement for optical flow comes from
  • 4. FlyingThings3D (Things3D) dataset
  • 5. The order of presenting training data with different properties matters.
  • 6. Stacking two identical networks (FlowNetS)
  • 7. Stacking different architectures (FlowNetC+FlowNetS, or a 3/8-channel FlowNetS)
  • 8. Training FlowNet-CSS seems cumbersome
  • 9. Small displacements are handled well by traditional methods
  • 10. Adding a dedicated network for small displacements
  • 11. Comparison of results across models

1. Paper overview

This paper is the evolution of FlowNet. As the pioneering work on CNN-based optical flow estimation, FlowNet naturally left plenty of room for improvement, and FlowNet 2.0 improves on it in three respects:

  • (1) Data: enlarge the training data with FlyingThings3D, plus ChairsSDHom, a dataset focused on small displacements. Experiments also show that the order in which datasets with different properties are presented strongly affects model performance; it is better to learn on the simpler dataset first and the harder one afterwards.
  • (2) Architecture: enlarge the model by stacking FlowNetS networks on top of FlowNetC, so that each sub-network refines the flow estimate of the previous one. The model also introduces warping: the intermediate flow estimate is used to warp the second image (or its features) back toward the first, and the remaining residual is then refined, gradually shrinking the gap between the prediction and the ground truth. This resembles the idea behind GBDT in machine learning; in traditional flow methods (e.g., the actual implementation of DIS), the iterative refinement embodies the same shrink-the-residual idea.
  • (3) Small displacements: design a dedicated network for small-displacement scenes, and synthesize ChairsSDHom, a small-displacement dataset.

First, we evaluate the influence of dataset schedules. Interestingly, the more sophisticated training data provided by Mayer et al. [18] leads to inferior results if used in isolation. However, a learning schedule consisting of multiple datasets improves results significantly. In this scope, we also found that the FlowNet version with an explicit correlation layer outperforms the version without such layer. This is in contrast to the results reported in Dosovitskiy et al. [10].

As a second contribution, we introduce a warping operation and show how stacking multiple networks using this operation can significantly improve the results. By varying the depth of the stack and the size of individual components we obtain many network variants with different size and runtime. This allows us to control the trade-off between accuracy and computational resources. We provide networks for the spectrum between 8fps and 140fps.

Finally, we focus on small, subpixel motion and real-world data. To this end, we created a special training dataset and a specialized network. We show that the architecture trained with this dataset performs well on small motions typical for real-world videos. To reach optimal performance on arbitrary displacements, we add a network that learns to fuse the former stacked network with the small displacement network in an optimal manner.
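The warping operation introduced here is just backward warping with bilinear interpolation: the current flow estimate tells each pixel of the first image where to sample in the second. Below is a minimal pure-Python sketch of the idea (a hypothetical `warp` helper, not the authors' CUDA layer; images are plain nested lists):

```python
def warp(img2, flow):
    """Backward-warp img2 toward img1 using an estimated flow field.

    img2: H x W nested list of grayscale values.
    flow: H x W nested list of (u, v) displacements, meaning that
          pixel (x, y) of img1 corresponds to (x + u, y + v) in img2.
    Samples img2 with bilinear interpolation; out-of-bounds
    coordinates are clamped to the image border.
    """
    H, W = len(img2), len(img2[0])
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            u, v = flow[y][x]
            sx, sy = x + u, y + v                   # sampling position in img2
            x0 = max(0, min(W - 1, int(sx // 1)))   # floor, clamped
            y0 = max(0, min(H - 1, int(sy // 1)))
            x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
            ax = max(0.0, min(1.0, sx - x0))        # bilinear weights
            ay = max(0.0, min(1.0, sy - y0))
            top = (1 - ax) * img2[y0][x0] + ax * img2[y0][x1]
            bot = (1 - ax) * img2[y1][x0] + ax * img2[y1][x1]
            out[y][x] = (1 - ay) * top + ay * bot
    return out
```

With `flow` set to the true motion, `warp(img2, flow)` should look like `img1`; the brightness difference between the two is the residual the next stacked network gets to fix.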

2. Another approach to CNN-based optical flow estimation

An alternative approach to learning-based optical flow estimation is to use CNNs to match image patches. Thewlis et al. [29] formulate Deep Matching [31] as a CNN and optimize it end-to-end. Gadot & Wolf [12] and Bailer et al. [3] learn image patch descriptors using Siamese network architectures. These methods can reach good accuracy, but require exhaustive matching of patches. Thus, they are restrictively slow for most practical applications. Moreover, methods based on (small) patches are inherently unable to use the larger whole-image context.

Using CNNs to match patches has two problems: (1) it is too slow; (2) because the patches are local, the whole-image context cannot be exploited (in my view this fails to tap the full potential of CNNs).

3. Where the idea of stacked refinement for optical flow comes from

CNNs trained for per-pixel prediction tasks often produce noisy or blurry results. As a remedy, off-the-shelf optimization can be applied to the network predictions (e.g., optical flow can be postprocessed with a variational approach [10]). In some cases, this refinement can be approximated by neural networks: Chen & Pock [9] formulate their reaction diffusion model as a CNN and apply it to image denoising, deblocking and superresolution. Recently, it has been shown that similar refinement can be obtained by stacking several CNNs on top of each other. This led to improved results in human pose estimation [17, 8] and semantic instance segmentation [22]. In this paper we adapt the idea of stacking networks to optical flow estimation.

Per-pixel prediction with CNNs does indeed tend to produce lots of noise or blurry boundaries.

4. FlyingThings3D (Things3D) dataset

The FlyingThings3D (Things3D) dataset proposed by Mayer et al. [18] can be seen as a three-dimensional version of Chairs: 22k renderings of random scenes show 3D models from the ShapeNet dataset [23] moving in front of static 3D backgrounds. In contrast to Chairs, the images show true 3D motion and lighting effects and there is more variety among the object models.

5. The order of presenting training data with different properties matters.

Also: FlowNetC outperforms FlowNetS.
This overturns the conclusion drawn in FlowNet v1.
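The dataset-schedule finding can be expressed as a tiny training driver: learn on the simpler FlyingChairs data first, then continue on the harder Things3D data with a reduced learning rate. The iteration counts and rates below are illustrative placeholders, not the paper's S_long/S_fine numbers, and `train` is a hypothetical stand-in:

```python
# Hypothetical training driver illustrating the Chairs -> Things3D schedule.
def make_schedule():
    # (dataset, iterations, initial learning rate); the numbers are
    # placeholders for illustration, not the paper's exact values.
    return [
        ("FlyingChairs", 600_000, 1e-4),  # easy 2D data first
        ("Things3D", 250_000, 1e-5),      # harder 3D data second, lower lr
    ]

def run(schedule, train):
    """Run each training phase in order; `train` is any callable
    taking (dataset, iterations, learning_rate)."""
    for dataset, iterations, lr in schedule:
        train(dataset, iterations, lr)
```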

6. Stacking two identical networks (FlowNetS)

We make the following observations: (1) Just stacking networks without warping yields better results on Chairs, but worse on Sintel; the stacked network is over-fitting. (2) Stacking with warping always improves results. (3) Adding an intermediate loss after Net1 is advantageous when training the stacked network end-to-end. (4) The best results are obtained by keeping the first network fixed and only training the second network after the warping operation.

Clearly, since the stacked network is twice as big as the single network, over-fitting is an issue. The positive effect of flow refinement after warping can counteract this problem, yet the best of both is obtained when the stacked networks are trained one after the other, since this avoids over-fitting while having the benefit of flow refinement.

When stacking two identical networks (FlowNetS), over-fitting comes easily. Adding warping, or training the networks one after the other (freezing the first network's weights while updating the second), both mitigate over-fitting to some extent.

The paper also mentions that earlier work trained on FlyingChairs and tested on Sintel; to check for over-fitting, the authors also used part of the FlyingChairs data at test time.
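The benefit of warping in the stack can be illustrated with a toy 1-D analogue: after warping, each stage only sees the remaining residual and predicts a correction, so the error shrinks stage by stage even though each stage is imperfect. This is a sketch of the principle only, not the CNN:

```python
def refine_stack(flow_gt, stages, gain=0.5):
    """Toy 1-D analogue of stacked refinement with warping.

    Each stage observes the residual between the ground-truth flow and
    the current estimate (which is what warping exposes to the next
    network) and corrects a fraction `gain` of it, mimicking an
    imperfect sub-network. Returns the estimate after each stage.
    """
    estimate = 0.0
    history = []
    for _ in range(stages):
        residual = flow_gt - estimate  # revealed by warping
        estimate += gain * residual    # imperfect per-stage correction
        history.append(estimate)
    return history
```

With `gain=0.5` the remaining error halves at every stage: `refine_stack(8.0, 3)` gives `[4.0, 6.0, 7.0]`.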

7. Stacking different architectures (FlowNetC+FlowNetS, or a 3/8-channel FlowNetS)

The lowercase s denotes a FlowNetS with 3/8 of the channels.

Two small networks can be both faster and better than one large network.
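Shrinking a network by a channel fraction is straightforward: multiply every layer's channel count by the factor and round. A sketch, with illustrative (not the paper's exact) FlowNetS-style channel counts:

```python
def scale_channels(channels, factor=3 / 8):
    """Shrink every layer's channel count by `factor` (the paper's small
    variants use 3/8), keeping at least one channel per layer."""
    return [max(1, round(c * factor)) for c in channels]

# Illustrative contracting-part channel counts (not the exact architecture):
full = [64, 128, 256, 512, 512, 1024]
small = scale_channels(full)  # -> [24, 48, 96, 192, 192, 384]
```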

8. Training FlowNet-CSS seems cumbersome

As also done in [17, 9], we therefore add networks with different weights to the stack. Compared to identical weights, stacking networks with different weights increases the memory footprint, but does not increase the runtime. In this case the top networks are not constrained to a general improvement of their input, but can perform different tasks at different stages and the stack can be trained in smaller pieces by fixing existing networks and adding new networks one-by-one. We do so by using the Chairs→Things3D schedule from Section 3 for every new network and the best configuration with warping from Section 4.1. Furthermore, we experiment with different network sizes and alternatively use FlowNetS or FlowNetC as a bootstrapping network. We use FlowNetC only in case of the bootstrap network, as the input to the next network is too diverse to be properly handled by the Siamese structure of FlowNetC. Smaller size versions of the networks were created by taking only a fraction of the number of channels for every layer in the network. Figure 4 shows the network accuracy and runtime for different network sizes of a single FlowNetS. Factor 3/8 yields a good trade-off between speed and accuracy when aiming for faster networks.

9. Small displacements are handled well by traditional methods

While the original FlowNet [10] performed well on the Sintel benchmark, limitations in real-world applications have become apparent. In particular, the network cannot reliably estimate small motions (see Figure 1). This is counter-intuitive, since small motions are easier for traditional methods, and there is no obvious reason why networks should not reach the same performance in this setting. Thus, we compared the training data to the UCF101 dataset [25] as one example of real-world data. While Chairs are similar to Sintel, UCF101 is fundamentally different (we refer to our supplemental material for the analysis): Sintel is an action movie and as such contains many fast movements that are difficult for traditional methods, while the displacements we see in the UCF101 dataset are much smaller, mostly smaller than 1 pixel. Thus, we created a dataset in the visual style of Chairs but with very small displacements and a displacement histogram much more like UCF101. We also added cases with a background that is homogeneous or just consists of color gradients. We call this dataset ChairsSDHom.

So the authors suspected the training data and synthesized a dedicated small-displacement dataset, ChairsSDHom (with a displacement histogram matching UCF101).

We fine-tuned our FlowNet2-CSS network for smaller displacements by further training the whole network stack on a mixture of Things3D and ChairsSDHom and by applying a non-linearity to the error to down-weight large displacements. We denote this network by FlowNet2-CSS-ft-sd. This improves results on small displacements and we found that this particular mixture does not sacrifice performance on large displacements. However, in case of subpixel motion, noise still remains a problem and we conjecture that the FlowNet architecture might in general not be perfect for such motion.

FlowNet2-CSS-ft-sd only denotes fine-tuning on small-displacement data; the model architecture is unchanged.

The accuracy improves, but not enough, so the architecture itself also has to change.
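The paper only says that a non-linearity is applied to the error; one plausible form is raising the per-pixel endpoint error to a power q < 1, which compresses large errors so that small, sub-pixel motions dominate the loss. The exponent and epsilon below are assumed values for illustration:

```python
def downweighted_epe(pred, gt, q=0.4, eps=0.01):
    """Mean endpoint error passed through a concave non-linearity.

    Raising the EPE to a power q < 1 compresses large errors, so pixels
    with large displacements contribute relatively less and training
    focuses on small, sub-pixel motion. q and eps are assumed values.
    """
    total = 0.0
    for (pu, pv), (gu, gv) in zip(pred, gt):
        epe = ((pu - gu) ** 2 + (pv - gv) ** 2) ** 0.5  # endpoint error
        total += (epe + eps) ** q                        # concave re-weighting
    return total / len(pred)
```

A 16x larger endpoint error contributes only about 3x more to this loss, instead of 16x more under a plain L1 penalty.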

10. Adding a dedicated network for small displacements

Therefore, we slightly modified the original FlowNetS architecture and removed the stride 2 in the first layer. We made the beginning of the network deeper by exchanging the 7×7 and 5×5 kernels in the beginning with multiple 3×3 kernels. Because noise tends to be a problem with small displacements, we add convolutions between the upconvolutions to obtain smoother estimates like in [18]. We denote the resulting architecture by FlowNet2-SD; see Figure 2.
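The kernel exchange preserves the receptive field: three stacked 3×3 convolutions cover the same 7×7 window while adding depth and non-linearity. A quick check, assuming stride-1 convolutions:

```python
def receptive_field(kernels):
    """Receptive field of a stack of stride-1 convolutions with the
    given square kernel sizes: each k x k layer widens it by k - 1."""
    rf = 1
    for k in kernels:
        rf += k - 1
    return rf

# Three 3x3 layers see the same 7x7 window as one 7x7 layer.
assert receptive_field([3, 3, 3]) == receptive_field([7])
```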

Note: the modifications are made on top of FlowNetS.

Note: in small-displacement scenes, noise is a big problem!

With FlowNet2-SD in place, the two networks need to be fused, which gives the final architecture shown in Figure 2.

Finally, we created a small network that fuses FlowNet2-CSS-ft-sd and FlowNet2-SD (see Figure 2). The fusion network receives the flows, the flow magnitudes and the errors in brightness after warping as input. It contracts the resolution twice by a factor of 2 and expands again. Contrary to the original FlowNet architecture it expands to the full resolution (I don't fully understand this part). We find that this produces crisp motion boundaries (so does this method handle motion boundaries better?) and performs well on small as well as on large displacements. We denote the final network as FlowNet2.
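The fusion network's per-pixel inputs, as described above, can be assembled directly from the flows and images. This sketch builds them for flattened single-channel inputs (a hypothetical helper illustrating the input construction only, not the fusion CNN itself):

```python
def fusion_inputs(flow, img1, img2_warped):
    """Assemble the per-pixel inputs of the fusion network: the flow,
    its magnitude, and the brightness error between the first image and
    the flow-warped second image. All arguments are flattened per-pixel
    lists; flow entries are (u, v) tuples.
    """
    feats = []
    for (u, v), a, b in zip(flow, img1, img2_warped):
        magnitude = (u * u + v * v) ** 0.5   # flow magnitude
        brightness_error = abs(a - b)        # photometric residual after warping
        feats.append((u, v, magnitude, brightness_error))
    return feats
```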

11. Comparison of results across models

