[paper reading] FCOS
GitHub: Notes of Classic Detection Papers
Update 2020.11.09: added the Use Yourself section (my own understanding and thoughts on this paper); see GitHub: Notes-of-Classic-Detection-Papers for details.
I originally wanted to publish these notes on GitHub, but GitHub does not render formulas,
so they ended up on CSDN, where the formatting is somewhat messy.
I strongly recommend downloading the source files from GitHub: Notes-of-Classic-Detection-Papers and reading them there; that gives the best reading experience!
Of course, if you find them useful, a star would be appreciated!
| topic | motivation | technique | key element | math | use yourself | relativity |
| --- | --- | --- | --- | --- | --- | --- |
| FCOS | Idea<br>Contribution | FCOS Architecture<br>Center-ness<br>Multi-Level FPN Prediction | Prediction Head<br>Training Sample & Label<br>Model Output<br>Feature Pyramid<br>Inference<br>Ablation Study<br>FCN & Detection<br>FCOS vs. YOLO v1 | Symbol Definition<br>Loss Function<br>Center-ness<br>Remap of Feature & Image | …… | Related Work |
Table of Contents
- [paper reading] FCOS
- Motivation
- Idea
- Contribution
- Techniques
- FCOS Architecture
- Advantage
- Center-ness
- Idea
- Implement
- Multi-Level FPN Prediction
- Key Elements
- Prediction Head
- Classification Branch
- Regression Branch
- Shared Head
- Training Sample & Label
- Training Sample
- Label Pos/Neg
- Model Output
- 4D Vector $\pmb{t}^*$
- C-Dimension Vector $\pmb{p}$
- Feature Pyramid
- Inference
- Ablation Study
- Multi-Level FPN Prediction
- Ambiguity Samples
- With or Without Center-ness
- FCN & Detection
- FCOS vs. YOLO v1
- Math
- Symbol Definition
- Loss Function
- Center-ness
- Remap of Feature & Image
- Use Yourself
- Related Work
- Drawbacks of Anchor
- DenseBox-Based
- Anchor-Based Detector
- YOLO v1
- Idea
- Drawbacks of Points near Center
- CornerNet
- Steps
- Drawbacks of Corner
Motivation
Idea
Performs object detection by per-pixel prediction (implemented with fully convolutional networks).
Contribution
Reformulates detection as per-pixel prediction.
Uses multi-level prediction to:
- improve recall
- resolve the ambiguity caused by overlapping bounding boxes

Center-ness branch:
suppresses low-quality predictions inside bounding boxes.
Techniques
FCOS Architecture
The default backbone is ResNet-50.
Advantage
See [Drawbacks of Anchor](#Drawbacks of Anchor) for details.
Unifies detection with FCN-solvable tasks (e.g. semantic segmentation).
Ideas from other tasks can thus be transferred to detection (idea re-use).
Anchor-free & proposal-free:
- eliminates the complex computation associated with anchors (e.g. IoU)
- achieves faster training & testing and a smaller training memory footprint
Reaches SOTA among one-stage detectors and can be used to replace RPN.
Transfers quickly to other vision tasks (e.g. instance segmentation, key-point detection).
Center-ness
Center-ness is predicted for every location.
It improves performance considerably.
Idea
Locations far from the center produce a large number of low-quality predicted bounding boxes.
FCOS introduces center-ness to suppress (i.e. down-weight) the low-quality bounding boxes far from the center.
Implement
A center-ness branch is introduced to predict the center-ness of each location.
At test time, the score is computed as:

$$\text{Final Score} = \text{Classification Score} \times \text{Center-ness}$$

NMS then filters out the suppressed bounding boxes.
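The test-time score fusion can be sketched in plain Python (the scores below are made up for illustration; this is not the paper's code):

```python
# Hypothetical per-location outputs: classification scores and center-ness.
cls_scores = [0.9, 0.8, 0.7]
centerness = [1.0, 0.25, 0.04]

# Final score = classification score * center-ness:
# predictions far from an object center are down-weighted before NMS.
final_scores = [c * ctr for c, ctr in zip(cls_scores, centerness)]
```

Because center-ness lies in [0, 1], it can only down-weight a classification score, never raise it, so low-quality boxes fall below the NMS ranking of well-centered ones.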
Multi-Level FPN Prediction
Multi-Level FPN Prediction solves two problems:

Best Possible Recall
Raises the Best Possible Recall of FCOS to the SOTA level.

Ambiguity of Ground-Truth Box Overlap
Resolves the ambiguity caused by overlapping ground-truth boxes, on par with anchor-based detectors.
Reason: in the vast majority of cases, overlapping objects differ greatly in scale.

The idea: dispatch locations to different feature levels according to their regression distances.
Specifically:
- compute the regression targets;
- filter out positive samples by each feature level's maximum regression distance:

$$m_{i-1} < \max(l^*, t^*, r^*, b^*) < m_i$$

where $m_i$ is the maximum distance that feature level $i$ is allowed to regress, with

$$\{m_2, m_3, m_4, m_5, m_6, m_7\} = \{0, 64, 128, 256, 512, \infty\}$$

Compared with the original multi-scale assignment (e.g. SSD), FCOS "implicitly" assigns objects of different scales to different feature levels (small objects to shallow levels, large objects to deep levels).
I regard this as a more refined form of hand-crafted design.
If a location falls into two ground-truth boxes (i.e. is ambiguous), the smaller box is chosen for regression (a bias toward small objects).
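The level-assignment rule above can be sketched in a few lines of Python (the thresholds follow the values listed; the level names P3..P7 and the boundary convention are assumptions of this sketch):

```python
# Maximum regression distance per level: m2..m7 (m7 = infinity).
M = [0, 64, 128, 256, 512, float("inf")]
LEVELS = ["P3", "P4", "P5", "P6", "P7"]

def assign_level(l, t, r, b):
    """Dispatch a location to the feature level whose range
    (m_{i-1}, m_i] contains max(l*, t*, r*, b*)."""
    d = max(l, t, r, b)
    for i, level in enumerate(LEVELS):
        if M[i] < d <= M[i + 1]:
            return level
    return None  # d == 0: degenerate target

print(assign_level(10, 20, 30, 40))   # P3 (max distance 40 <= 64)
print(assign_level(100, 50, 90, 30))  # P4 (max distance 100 in (64, 128])
print(assign_level(600, 20, 10, 5))   # P7 (max distance 600 > 512)
```

Since m7 is infinity, every location with a sufficiently large regression distance still gets a level, which is what keeps the recall high.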
Key Elements
Prediction Head
Classification Branch
Regression Branch
Since the regression targets are always positive, $\text{exp}(s_i x)$ is appended on top of the regression branch (see [Shared Head](#Shared Head) for details).
Shared Head
The head is shared across feature levels.
Advantages:
- parameter efficient
- improves performance
Drawback:
Because of [Multi-Level FPN Prediction](#Multi-Level FPN Prediction), different feature levels have different output ranges (e.g. [0, 64] for $P_3$, [64, 128] for $P_4$).
To let identical heads work on different feature levels:

$$\text{exp}(x) \rightarrow \text{exp}(s_i x)$$

- $s_i$: a trainable scalar that automatically adjusts the base of exp
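The effect of the per-level scalar can be illustrated numerically (a sketch, not the training code; the $s_i$ values here are made up, whereas in FCOS they are learned):

```python
import math

def regress_distance(x, s_i):
    # exp(s_i * x): s_i is a per-level trainable scalar that lets
    # an identical head emit a different output range per level.
    return math.exp(s_i * x)

# Same raw head output x, different learned scalars per level:
x = 3.0
print(regress_distance(x, 1.0))  # ~20.1  (shallow level, small distances)
print(regress_distance(x, 2.0))  # ~403.4 (deep level, large distances)
```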
Training Sample & Label
Training Sample
Locations are used directly as training samples (similar to FCN-based semantic segmentation).
Label Pos/Neg
A location $(x, y)$ is a positive sample iff:
- location $(x, y)$ falls inside a ground-truth box
- the class label of location $(x, y)$ equals the class of that ground-truth box

FCOS trains with as many foreground samples as possible (i.e. all locations inside a ground-truth box),
unlike anchor-based detectors, which take only anchors with a high IoU with a ground-truth box as positives,
and unlike [CenterNet (Object as Points)](./[paper reading] CenterNet (Object as Points).md), which treats only the geometric center as positive.
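The geometric part of the positive/negative rule is a point-in-box test, which can be sketched as follows (box format (x0, y0, x1, y1) assumed; the class-matching condition is omitted for brevity):

```python
def is_positive(x, y, box):
    """A location (x, y) is positive iff it falls inside the
    ground-truth box; every interior location is a foreground sample."""
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

box = (10, 10, 50, 30)
print(is_positive(20, 20, box))  # True  (inside the box, not just its center)
print(is_positive(60, 20, box))  # False (outside)
```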
Model Output
For each location on each level's feature map, the model produces the following outputs:
4D Vector $\pmb{t}^*$

$$\pmb{t}^* = (l^*, t^*, r^*, b^*)$$

It describes the relative offsets from the location to the four sides of the bounding box.

Note:
- FCOS computes a target for every location inside a ground-truth box (not only the geometric center), so four values are needed to recover the boundary.
- [CenterNet (Object as Points)](./[paper reading] CenterNet (Object as Points).md), by contrast, predicts only at the geometric center, so two values suffice.
- Object overlap is mostly resolved by [Multi-Level FPN Prediction](#Multi-Level FPN Prediction); if a location still falls into multiple boxes, the smaller object is preferred (the bounding box with the smallest area is chosen).
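Encoding the 4D target for a location, and decoding it back into a box, can be sketched as (box format (x0, y0, x1, y1) assumed; not the paper's code):

```python
def encode(x, y, box):
    """l*, t*, r*, b*: distances from (x, y) to the four box sides."""
    x0, y0, x1, y1 = box
    return (x - x0, y - y0, x1 - x, y1 - y)

def decode(x, y, t):
    """Recover the box from a location and its 4D offset vector."""
    l, tp, r, b = t
    return (x - l, y - tp, x + r, y + b)

box = (10, 20, 50, 60)
t_star = encode(30, 40, box)
print(t_star)                  # (20, 20, 20, 20)
print(decode(30, 40, t_star))  # (10, 20, 50, 60)
```

The round trip shows why four values are needed: any interior location can recover the full boundary, not just a center point.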
C-Dimension Vector $\pmb{p}$
The experiments use $C$ binary classifiers rather than a single $C$-class classifier.
Feature Pyramid
Five levels of feature maps are defined: $\{P_3, P_4, P_5, P_6, P_7\}$, with strides $\{8, 16, 32, 64, 128\}$.
- $\{P_3, P_4, P_5\}$: backbone feature maps $\{C_3, C_4, C_5\}$ + 1×1 convolution
- $\{P_6, P_7\}$: obtained by applying a stride-2 convolution to $P_5$ and $P_6$ respectively
Inference
Feed the image into the network; at each location of feature map $F_i$, obtain:
- the classification score $\pmb{p}_{x,y}$
- the regression prediction $\pmb{t}_{x,y}$

Select the locations with $p_{x,y} > 0.05$ as positive samples,
then decode them into bounding-box coordinates.
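The inference steps above can be sketched as (pure Python; the data and the default threshold value are illustrative assumptions):

```python
def infer(locations, scores, offsets, thresh=0.05):
    """Keep locations whose classification score exceeds `thresh`
    and decode their (l, t, r, b) offsets into scored boxes."""
    boxes = []
    for (x, y), p, (l, t, r, b) in zip(locations, scores, offsets):
        if p > thresh:
            boxes.append((x - l, y - t, x + r, y + b, p))
    return boxes

locs = [(30, 40), (100, 100)]
scores = [0.8, 0.01]           # second location falls below the threshold
offs = [(20, 20, 20, 20), (5, 5, 5, 5)]
print(infer(locs, scores, offs))  # [(10, 20, 50, 60, 0.8)]
```

In a full pipeline the surviving boxes would then be re-scored with center-ness and passed through NMS.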
Ablation Study
Multi-Level FPN Prediction
Conclusions:
- Best Possible Recall is not a weakness of FCOS
- Multi-Level FPN Prediction further improves the Best Possible Recall
Ambiguity Samples
Conclusions:
Multi-Level FPN Prediction solves the ambiguous-sample problem:
most overlap ambiguity is dispatched to different feature levels, and only very few ambiguous locations remain.
With or Without Center-ness
Center-ness suppresses low-quality bounding boxes far from the center, which improves AP substantially.
Center-ness must have its own separate branch.
FCN & Detection
FCN is mainly used for dense prediction.
In fact, the fundamental vision tasks can all be unified into one single framework.
The use of anchors, however, makes the detection task deviate from the neat fully convolutional per-pixel prediction framework.
FCOS vs. YOLO v1
While YOLO v1 uses only points near the center for prediction, FCOS uses all points inside the ground-truth box.
The resulting low-quality bounding boxes are suppressed by center-ness.
This lets FCOS reach a recall comparable to anchor-based detectors.
Math
Symbol Definition
$F_i \in \mathbb{R}^{H \times W \times C}$: the feature map at layer $i$ of the backbone
$s$: the total stride up to that layer
$\{B_i\}$: the ground-truth boxes

$$B_i = (x_0^{(i)}, y_0^{(i)}, x_1^{(i)}, y_1^{(i)}, c^{(i)}) \in \mathbb{R}^4 \times \{1, 2, \ldots, C\}$$

- $(x_0^{(i)}, y_0^{(i)})$: top-left corner coordinate
- $(x_1^{(i)}, y_1^{(i)})$: bottom-right corner coordinate
- $c^{(i)}$: class of the object in the bounding box
- $C$: number of classes
Loss Function
- $\lambda = 1$

The total loss additionally includes a center-ness loss (not shown above), which is binary cross-entropy.
The loss is computed over the locations of the feature map. Specifically:
- the classification loss is computed over all locations (positive & negative);
- the regression loss is computed over positive locations only.

$\mathbb{1}_{\{c_{x,y}^* > 0\}} = 1$ if $c_{x,y}^* > 0$, and $0$ otherwise.
Center-ness
$$\text{centerness}^* = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}}$$

- center-ness reflects the normalized distance from a location to the center of its object
- the square root is used to slow down the decay of center-ness
- center-ness ranges over $[0, 1]$
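The formula transcribes directly into a few lines of Python:

```python
import math

def centerness(l, t, r, b):
    """Normalized distance to the center: 1 at the exact center,
    decaying toward 0 near the box boundary (sqrt slows the decay)."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

print(centerness(2, 2, 2, 2))  # 1.0  (exact center: l* = r* and t* = b*)
print(centerness(1, 1, 3, 3))  # ~0.333 (off-center location)
```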
Remap of Feature & Image
A location $(x, y)$ on the feature map maps back to the image as:

$$\Big( \big\lfloor \tfrac{s}{2} \big\rfloor + xs,\ \big\lfloor \tfrac{s}{2} \big\rfloor + ys \Big)$$

This position lies near the center of the receptive field of location $(x, y)$.
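The remapping can be checked with a short snippet (pure Python):

```python
def remap(x, y, s):
    """Map a feature-map location (x, y) at total stride s back to
    image coordinates, near the center of its receptive field."""
    return (s // 2 + x * s, s // 2 + y * s)

# Stride-8 feature map: location (3, 5) lands at image pixel (28, 44).
print(remap(3, 5, 8))  # (28, 44)
```

The floor(s/2) offset is what shifts each mapped point from the top-left corner of its stride cell toward the cell center.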
Use Yourself
Related Work
Drawbacks of Anchor
Detection performance is sensitive to hyper-parameters such as anchor size, aspect ratio, and number;
that is, anchors require careful manual design.
A large number of anchors is needed to reach a high recall rate,
which causes an extreme positive/negative sample imbalance during training.
Anchors come with complex computation,
e.g. computing IoU.
Anchor sizes and aspect ratios are pre-defined, so they cannot adapt to shape variations (especially for small objects).
Moreover, this pre-defined form also hurts the model's generalization; in other words, the designed anchors are task-specific.
DenseBox-Based
Crops and resizes the image to handle bounding boxes of different sizes,
so DenseBox has to run detection on an image pyramid,
which contradicts the FCN philosophy of computing all convolutions only once.
Works only in specific domains and struggles with overlapping objects,
because it cannot decide which object a given pixel should regress to.
Its recall is relatively low.
Anchor-Based Detector
Origin:
sliding-window and proposal-based detectors.
The essence of anchors:
pre-defined sliding windows (proposals) + offset regression.
The role of anchors:
training samples for the detector.
Typical models:
- Faster R-CNN
- SSD
- YOLO v2
YOLO v1
YOLO v1 is a typical anchor-free detector.
Idea
YOLO v1 uses points near the center to predict bounding boxes:
the grid cell that an object's center falls into is responsible for predicting that object's bounding box.
The rationale: points near the center can produce higher-quality detections.
Drawbacks of Points near Center
Using only the points near the center leads to low recall.
This is exactly why YOLO v2 went back to anchors.
CornerNet
CornerNet is a typical anchor-free detector.
Steps
- corner detection
- corner grouping
- post-processing
Drawbacks of Corner
Its post-processing is complex and requires an additional distance metric (for grouping corners).