Advances in adversarial attacks and defenses in computer vision: A survey (paper walkthrough)

Abstract

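Deep learning (DL) is now widely used in computer vision because it can accurately solve complex problems. However, DL is known to be vulnerable to adversarial attacks, which alter a model's output by adding visually imperceptible perturbations to images or videos. Since this phenomenon was discovered in 2013 [1], the research direction has attracted a great deal of attention. In [2], we reviewed the progress in adversarial attacks on deep learning (and defenses against them) up to 2018. That progress inspired new directions in the field, and many new attack methods have emerged along them that are noticeably more mature than the previous generation. As a sequel to [2], this survey therefore focuses on advances since 2018. To ensure authenticity, we mainly consider contributions published in prestigious computer vision and machine learning venues. Besides a comprehensive review, the article also provides precise definitions of the technical terms used in this domain. Finally, it discusses the challenges and future outlook of the field.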

1 Introduction

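Deep learning (DL) [3] can build accurate, complex mathematical models from large-scale datasets. In recent years it has delivered numerous breakthroughs in machine intelligence. From analyzing DNA mutations [4] to reconstructing brain circuits [5] and exploring cell data [6], deep learning methods are advancing our understanding of many frontier scientific problems. Consequently, multiple sub-fields of machine intelligence are rapidly adopting deep learning as an effective problem-solving tool. Along with speech recognition [7] and natural language processing [8], computer vision is one of the sub-fields that currently relies heavily on deep learning.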

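The rise of deep learning in computer vision was triggered by the seminal work of Krizhevsky et al. [9] in 2012, which reported a record-breaking performance improvement on an image recognition task [10] using a convolutional neural network (CNN) [11]. Since [9], the computer vision community has contributed significantly to deep learning research, leading to increasingly powerful neural networks [12], [13], [14] whose architectures stack many layers, which establishes the essence of 'deep' learning. Advances in computer vision have also enabled deep learning to solve complex problems in artificial intelligence (AI). For instance, one of the crowning achievements of modern AI, tabula rasa learning [15], owes much to residual learning [12], which originated in computer vision.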

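Owing to the excellent performance of deep learning [15], computer vision based AI is considered mature enough to be deployed in safety- and security-critical systems. Self-driving cars [18], facial recognition in ATMs [19], and face recognition on mobile devices [20] are some examples of real-world deployment. However, the recently discovered vulnerability of deep learning to adversarial attacks [1] poses a serious challenge to the security of such large-scale use.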

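Szegedy et al. [1] discovered that the predictions of deep neural networks can be changed by extremely small perturbations of the input. For images, these perturbations can be restricted so that they are imperceptible to the human eye, as shown in Fig. 1. Initially, adversarial attacks were applied only to image classification [1]. They have since been applied to a wide range of computer vision tasks, e.g. semantic segmentation [27], [28], object detection [29], [30], and object tracking [31], [32]. This literature highlights many properties of adversarial attacks that make them a real threat to the practical deployment of deep learning. For instance, attacked models are frequently observed to output wrong results for the manipulated images with very high confidence [2], [17]. The same perturbation can often fool multiple models [33], [34]. The literature also reports universal perturbations, which can be added to 'any' image to make a model output a wrong result with high confidence [35], [36]. These facts have serious implications for deploying deep learning in security-sensitive applications.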

Given these significant properties, adversarial perturbations have received a great deal of attention over the past five years. The survey [2] covers work up to 2018, and most of it can be regarded as first-generation techniques that explore the core algorithms and techniques to fool deep learning or defend it against adversarial attacks. Some of those algorithms have inspired streams of follow-up methods that further refine and adapt the core attack and defense techniques. These second-generation methods also focus more on other vision tasks instead of just the classification problem, which was the main topic of interest in early contributions in this direction.

2 Definition of Terms

3 Adversarial Attacks: The formal problem

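Let $\mathcal{M}(.)$ denote the target deep model, whose mapping is $\mathcal{M}(I):I\to l$, where $I\in\mathbb{R}^m$ is the input image and $l\in\mathbb{Z}^+$ is the model output. The goal of an adversarial attack is to find a signal $\rho\in\mathbb{R}^m$ such that $\mathcal{M}(I+\rho)\to\tilde{l}$, where $\tilde{l}\ne l$. To keep the change to the original image imperceptible to humans, the perturbation $\rho$ is norm-bounded, for example $\Vert\rho\Vert_p<\eta$, where $\Vert.\Vert_p$ denotes the $l_p$-norm of a vector and $\eta$ is a pre-defined scalar. More precisely, the overall attack can be described as: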

$\mathcal{M}(I+\rho)\to\tilde{l}\quad \text{s.t.}\ \tilde{l}\ne l,\ \Vert\rho\Vert_p<\eta\quad\quad\quad\quad\quad\quad(1)$

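This formulation represents the most common understanding of adversarial attacks, but it does not cover all of them. For example, unrestricted adversarial examples [43], [44], where the attacker is neither limited to manipulating the original image (e.g. the image itself may be transformed) nor to staying below a particular norm value, cannot be described by the constraint in (1). Similarly, adding localized but perceptible adversarial perturbations to an image is not captured by (1). To make the definition more general, we therefore consider the following constraint: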

$\mathcal{M}(\tilde{I})\to\tilde{l}\quad \text{s.t.}\ \tilde{l}\ne l,\ \tilde{I}\in\mathcal{S}_I,\ \mathcal{M}(I\sim\{\mathcal{S}_I-\tilde{I}\})=l\quad\quad\quad\quad\quad\quad(2)$

where $\mathcal{S}_I$ is the set of images perceived as clean or allowed by humans to produce the desired output $l$. For the sake of brevity, we assume a single adversarial sample in $\mathcal{S}_I$ in (2). The conventional view of additive perturbations (in Eq. 1) becomes a special case of this constraint where $\tilde{I}=I+\rho$ and $\tilde{I}\in\mathcal{S}_I$ is ensured by restricting the perturbation norm. Since (2) does not deal with $\rho$ explicitly, one must articulate any additional constraint over $\rho$ to specify an attack under (2), as we have done above for the imperceptible perturbation.

Adversarial examples for deep visual models were originally
discovered for the image classification task [1], where additive
perturbations were used to launch the attack. Consequently,
a vast majority of the existing attacks leverage some form
of the additive perturbations to manipulate the model output.
Moreover, image classifiers still remain the most popular
target models for attacks. This trend partially owes to the fact
that classification is one of the fundamental tasks in pattern
recognition. Thus, it is important to explicitly, though briefly,
discuss the broad concept of adversarial attacks on deep image
classifiers under the above formulation.

For image classifiers, the output is a class label $l\in\mathbb{Z}^+$. The nature of the task makes it more interesting to change this label to a pre-specified incorrect label $\tilde{l}\in\mathbb{Z}^+$ by the attack, which motivates targeted adversarial attacks on classifiers. A non-targeted attack on a classifier can also be considered a special case of the targeted attacks, where $\tilde{l}$ is chosen at random. Whereas image-specific attacks lead to misclassification of individual images, it is also possible to compute additive perturbations $\rho$ that cause incorrect label predictions on a large number of images. Such universal perturbations were first reported by Moosavi-Dezfooli et al. [35]. Here, we discuss the notions of image-specific vs universal, and targeted vs non-targeted attacks in the context of classifiers for a clear understanding of the text that follows. Nevertheless, these concepts are more general and can also be applied to other computer vision tasks.

4 First-generation attacks

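First-generation adversarial attacks comprise the influential attacks introduced before 2018 that inspired many follow-up methods. They focus mainly on the basic algorithms for computing adversarial images and are evaluated on classification tasks.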

4.1 L-BFGS attack

Szegedy et al. [1] first discovered the vulnerability of deep visual models to adversarial perturbations by solving the following optimization problem:

$\min\limits_{\rho}\Vert\rho\Vert_2\quad \text{s.t.}\ \mathcal{M}(I+\rho)=\tilde{l};\ I+\rho\in[0,1]^m\quad\quad\quad\quad\quad(3)$

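Problem (3) is hard to solve, so Szegedy et al. computed an approximate solution with the Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm, a quasi-Newton method [46]. To solve (3), they relaxed it with a Lagrangian-style multiplier $c$: the label constraint $\mathcal{M}(I+\rho)=\tilde{l}$ is replaced by a classifier loss term in the objective, while $c$ weights the norm term. The solution is obtained by finding the smallest $c>0$ for which the minimizer $\rho$ of problem (4) satisfies $\mathcal{M}(I+\rho)=\tilde{l}$.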

$\min\limits_{\rho}\quad c|\rho|+\mathcal{L}(I+\rho,\tilde{l})\quad \text{s.t.}\ I+\rho\in[0,1]^m\quad\quad\quad\quad(4)$

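Here $\mathcal{L}(.,.)$ denotes the classifier loss. Adding the perturbation obtained from (4) to a clean image leaves the change imperceptible to the human visual system, as shown in Fig. 3.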

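This discovery had a far-reaching impact on vision research. Until then, deep model features were believed to approximate well the perceptual differences between images as measured by Euclidean distance. The finding that perturbations with a very small Euclidean norm can change a network's output overturned this view. Szegedy et al. also showed that their adversarial examples transfer well across different deep visual classifiers.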

4.2 FGSM attack

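Szegedy et al. [1] were also the first to observe that adding adversarial images to a model's training set improves its robustness to adversarial examples. This observation is the basis of what the literature calls adversarial training. However, solving (4) for a large number of images is prohibitively expensive. This motivated the Fast Gradient Sign Method (FGSM) [17], which computes adversarial perturbations efficiently: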

$\rho=\epsilon\ \text{sign}(\nabla_I\mathcal{J}_{\theta}(I,l))\quad\quad\quad\quad\quad(5)$

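Here $\mathcal{J}_{\theta}(.,.)$ is the cost of the model with parameters $\theta$, $\nabla_I$ computes the gradient of the cost with respect to $I$, $\text{sign}(.)$ is the sign function applied element-wise to a vector, and $\epsilon$ is a pre-fixed scalar that controls the perceptibility of the perturbation. The final adversarial image is computed as $\tilde{I}=I+\rho$.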
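As a minimal sketch of Eq. (5), the snippet below applies one FGSM step to a toy logistic-regression classifier written in plain Python. The model, its weights `w` and `b`, the input `x`, and the helper names are all illustrative assumptions and do not come from the survey; a real attack would obtain the input gradient from a deep network through automatic differentiation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_prob(w, b, x):
    """Probability of class 1 under a toy logistic model (stand-in for M)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, label, eps):
    """Eq. (5): rho = eps * sign(grad_I J(I, l)); returns I + rho."""
    # For a logistic model with cross-entropy cost, dJ/dx = (p - label) * w.
    p = predict_prob(w, b, x)
    grad = [(p - label) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical weights and input, chosen only for illustration.
w, b = [2.0, -1.0, 0.5], 0.0
x, label = [0.2, 0.1, 0.3], 1
x_adv = fgsm(w, b, x, label, eps=0.2)
print(predict_prob(w, b, x) > 0.5, predict_prob(w, b, x_adv) > 0.5)
```

On this toy input the single step moves every coordinate by exactly `eps` and flips the prediction from class 1 to class 0, which illustrates why the method is cheap: it needs one gradient evaluation per image.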

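FGSM is a single-step, gradient-based attack. It favors computational efficiency over a high fooling rate. Goodfellow et al. [17] also used the attack to support their linearity hypothesis, which attributes the vulnerability of neural networks to adversarial perturbations to their linear behavior in high-dimensional spaces (induced by activations such as ReLU). They further argued that this linear behavior is one of the main reasons attacks transfer between models, since network architectures commonly encourage it to make training efficient. The linearity hypothesis ran counter to the then-popular view that the high 'non-linearity' of complex networks causes adversarial vulnerability.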

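FGSM [17] is one of the most influential attacks, especially in the white-box setting. Its core idea, performing gradient ascent on the model loss, is the starting point of many attacks. For example, the Fast Gradient Value Method (FGVM) of Rozsa et al. [47] essentially removes the sign function in (5). Similarly, dropping the sign function, Miyato et al. [48] normalized the gradient by its $l_2$-norm to launch the attack. Kurakin et al. [49] also analyzed normalization with the $l_{\infty}$-norm and extended FGSM to I-FGSM, its iterative variant. Dong et al. [50] later strengthened I-FGSM by introducing momentum into the iterations, giving Momentum Iterative (MI-)FGSM. Diverse Input I-FGSM (DI$^2$-FGSM) [51] is another attack built directly on FGSM. Its core idea is to diversify the input used in each iteration through image transformations (e.g. random resizing or padding applied with a fixed probability). This diversification is claimed to improve transferability in the black-box setting. The authors subsequently combined it with momentum to obtain M-DI$^2$-FGSM.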

4.3 BIM and ILCM attacks

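The Basic Iterative Method (BIM) [49] is another influential attack, and the same paper introduced physical-world attacks. The attack is essentially an iterative FGSM that computes the adversarial image as: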

$\tilde{I}_{i+1}=Clip_{\epsilon}\{\tilde{I}_i+\alpha\ \text{sign}(\nabla_I\mathcal{J}_{\theta}(\tilde{I}_i,l))\}\quad\quad\quad\quad(6)$

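Here $i$ denotes the $i$-th iteration, $Clip_{\epsilon}\{.\}$ clips the perturbation to the $\epsilon$-ball, and $\alpha$ is a pre-defined scalar (the step size). Kurakin et al. [49] fooled an Inception model trained on ImageNet with printed adversarial images, as shown in Fig. 4. This idea inspired later physical-world attacks. The notion of targeted attacks can be traced back to [49] and [53], which showed that modifying (6) by replacing the addition with a subtraction and $l$ with $\tilde{l}$ maximizes the model's predicted probability for the target class: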

$\tilde{I}_{i+1}=Clip_{\epsilon}\{\tilde{I}_i-\alpha\ \text{sign}(\nabla_I\mathcal{J}_{\theta}(\tilde{I}_i,\tilde{l}))\}\quad\quad\quad\quad\quad(7)$

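For a classifier trained with cross-entropy loss, solving (7) maximizes the model's confidence in $\tilde{l}$ for the image $\tilde{I}$. The authors originally suggested using the least likely class of the clean image (as predicted by the model) as the label $\tilde{l}$. The technique is therefore also known as the Iterative Least-likely Class Method (ILCM).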
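The two update rules (6) and (7) differ only in the sign of the step and in the label fed to the loss, which the following sketch makes explicit. It assumes a toy linear softmax classifier with hypothetical weights `W`, `b` and a two-pixel "image" `x0`. The clipping helper additionally keeps pixels in $[0,1]$, which is an assumption of this sketch rather than something stated in the text.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W, b, x):
    """Class probabilities of a linear softmax model (toy stand-in for M)."""
    return softmax([sum(wi * xi for wi, xi in zip(row, x)) + bk
                    for row, bk in zip(W, b)])

def input_grad(W, b, x, label):
    """Gradient of the cross-entropy loss w.r.t. x: W^T (p - onehot(label))."""
    p = predict(W, b, x)
    return [sum((p[k] - (k == label)) * W[k][i] for k in range(len(W)))
            for i in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def argmax(v):
    return max(range(len(v)), key=lambda k: v[k])

def clip_eps(x_new, x0, eps):
    """Keep each pixel within eps of the clean pixel and inside [0, 1]."""
    return [min(max(v, c - eps, 0.0), c + eps, 1.0) for v, c in zip(x_new, x0)]

def iterative_attack(W, b, x0, label, eps, alpha, steps, targeted=False):
    """Eq. (6) if targeted is False (ascend the loss of the true label),
    Eq. (7) if targeted is True (descend the loss of the target label)."""
    step = -alpha if targeted else alpha
    x = list(x0)
    for _ in range(steps):
        g = input_grad(W, b, x, label)
        x = clip_eps([xi + step * sign(gi) for xi, gi in zip(x, g)], x0, eps)
    return x

# Hypothetical 3-class model on a 2-pixel input.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.8]
x0 = [0.6, 0.4]
clean_probs = predict(W, b, x0)
true_label = argmax(clean_probs)
least_likely = min(range(len(clean_probs)), key=lambda k: clean_probs[k])
x_bim = iterative_attack(W, b, x0, true_label, eps=0.3, alpha=0.05, steps=10)
x_ilcm = iterative_attack(W, b, x0, least_likely, eps=0.3, alpha=0.05,
                          steps=10, targeted=True)
```

In this toy setup the untargeted run ends in some class other than the true one, while the ILCM run is steered specifically to the class the clean model found least likely.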

4.4 PGD attack

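The Projected Gradient Descent (PGD) attack is widely regarded as one of the strongest attacks, and the work of Madry et al. [54] is generally seen as its origin. However, Madry et al. also treat iterative FGSM ([49], [53]) as a PGD method, because projected gradient descent is a standard optimization technique that projects the solution back onto a ball after each gradient step. In particular, the authors view iterative FGSM as $l_{\infty}$-bounded PGD, where the clipping operation (i.e. the projection) bounds the $l_{\infty}$-norm of the perturbation. The main contribution of [54] is to view the adversarial robustness of deep models through the lens of optimization, formulating adversarial training as the following min-max problem: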

$\min\limits_{\theta}\rho(\theta),\quad \text{s.t.}\quad \rho(\theta)=\mathbb{E}_{(I,l)\sim\mathcal{I}}[\max\limits_{\rho}\mathcal{L}(\theta,\tilde{I},l)]\quad\quad\quad\quad(8)$

Here $\mathbb{E}[.]$ denotes expectation and $\mathcal{I}$ is the distribution over input images. From this perspective, the authors identify PGD as possibly the strongest attack.

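From this viewpoint, the I-FGSM variants discussed earlier can also be regarded as variants of PGD, and conversely PGD can be related to FGSM. However, an important finding of Madry et al. [54] makes PGD more attractive than FGSM for adversarial training: the 'label leaking' phenomenon, which was observed in FGSM-based adversarial training [53] but does not occur with PGD-based adversarial training. Put simply, label leaking occurs when an adversarially trained model ends up more accurate on adversarial images than on clean images. FGSM generates only a limited set of adversarial examples, so adversarial training overfits to them, which leads to the leak. Since a main purpose of FGSM is to compute adversarial examples for adversarial training, avoiding label leaking is a notable advantage of PGD. Madry et al. [54] also showed that adversarial training with PGD automatically makes the model robust to weaker attacks. As an iterative method, however, PGD is computationally expensive.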
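The projection view described in this subsection can be sketched with two small operators: clipping is the projection onto an $l_{\infty}$ ball, and rescaling is the projection onto an $l_2$ ball. The function names and the generic `pgd_step` helper below are illustrative assumptions. The step direction passed in would typically be the sign of the gradient for $l_{\infty}$ and a normalized gradient for $l_2$.

```python
import math

def project_linf(rho, eps):
    """Projection onto the l_inf ball: element-wise clipping to [-eps, eps]."""
    return [min(max(r, -eps), eps) for r in rho]

def project_l2(rho, eps):
    """Projection onto the l_2 ball: rescale only if the norm exceeds eps."""
    norm = math.sqrt(sum(r * r for r in rho))
    if norm <= eps:
        return list(rho)
    return [r * eps / norm for r in rho]

def pgd_step(rho, direction, alpha, eps, project):
    """One ascent step on the perturbation, followed by projection."""
    return project([r + alpha * d for r, d in zip(rho, direction)], eps)

# Hypothetical perturbation and ascent direction for illustration.
rho = [0.08, -0.02]
rho_linf = pgd_step(rho, [1.0, -1.0], alpha=0.05, eps=0.1, project=project_linf)
rho_l2 = pgd_step(rho, [1.0, -1.0], alpha=0.05, eps=0.1, project=project_l2)
```

With `project_linf`, the loop reduces to the clipped update of iterative FGSM, which is the relationship between the two methods that the text points out.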

4.5 JSMA and One-pixel attacks

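Most early attacks perturb a clean image globally while bounding the $l_2$- or $l_{\infty}$-norm of the perturbation. The Jacobian-based Saliency Map Attack (JSMA) [55] and the One-pixel attack [56] differ in that they confine the perturbation to a small region of the image. Unlike FGSM and its variants, JSMA does not estimate the perturbation from the backward gradients of the network. Instead it uses the forward gradient of the network $\mathcal{M}(.)$: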

$\nabla\mathcal{M}(I)=\frac{\partial\mathcal{M}(I)}{\partial I}=\left[\frac{\partial\mathcal{M}_j(I)}{\partial x_i}\right]\quad\quad\quad\quad\quad(9)$

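Here $j\in\{1,\dots,M\}$ indexes the outputs of the $M$-dimensional function represented by $\mathcal{M}(.)$, and $i\in\{1,\dots,N\}$ indexes the $N$-dimensional vector representation of $I$, whose $i$-th element is denoted $x_i$. In essence, (9) computes the Jacobian of the function learned by the network.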
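To make the Jacobian in (9) concrete, the sketch below estimates it by central differences for a toy softmax classifier and then picks the input element whose increase raises a chosen class the most. This is only an illustration under stated assumptions: the model and its weights are hypothetical, a real implementation would compute the derivative analytically through the network, and the pixel selection here is a simplification, not the full saliency-map rule of [55].

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def model(x):
    """Toy M(.): linear softmax classifier with hypothetical weights."""
    W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    return softmax([sum(wi * xi for wi, xi in zip(row, x)) for row in W])

def forward_derivative(f, x, h=1e-5):
    """Central-difference estimate of J[j][i] = d f_j(x) / d x_i, as in Eq. (9)."""
    m = len(f(x))
    J = [[0.0] * len(x) for _ in range(m)]
    for i in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[i] += h
        xm[i] -= h
        fp, fm = f(xp), f(xm)
        for j in range(m):
            J[j][i] = (fp[j] - fm[j]) / (2 * h)
    return J

x = [0.6, 0.4]
J = forward_derivative(model, x)
target = 1
# Simplified choice: the input element with the largest positive effect on the target.
best_pixel = max(range(len(x)), key=lambda i: J[target][i])
```

Because the outputs of a softmax sum to one, each column of this Jacobian sums to approximately zero, which is a convenient sanity check on the estimate.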

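Su et al. [56] showed that deep models can be fooled even when the perturbation is confined to a single pixel. However, this mostly works for small images, e.g. $64\times 64$. They used Differential Evolution (DE) [58] to estimate the position and RGB value of the pixel to modify and then built the adversarial image. Interestingly, DE does not need gradient information from the model, which makes this a query-based black-box attack. The authors also analyzed modifying a group of pixels, e.g. changing five pixels instead of one to fool the model.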

4.5 DeepFool attack

Instead of bounding the perturbation norm by a fixed value, Moosavi-Dezfooli et al. [59] minimized the norm of the adversarial perturbation by solving:

$\Delta(I,l):=\min\limits_{\rho}\Vert\rho\Vert_2\quad \text{s.t.}\quad\tilde{l}\ne l\quad\quad\quad\quad\quad(10)$

The main motivation for computing minimum-norm perturbations is to quantify the adversarial robustness of the target model, defined as:

$\rho_{adv}=\mathbb{E}_I\frac{\Delta(I,l)}{\Vert I\Vert_2}\quad\quad\quad\quad\quad(11)$

Here $\mathbb{E}_I$ is the expectation over the data distribution.

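DeepFool evaluates the robustness defined in (11) by computing $\rho$ from (10). Its iterative algorithm keeps pushing the image toward the decision boundary, as shown in Fig. 5.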

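In each iteration the image is updated by adding a further adversarial perturbation. Although the method was originally proposed to quantify model robustness, DeepFool is now regarded as an effective image-specific adversarial attack.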
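The iteration can be sketched for the binary case, where the classifier is a scalar function whose sign gives the label. At each step the function is linearized and the smallest $l_2$ step that reaches the linearized boundary is added. Everything concrete below is an assumption made for illustration: the quadratic toy classifier `f`, its gradient, the starting point, and the small `overshoot` factor that pushes the point just past the boundary. The multi-class version in [59] additionally searches over classes and is not shown.

```python
import math

def deepfool_binary(f, grad_f, x0, overshoot=0.02, max_iter=50):
    """Accumulate minimal linearized steps until the sign of f flips."""
    x = list(x0)
    r_tot = [0.0] * len(x0)
    positive = f(x0) > 0
    for _ in range(max_iter):
        if (f(x) > 0) != positive:
            break
        g = grad_f(x)
        g2 = sum(v * v for v in g)
        if g2 == 0.0:
            break  # no usable direction at a stationary point
        fx = f(x)
        # Closest point on the linearized boundary: r = -f(x) * g / ||g||^2
        r_tot = [t - fx * v / g2 for t, v in zip(r_tot, g)]
        x = [c + (1.0 + overshoot) * t for c, t in zip(x0, r_tot)]
    return x, r_tot

# Hypothetical classifier: label is the sign of f, boundary is the unit circle.
def f(x):
    return x[0] ** 2 + x[1] ** 2 - 1.0

def grad_f(x):
    return [2.0 * x[0], 2.0 * x[1]]

x0 = [0.6, 0.6]
x_adv, r_tot = deepfool_binary(f, grad_f, x0)
pert_norm = math.sqrt(sum((a - c) ** 2 for a, c in zip(x_adv, x0)))
```

The norm of the returned perturbation is the per-image quantity $\Delta(I,l)$ that (11) averages after dividing by $\Vert I\Vert_2$.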

4.6 C&W attack

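Defensive distillation [60] was considered an effective defense against adversarial attacks. It builds on knowledge distillation for deep networks [61]. However, Carlini & Wagner [62] devised an attack, based on norm-restricted additive perturbations, that completely breaks this defense. They also showed that the attack fools defensively distilled networks in the black-box setting, where the perturbation is computed on an unsecured white-box model. The transferability of their attack severely undermined the effectiveness of defensive distillation.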

Carlini and Wagner obtain the adversarial perturbation by solving the following optimization problem:

$\min\Vert\rho\Vert_p+c\cdot f(I+\rho),\quad \text{s.t.}\ I+\rho\in[0,1]^m\quad\quad\quad\quad(12)$

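Here $f(.)$ is a function such that $\mathcal{M}(\tilde{I})\to\tilde{l}\iff f(I+\rho)\le 0$. The authors analyzed a range of choices for $f(.)$ to compute the desired perturbation. Carlini and Wagner [62] bounded the perturbation with the $l_2$-, $l_{\infty}$- and $l_0$-norms, yielding a set of attacks. They later showed the attacks to be effective against other defenses as well [63]. The Carlini & Wagner (C&W) attack is generally considered strong, but it is also computationally very expensive.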
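As an illustration of the role of $f(.)$ in (12), the sketch below uses a margin on the logits: it is positive while some other class beats the target and drops to zero (or below, with a confidence margin `kappa`) once the target logit is the largest. This particular form, the toy `toy_logits` model, and all numbers are assumptions chosen for the example. The text only says that several forms of $f(.)$ were analyzed. The box constraint $I+\rho\in[0,1]^m$ is not enforced here.

```python
import math

def cw_f(logits, target, kappa=0.0):
    """Margin-style f: <= 0 exactly when the target logit is the largest."""
    best_other = max(z for k, z in enumerate(logits) if k != target)
    return max(best_other - logits[target], -kappa)

def cw_objective(logits_fn, x, rho, target, c, kappa=0.0):
    """Eq. (12) with p = 2: ||rho||_2 + c * f(I + rho)."""
    x_adv = [xi + ri for xi, ri in zip(x, rho)]
    norm = math.sqrt(sum(r * r for r in rho))
    return norm + c * cw_f(logits_fn(x_adv), target, kappa)

def toy_logits(x):
    """Hypothetical 2-class model whose logits are the input itself."""
    return [x[0], x[1]]

x, target, c = [0.6, 0.4], 1, 10.0
obj_clean = cw_objective(toy_logits, x, [0.0, 0.0], target, c)
obj_adv = cw_objective(toy_logits, x, [-0.15, 0.15], target, c)
```

The constant `c` trades off the two terms: a larger value favors reaching the target label, a smaller one favors a smaller perturbation.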

4.7 Universal Adversarial Perturbations

The methods above all compute an adversarial perturbation for one specific image. Moosavi-Dezfooli et al. [35] instead computed image-agnostic perturbations, as shown in Fig. 6.

These perturbations are sought to satisfy the following constraint:

$P_{I\sim\mathcal{I}}(\mathcal{M}(I)\ne\mathcal{M}(I+\rho))\ge\delta\quad \text{s.t.}\quad\Vert\rho\Vert_p\le\eta\quad\quad\quad\quad\quad(13)$

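Here $P(.)$ denotes probability, $\mathcal{I}$ is the distribution of clean images, and $\delta\in(0,1]$ is a pre-defined scalar. The resulting universal adversarial perturbations were shown to be effective under both $l_2$- and $l_{\infty}$-norm bounds. The experiments in [35] show that restricting the perturbation norm to 4% of the respective image norm achieves high fooling rates (around 80%) on popular ImageNet models, e.g. ResNet [12] and Inception [14]. However, a 4% distortion is usually slightly perceptible to the human visual system, so the authors call the perturbations quasi-imperceptible.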

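Universal adversarial perturbations transfer well across models. Because the resulting perturbation depends on the parameters $\delta$ and $\eta$, it is also commonly called $(\delta,\eta)$-universal. Moosavi-Dezfooli et al. compute these perturbations with DeepFool [59], which gradually modifies an image until it crosses the decision boundary of its class. In the universal case, the iterative algorithm gradually pushes all data points across their respective decision boundaries. The original paper reports that computing the universal perturbation from only 2000 training images still achieves a 50% fooling rate on ImageNet.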
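Constraint (13) can be checked empirically on a finite sample: the fooling rate is the fraction of images whose predicted label changes when the same $\rho$ is added. The sketch below shows that check only, not the algorithm that computes $\rho$. The classifier `toy_label`, the four sample "images", and the perturbation are hypothetical.

```python
def lp_norm(v, p):
    if p == float("inf"):
        return max(abs(a) for a in v)
    return sum(abs(a) ** p for a in v) ** (1.0 / p)

def fooling_rate(label_fn, images, rho):
    """Fraction of images whose label changes when the same rho is added."""
    fooled = sum(
        label_fn(img) != label_fn([a + r for a, r in zip(img, rho)])
        for img in images
    )
    return fooled / len(images)

def is_universal(label_fn, images, rho, delta, eta, p=2):
    """Empirical version of Eq. (13): fooling rate >= delta and ||rho||_p <= eta."""
    return lp_norm(rho, p) <= eta and fooling_rate(label_fn, images, rho) >= delta

# Hypothetical classifier and data for illustration.
def toy_label(x):
    return int(x[0] > x[1])

images = [[0.6, 0.4], [0.9, 0.1], [0.2, 0.3], [0.5, 0.45]]
rho = [-0.3, 0.3]
rate = fooling_rate(toy_label, images, rho)
```

On real data the sample estimate would be computed on held-out images, since the probability in (13) is taken over the image distribution rather than the images used to craft $\rho$.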

5 RECENT ATTACKS ON CLASSIFIERS

Mainly building on the core concepts of the first-generation
attacks, there have been a multitude of more recent attacks on
image classifiers. We cover those attacks in this section as per
the structure illustrated in Fig. 2(c).

5.1 Advanced gradient based attacks

A variety of contributions still aim to improve the core strategy of gradient ascent for adversarial attacks. Naturally, these methods can be seen as downstream fine-tuning of first-generation attacks such as FGSM or PGD. For example, Dong et al. [64] proposed to focus the gradient-based perturbation, computed in an FGSM-like manner, on the salient regions of images with the help of super-pixel guided attention. Such perturbations are claimed to be more robust against image processing based defenses. Similarly, Guo et al. [65] focused on improving the transferability of gradient-based attacks by backpropagating the computed gradients linearly through the model. Their gradient backpropagation mimics the scenario in which nonlinear activations are not encountered in the forward pass. This modification is claimed to achieve better transferability of gradient-based attacks on large-scale models.

Dong et al. [66] proposed the GreedyFool algorithm, which performs a sparse distortion of the input image based on the gradients of its pixels. With improved sparsity, the perceptibility of their gradient-based perturbations becomes lower. Sriramanan et al. [67] proposed the Guided Adversarial Margin Attack (GAMA), which introduces a relaxation term in the standard losses (e.g. cross-entropy) of gradient-based attacks such as PGD. It is claimed that this modification allows the attack to find better gradient directions, thereby increasing its efficacy. Similarly, Tashiro et al. [68] devised a gradient-based strategy called Output Diversified Sampling (ODS) that is claimed to improve attacks in both white-box and black-box setups. Many adversarial attacks sample randomly from distributions, e.g. to initialize the optimization process or to update queries (in the black-box setup). ODS is mainly intended to provide a better sampling scheme for such attacks.

In [69], decoupling the direction and the norm of $l_2$-norm bounded gradient-based perturbations was proposed to make the attack more lethal. The resulting attack is commonly referred to as the Decoupled Direction and Norm (DDN) attack. In [70], Yao et al. recommended upgrading the first-generation gradient-based attacks with Trust Regions [71]. During optimization, a trust region around the current point in the loss landscape is used to find descent/ascent directions that reduce errors due to the local nature of decisions. It is shown that multiple first-generation attacks can be improved in terms of norm reduction and computational efficiency using trust regions. In [72], Phan et al. argue that the influence of the camera image processing pipeline should also be considered in attacks. They develop a gradient-based attack through a differentiable approximation of this pipeline, such that their perturbations fool classifiers on images from one camera pipeline but not on images from another.

We emphasize that although we categorize only a few methods under advanced gradient attacks, nearly all white-box (and transfer-based) attacks can be placed under this title, because those attacks end up dealing with model gradients rather directly. However, we introduce those attacks under subcategories more suited to their objectives or threat models. Our intention in including a separate subsection for 'advanced' gradient attacks is to emphasize that improving the core gradient ascent scheme for attacks is still an active direction in this domain. The gradient-based attacks, which are inherently white-box, are generally the easiest to compute. Hence, they are the hardest to defend against. This makes them a useful tool to analyze model robustness.

5.2 Black-box attacks

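From a purely adversarial standpoint, black-box attacks are the most practical because they require no (or minimal) knowledge of the target model. This practicality has also made them very popular recently. We examine black-box attacks along two lines: query-based and transfer-based.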

Query-based attacks

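These attacks query the target model and use its outputs to construct adversarial images. Broadly, the goal of black-box attacks is to achieve minimal distortion of the adversarial example while still fooling the model. Queries are typically used to refine stronger perturbations for imperceptibility, as shown in Fig. 7. Query-based attacks are either score-based, where the model returns confidence scores, or decision-based, where it returns only the predicted label. Being more practical, decision-based attacks are more popular than score-based ones.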

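Recently, Rahmati et al. [73] introduced a framework that launches black-box attacks by exploiting the geometry of the decision boundary. The attack needs only a small number of queries to the target model and only the top-1 label in return. It uses the small 'mean' curvature of the decision boundary near data points to estimate a normal vector, along which a data point can be moved across the boundary by adding a small $l_p$-norm ($p\ge 1$) perturbation. The authors also show that the computed perturbation converges to the minimal norm for $p=2$ for curvature-bounded decision boundaries. Better performance in terms of query count and perturbation norm is reported compared to the Boundary attack [74], the HopSkipJump attack [75] and the qFool attack [76].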

The Customized Adversarial Boundary (CAB) attack [77] reduces the number of queries by customizing the adversarial noise distribution with the queries in the query history, and by initializing with perturbations already aimed at transferable attacks. Similarly, to improve query efficiency, a technique that extracts a generalizable prior from earlier queries using meta-learning is proposed in [78]. Another effort to improve query efficiency is the Projection & Probability-Driven Black-box Attack (PPBA) [79], which restricts the solution space of the problem with a low-frequency constrained sensing matrix, a concept inspired by compressive sensing theory. Li et al. [80] proposed a Query Efficient Boundary-based Black-box Attack (QEBA) that iteratively adds perturbation to a source image to retain its original label, but alters the image to form a perceptibly clean target image of a different object.

A Bayesian optimization based attack is proposed in [81]. One way to reduce the number of queries is to search for adversarial images in a latent space of lower dimensionality than the original image space. In that case, estimating the correct dimensionality of the latent space becomes a problem of its own. Ru et al. [81] employ a non-parametric Bayesian strategy to resolve it, exploiting surrogate models based on Gaussian Processes [82] to generate queries. Cheng et al. [83] also claimed a query-efficient attack by altering the optimization objective of their previous work [84], which performed a binary search to estimate the gradient of the target model from query results. The improved attack drastically reduces the number of queries by estimating gradient signs instead of gradients [83]. In another attempt to decrease the number of queries, Cheng et al. [85] also introduced a prior-guided random gradient-free method.

The TRansferable EMbedding based Black-box Attack (TREMBA) [86] trains an encoder-decoder model to learn a low-dimensional embedding space, in which an adversarial example is searched for a given target model in a query-based setup. This is claimed to significantly reduce the number of queries by shrinking the search space. Another method that approaches the problem from the search-space perspective performs the attack as a progressive binary search using gradient signs (instead of magnitudes) [87]. The attack fools MNIST models with as few as 12 queries. Ilyas et al. [88] revisited zeroth-order optimization (ZOO) and proposed a query-based attack using bandit optimization that exploits prior information about the target model gradient. Also from the ZOO perspective, Zhao et al. [89] proposed to augment the optimization with an ADMM-based framework.

Query-based black-box attacks are attracting growing attention, with many recent contributions, e.g. [90], [91], [92], [84], [93]. The current literature mostly deals with decision-based attacks [94], [95], [96], [97], [98]. However, score-based attack schemes are also frequently encountered in the recent literature [99], [100].
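A building block that recurs in decision-based attacks is a binary search along the line between the clean image and some already-adversarial image, using only top-1 label queries, to find a point just across the decision boundary. The sketch below is a generic illustration of that idea and is not the algorithm of any specific paper cited above. The classifier `toy_label`, the two starting points, and the tolerance are assumptions.

```python
import math

def boundary_binary_search(label_fn, x_clean, x_adv, tol=1e-3):
    """Shrink an adversarial point toward the clean image using label queries only.

    Returns the refined adversarial point and the number of queries spent."""
    clean_label = label_fn(x_clean)
    if label_fn(x_adv) == clean_label:
        raise ValueError("x_adv must be misclassified to start the search")
    queries = 2
    lo, hi = 0.0, 1.0  # interpolation weights; hi always maps to an adversarial point

    def blend(t):
        return [c + t * (a - c) for c, a in zip(x_clean, x_adv)]

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        queries += 1
        if label_fn(blend(mid)) != clean_label:
            hi = mid  # still adversarial: move closer to the clean image
        else:
            lo = mid
    return blend(hi), queries

# Hypothetical top-1 label oracle and starting points.
def toy_label(x):
    return int(x[0] + x[1] > 1.0)

x_clean, x_start = [0.2, 0.3], [0.9, 0.9]
x_refined, n_queries = boundary_binary_search(toy_label, x_clean, x_start)

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
```

The query count grows only logarithmically with the requested precision, which is one reason this kind of search is attractive when every query to the target model is costly.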

Conclusion

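This article reviewed adversarial attack and defense algorithms for deep visual models. Because this relatively new research area uses a lot of terminology, the article also introduced some of the terms.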

Although deep neural networks achieve high accuracy on many computer vision tasks, they have been found vulnerable to adversarial perturbations that change the model's output. Since deep learning is at the core of current progress in AI, this research direction has attracted enormous interest, with many new contributions every year. This article reviewed those contributions and, to ensure the authenticity and quality of the material discussed, mainly covered papers from top venues. These papers make clear that adversarial attacks are a serious threat to the practical deployment of deep learning models. The existing literature demonstrates that deep learning can currently be attacked effectively not only in cyberspace but also in the physical world. However, owing to the very high activity in this research direction, it can be hoped that deep learning will show considerable robustness against adversarial attacks in the future.