Explaining Classifiers using Adversarial Perturbations on the Perceptual Ball: Paper Notes

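Paper link

Personally, I think this is a strong paper that follows Occam's razor (simple but it works). The core idea: when solving the constrained optimization problem that produces an adversarial perturbation, add a further constraint based on the intermediate-layer feature maps, so that the resulting perturbation concentrates around the object in the image being classified. The perturbation itself then carries some semantic information. See Figure 2 for a detailed illustration.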

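Abstract

This paper proposes a perceptual-loss-based regularization for adversarial perturbations (the paper's central idea). The resulting perturbations remain imperceptible to the human eye, and they differ from those produced by other current methods in one important way: they are semi-sparse alterations that highlight objects and regions of interest while leaving the background unchanged.

The authors validate the proposed method on several standard explainability benchmarks.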

1. Introduction

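This paper shows why small changes to an image that flip a classifier's decision do not lead to semantically meaningful alterations. Suppose an image is classified as a bird. We might expect the smallest change that flips this decision to act on the pixels where the bird is, but in practice adversarial perturbations make non-local changes to alter the classifier's behavior. To try to fix this breakage, we use a perceptual loss to add a penalty on the classifier's mid-level responses. The resulting perturbations show semantically meaningful changes and highlight the object to be recognized in the image (see Figures 1 and 2).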

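Two questions usually come up about adversarial perturbations:

Why are these perturbations imperceptible?
Why are these perturbations not localized on the object?

(A basic question here: what exactly is a feature map, and what information does it represent? I should deepen my understanding of feature maps when I have time.)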

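There are two compelling explanations for the existence of imperceptible adversarial perturbations on images:

The first [16] holds that adversarial perturbations are an artifact of the high-dimensional space: the sum of tiny perturbations over every pixel ends up having a very large effect on the classifier's output. A similar property has already been observed in linear classifiers.
The second tries to understand why sparse attacks (even single-pixel attacks) can exist, and attributes the effectiveness of adversarial perturbations to exploding gradients. Gradients explode because a neural network is built as a product of (convolutional) matrix operations interlaced with nonlinearities. In directions/locations where the nonlinearities behave approximately linearly, the eigenvalues of the Jacobian can grow exponentially with depth. Although this phenomenon has been widely studied in network training, with remedies such as normalization [19] and gradient clipping [30], it still persists when generating adversarial perturbations. For this reason, a carefully chosen small perturbation can have a very large effect on the final decision of a deep network.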
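
The exponential growth described in the second explanation can be seen in a toy setting. The sketch below is only an illustration and is not from the paper: it assumes a stack of identical, purely linear layers with a made-up 2x2 weight matrix `W` whose eigenvalues are 1.5 and 0.5, and measures how much a unit-size input change is amplified after a given depth.

```python
def matvec(W, v):
    """Multiply a 2x2 matrix (list of rows) by a length-2 vector."""
    return [W[0][0] * v[0] + W[0][1] * v[1],
            W[1][0] * v[0] + W[1][1] * v[1]]


def amplification(W, direction, depth):
    """Norm of the output change caused by the input change `direction`
    after passing through `depth` identical linear layers."""
    v = list(direction)
    for _ in range(depth):
        v = matvec(W, v)
    return (v[0] ** 2 + v[1] ** 2) ** 0.5


# Hypothetical layer: its Jacobian has eigenvalues 1.5 and 0.5.
W = [[1.5, 0.0],
     [0.0, 0.5]]

for depth in (1, 5, 10, 20):
    grow = amplification(W, [1.0, 0.0], depth)    # direction with gain > 1
    shrink = amplification(W, [0.0, 1.0], depth)  # direction with gain < 1
    print(depth, grow, shrink)
```

In this toy, a perturbation along the first axis grows like 1.5 to the power of the depth while one along the second axis vanishes, which is the sense in which a carefully chosen direction matters far more than the size of the perturbation.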

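To explore how these explanations fit together, and which of them accounts for the familiar behavior of adversarial perturbations, the paper offers a simple and novel regularization: it bounds the exponential growth of the classifier's response by regularizing the perceptual distance between the image and its adversarially perturbed version.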

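(The manifold idea introduced here is another gap in my background. Relative to the full pixel space, images actually form a manifold within that space (can it also be viewed as a subspace?). I should read up on manifolds properly when I have time; here are a few links that introduce them.)

Paper: Generative Visual Manipulation on the Natural Image Manifold
Paper: Image Manifolds
Wikipedia's description of manifolds
Quora answer
A description in Chinese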
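A common criticism of adversarial perturbations is that the generated images lie off the manifold of natural images, and that if we could sample from the manifold, our adversarial perturbations would be both larger and more representative of the real world. Restricting adversarial perturbations to the manifold should limit the impact of exploding gradients: if samples are drawn from this space, a well-trained classifier should implicitly reflect the smoothness of the true labels of the underlying data distribution.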

Although it is believed that the manifold of natural images is low-dimensional [22], characterizing the manifold of anything beyond handwritten digits has proven extremely difficult (discussed in the experiments of [46]). Our method offers a complementary, lightweight alternative. Rather than attempting to characterize the manifold, we penalize search directions that exploit exploding gradients, as these encourage movement off the data manifold when searching for minimal adversarial perturbations.

We propose a novel regularization of adversarial perturbations based on the perceptual loss. Our new perturbations tend to highlight objects and regions of interest in the image (see Figure 1). We evaluate on several standard explainability challenges for image classifiers and further validate using the sanity checks of [1].

2. Prior Work

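Some works add extra constraints on the perturbation so that the generated images are more plausible. Such works restrict the space of perturbations considered by trying to find an adversarial perturbation that confounds many classifiers at once [8], or is robust to image warps [2]. Other approaches consider only a single image and a single classifier, but constrain the adversarial perturbation to lie on the manifold of plausible images [15, 39, 43, 46]. Their main limitation is that they require a plausible generator of natural images, which is feasible on simpler datasets such as MNIST but currently out of reach for even the 224 by 224 thumbnails used by typical ImageNet [37] classifiers.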

Adversarial Perturbations and Counterfactuals

Many works have already connected adversarial perturbations with counterfactuals [52, 53]. This relationship follows from the definition in philosophy and folk psychology of a counterfactual explanation as answering the question “

(To be completed, 2)

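Adversarial Perturbations and Gradient Methods

Most approaches to explaining computer vision models tend to be gradient-based or importance-based: they assign an importance weight to each pixel, superpixel, or mid-level neuron. These gradient methods are closely related to adversarial perturbations. In fact, with most modern networks being piecewise linear (?), if the found adversarial perturbation lies on the same linear piece as the original image, the difference between the original image and the closest adversarial perturbation under the $\ell_2$ norm is equivalent to the direction of steepest descent, up to scaling. As such, $\ell_2$ adversarial perturbations can be thought of as a slightly robustified method of estimating the gradient, that takes into account some local non-linearities.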
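
The equivalence can be checked in a few lines. The derivation below is a sketch under an explicit assumption: the scalar quantity being attacked, written here as $f$ (a symbol introduced only for this illustration), is exactly linear on the piece that contains both $x$ and $x'$, with gradient $g=\nabla_x f(x)$. The closest point on the decision boundary $f(x')=0$ solves

$$
\min_{x'}\ \tfrac{1}{2}\Vert x'-x\Vert_2^2
\quad\text{s.t.}\quad f(x)+g^\top(x'-x)=0 .
$$

Setting the gradient of the Lagrangian $\tfrac{1}{2}\Vert x'-x\Vert_2^2+\mu\,\bigl(f(x)+g^\top(x'-x)\bigr)$ to zero gives $x'-x=-\mu g$, and substituting into the constraint gives $\mu=f(x)/\Vert g\Vert_2^2$, so

$$
x'-x=-\frac{f(x)}{\Vert g\Vert_2^2}\,g .
$$

The perturbation is therefore a rescaled negative gradient, which is the steepest-descent direction for $f$ when $f(x)>0$.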

3. Methodology

Define the classifier as $C(\cdot)$, which takes an image $x$ as input and outputs a $k$-dimensional confidence vector.

For classifiers that assign each image to a single class, we take the label given to the image to be $i=\arg\max\limits_{j}C_j(x)$. Given an image $x$ with label $i$, consider the following multiclass margin:

$$M_i(x')=C_i(x')-\max\limits_{j\ne i}C_j(x')\qquad(1)$$

Note that $M_i(x')\le0$ if and only if $C(\cdot)$ assigns image $x'$ a label other than $i$.
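
As a small illustration of Eq. (1), the sketch below computes the multiclass margin from a confidence vector. The function name `multiclass_margin` and the example confidence values are made up for this note and do not come from the paper.

```python
def multiclass_margin(conf, i):
    """Eq. (1): confidence of class i minus the best competing confidence."""
    best_other = max(c for j, c in enumerate(conf) if j != i)
    return conf[i] - best_other


# Hypothetical confidence vectors for an original image x and a perturbed x'.
conf_x = [2.0, 0.5, -1.0]       # class 0 wins, so the margin is positive
conf_x_adv = [0.25, 1.0, -1.0]  # class 1 now wins, so the margin for class 0 is negative

print(multiclass_margin(conf_x, 0))
print(multiclass_margin(conf_x_adv, 0))
```

The sign of the margin is what matters: it is positive while the classifier still picks label $i$ and becomes negative once another class overtakes it.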

For a classifier $C(\cdot)$ that assigns each image to multiple classes (i.e., the pointing game in Section 4.4), we assume the classifier labels image $x$ with the set $I=\{\forall j:C_j(x)>0\}$. For each $i\in I$, we are interested in the classifier's response for that label, so we define the margin as:

$$M_i(x')=C_i(x')\qquad(2)$$

As before, $M_i(x')\le0$ occurs if and only if $C(\cdot)$ does not assign label $i$ to the image. In both cases, we can obtain an adversarial perturbation by minimizing:

$$(M_i(x')-T)^2\qquad(3)$$

Here $T$ is a target value less than 0 (I did not understand this part). It is well known [44] that minimizing a loss of the form:

$$(M_i(x')-T)^2+\lambda\Vert x'-x\Vert_2^2\qquad(4)$$

is equivalent to finding the minimum of Eq. (3) inside the ball defined by $\Vert x-x'\Vert_2^2\le\rho$. As such, minimizing this objective for an appropriate value of $\lambda$ and $T$ is a good strategy for finding adversarial perturbations of image $x$ with a small $\ell_2$ norm.
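
To get a feel for Eq. (4) and for the role of a negative $T$, here is a minimal sketch on a toy problem. It assumes a linear margin $M(x')=w^\top x'+b$ in place of a real network, and the names `minimize_eq4`, `w`, `b`, the step size and the iteration count are all choices made for this illustration, not the paper's settings.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))


def minimize_eq4(x, w, b, T, lam, steps=2000, lr=0.01):
    """Plain gradient descent on (M(x') - T)^2 + lam * ||x' - x||^2
    for the toy linear margin M(x') = w . x' + b, starting from x' = x."""
    xp = list(x)
    for _ in range(steps):
        residual = dot(w, xp) + b - T
        grad = [2.0 * residual * wk + 2.0 * lam * (xpk - xk)
                for wk, xpk, xk in zip(w, xp, x)]
        xp = [xpk - lr * gk for xpk, gk in zip(xp, grad)]
    return xp


x = [1.0, 1.0]
w, b = [1.0, 2.0], 0.0  # original margin M(x) = 3 > 0

x_neg = minimize_eq4(x, w, b, T=-1.0, lam=0.1)
x_zero = minimize_eq4(x, w, b, T=0.0, lam=0.1)
print(dot(w, x_neg) + b)   # margin reached with a negative target
print(dot(w, x_zero) + b)  # margin reached with target 0
```

In this toy the minimizer has the closed form $x'=x+\alpha w$ with $\alpha=(T-M(x))/(\Vert w\Vert_2^2+\lambda)$, so the distance penalty always stops the margin short of $T$. With $T=0$ the margin stays slightly positive and the label does not flip, while a negative $T$ leaves room for the margin to end below zero. That is one plausible reading of why $T$ is chosen below 0.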

Writing $C^{(l)}(x)$ for the classifier response of the $l^{th}$ layer of the neural net, we consider the related loss:

$$(M_i(x')-T)^2+\lambda'\sum\limits_{l\in\mathcal{L}}\Vert C^{(l)}(x')-C^{(l)}(x)\Vert_2^2+\lambda\Vert x'-x\Vert_2^2\qquad(5)$$

defined over a set of layers of the neural network $\mathcal{L}$.

The second term above is the perceptual loss defined in [20]. Minimizing this function is equivalent to finding a minimizer of Eq. (4) subject to the requirement that $x'$ lies in the ball defined by $\sum\limits_{l\in\mathcal{L}}\Vert C^{(l)}(x')-C^{(l)}(x)\Vert_2^2\le\rho'$.
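
The three terms of Eq. (5) can be assembled as in the sketch below. It only evaluates the loss for given inputs. The function names, the flattened-list representation of images and layer responses, and the numbers in the example are assumptions made for this note. In practice the margin and the layer responses would come from a forward pass of the network.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two flattened vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def perceptual_ball_loss(margin, T, feats_adv, feats_orig, x_adv, x, lam, lam_p):
    """Eq. (5): target term + lam_p * perceptual term + lam * pixel term.

    feats_adv / feats_orig hold one flattened response per regularized layer."""
    target = (margin - T) ** 2
    perceptual = sum(squared_l2(fa, fo) for fa, fo in zip(feats_adv, feats_orig))
    pixel = squared_l2(x_adv, x)
    return target + lam_p * perceptual + lam * pixel


# Made-up values: two regularized layers and a two-pixel "image".
loss = perceptual_ball_loss(
    margin=0.5, T=-1.0,
    feats_adv=[[1.0, 2.0], [0.0]], feats_orig=[[1.0, 0.0], [1.0]],
    x_adv=[0.5, 0.5], x=[0.5, 1.0],
    lam=2.0, lam_p=0.1,
)
print(loss)
```

Setting `lam_p` to 0 recovers Eq. (4), which makes it easy to compare the regularized and unregularized objectives on the same inputs.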

To turn the adversarial perturbation into a saliency map, we first obtain the size of the perturbation at each pixel by computing the average squared difference over the channels. Then, to highlight areas with large changes, we apply a Gaussian blur with parameter $\sigma$ to the differences to give our resultant saliency map.
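
The two steps above can be sketched in plain Python as follows. This is an illustration, not the authors' code: images are assumed to be nested lists indexed as `[row][col][channel]`, the Gaussian kernel is truncated at three standard deviations, and borders are handled by clamping indices. The notes specify none of these details.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel, truncated at 3 sigma (an assumption)."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(img, kernel):
    """Convolve every row with the kernel, clamping indices at the border."""
    radius = len(kernel) // 2
    width = len(img[0])
    return [[sum(kernel[k + radius] * row[min(max(c + k, 0), width - 1)]
                 for k in range(-radius, radius + 1))
             for c in range(width)]
            for row in img]


def transpose(img):
    """Swap rows and columns of a 2-D nested list."""
    return [list(col) for col in zip(*img)]


def saliency_map(x, x_adv, sigma):
    """Per-pixel mean squared difference over channels, then Gaussian blur."""
    diff = [[sum((a - b) ** 2 for a, b in zip(pa, pb)) / len(pa)
             for pa, pb in zip(row_adv, row)]
            for row_adv, row in zip(x_adv, x)]
    kernel = gaussian_kernel(sigma)
    # Separable blur: filter along rows, then along columns.
    blurred = blur_rows(diff, kernel)
    return transpose(blur_rows(transpose(blurred), kernel))


# Toy 5x5 RGB image where only the centre pixel is perturbed.
x = [[[0.0, 0.0, 0.0] for _ in range(5)] for _ in range(5)]
x_adv = [[list(p) for p in row] for row in x]
x_adv[2][2] = [0.3, 0.0, 0.0]
s = saliency_map(x, x_adv, sigma=0.5)
```

On this toy input the map peaks at the perturbed centre pixel and falls off symmetrically around it. A larger `sigma` spreads the same mass over a wider area.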

We systematically evaluate the effect of altering the regularized layers for a range of tasks. We find that the method is relatively stable and that Eq. (5) performs better than the unregularized Eq. (4) for weak localization, insertion and deletion, and the pointing game. As shown in Fig. 2, as more layers are regularized the perturbation becomes more localized.