Abstract

Explaining the output of a neural network has long been a difficult task. For an image classifier, one type of explanation is to identify the pixels that strongly influence the final decision. A starting point for this strategy is the gradient of the class score function with respect to the input image. This gradient can be interpreted as a sensitivity map, and there are several techniques that elaborate on this basic idea. This paper makes two contributions:

  • We introduce SmoothGrad, a method that can help visually sharpen gradient-based sensitivity maps.
  • We discuss lessons learned in visualizing these maps.

1 Introduction

A common way to understand an image classification system is to find the regions of an image that most strongly influence its final decision (Baehrens et al., 2010; Zeiler & Fergus, 2014; Springenberg et al., 2014; Zhou et al., 2016; Selvaraju et al., 2016; Sundararajan et al., 2017; Zintgraf et al., 2016). These methods (variously called sensitivity maps, saliency maps, or pixel attribution maps) use occlusion techniques or calculations with gradients to assign an "importance" value to individual pixels, which is meant to reflect their influence on the final classification.

In tests, these methods can sometimes highlight regions that are meaningful to humans, such as the eyes in a face. At the same time, sensitivity maps are often visually noisy, highlighting pixels that look essentially random to a human eye. Of course, a priori we cannot determine whether this noise reflects an underlying truth about how networks perform classification, or is due to more superficial factors. Either way, it seems like a phenomenon worth investigating further.

This paper describes a very simple technique, SmoothGrad, which reduces visual noise in practice and can be combined with other sensitivity map algorithms. The core idea is to take an image of interest, generate new samples by adding noise to it, and then average the sensitivity maps of the sampled images. We also find that the common regularization technique of adding noise at training time (Bishop, 1995) has an additional "de-noising" effect on sensitivity maps. The two techniques (training with noise, and inferring with noise) seem to have an additive effect; performing them together yields the best results.

The paper demonstrates the effect of SmoothGrad by combining it with several gradient-based sensitivity map methods. We provide a conjecture, backed by some empirical evidence, for why the technique works, and why it might be more reflective of how the network performs classification. We also discuss several ways to enhance visualizations of these sensitivity maps.

2 Gradients as sensitivity maps

Consider a classifier that assigns an input image to one class from a set $C$. Given an input image $x$, many image classification networks (Szegedy et al., 2016; LeCun et al., 1998) compute a class activation function $S_c$ for each class $c \in C$, and the final classification $class(x)$ is determined by which class has the highest score. That is,

$$class(x) = \operatorname*{argmax}_{c \in C} S_c(x)$$

(Note: this passage could be used in slides to introduce gradient visualization methods.) A mathematically clean way of locating "important" pixels in the input image has been proposed by several authors, e.g., (Baehrens et al., 2010; Simonyan et al., 2013; Erhan et al., 2009). If the functions $S_c$ are piecewise differentiable, for any image $x$ one can construct a sensitivity map $M_c(x)$ simply by differentiating $S_c$ with respect to the input $x$. In particular, we can define

$$M_c(x) = \partial S_c(x) / \partial x$$

Here $\partial S_c$ denotes the derivative of $S_c$. Intuitively speaking, $M_c$ represents how much difference a tiny change in each pixel of $x$ would make to the classification score for class $c$. As a result, one might hope that the resulting map $M_c$ would highlight key regions.
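For concreteness, the following minimal sketch shows how $M_c(x)$ could be computed with automatic differentiation. It assumes a TensorFlow 2 style `model` that maps a batch of images to per-class scores; the function name and arguments are illustrative, not taken from the paper.

```python
import tensorflow as tf

def vanilla_gradient_map(model, image, class_idx):
    """Compute M_c(x) = dS_c(x)/dx for a single image.

    `model` is assumed to map a batch of images to per-class scores.
    """
    x = tf.convert_to_tensor(image[None, ...], dtype=tf.float32)  # add batch dim
    with tf.GradientTape() as tape:
        tape.watch(x)                    # x is not a Variable, so track it explicitly
        score = model(x)[:, class_idx]   # S_c(x): the score of the target class
    grad = tape.gradient(score, x)       # dS_c/dx, same shape as the input batch
    return grad[0].numpy()               # drop the batch dimension
```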

In practice, the sensitivity map of a label does seem to show a correlation with regions where that label is present (Baehrens et al., 2010; Simonyan et al., 2013). However, the sensitivity maps based on raw gradients are typically visually noisy, as shown in Fig. 1. Moreover, as this image shows, the correlations with regions a human would pick out as meaningful are rough at best.

2.1 Previous work on enhancing sensitivity maps

There are several hypotheses for the apparent noise in raw gradient visualizations. One possibility, of course, is that the maps are faithful descriptions of what the network is doing. Perhaps certain pixels scattered, seemingly at random, across the image are central to how the network is making a decision. On the other hand, it is also possible that using the raw gradient as a proxy for feature importance is not optimal. Seeking better explanations of network decisions, several prior works have proposed modifications to the basic technique of gradient sensitivity maps; we summarize a few key examples here.

One issue with using the gradient as a measure of influence is that an important feature may "saturate" the function $S_c$. In other words, it may have a strong effect globally, but with a small derivative locally. Several approaches, Layerwise Relevance Propagation (Bach et al., 2015), DeepLift (Shrikumar et al., 2017), and more recently Integrated Gradients (Sundararajan et al., 2017), attempt to address this potential problem by estimating the global importance of each pixel, rather than local sensitivity. Maps created with these techniques are referred to as "saliency" or "pixel attribution" maps.

Another strategy for enhancing sensitivity maps has been to change or extend the backpropagation algorithm itself, with the goal of emphasizing positive contributions to the final outcome. Two examples are the Deconvolution (Zeiler & Fergus, 2014) and Guided Backpropagation (Springenberg et al., 2014) techniques, which modify the gradients of ReLU functions by discarding negative values during the backpropagation calculation. The intention is to perform a type of "deconvolution" which will more clearly show features that triggered activations of high-level units. Similar ideas appear in (Selvaraju et al., 2016; Zhou et al., 2016), which suggest ways to combine gradients of units at multiple levels.

In what follows, we provide detailed comparisons of "vanilla" gradient maps with those created by integrated gradient methods and guided backpropagation. A note on terminology: although the terms "sensitivity map", "saliency map", and "pixel attribution map" have been used in different contexts, in this paper we will refer to these methods collectively as "sensitivity maps."

2.2 Smoothing noisy gradients

There is a possible explanation for the noise in sensitivity maps: the derivative of the function $S_c$ may fluctuate sharply at small scales. In other words, the apparent noise one sees in a sensitivity map may be due to essentially meaningless local variations in partial derivatives. After all, given typical training techniques there is no reason to expect derivatives to vary smoothly. Indeed, the networks in question are typically based on ReLU activation functions, so $S_c$ generally will not even be continuously differentiable.

Fig. 2 gives an example of strongly fluctuating partial derivatives:

The figure fixes a particular image $x$ and an image pixel $x_i$, and plots the values of $\frac{\partial S_c}{\partial x_i}(x + t\epsilon)$ as a fraction of the maximum entry in the gradient vector, $\max_i \frac{\partial S_c}{\partial x_i}(x + t\epsilon)$, for a short line segment $x + t\epsilon$ in the space of images parameterized by $t \in [0, 1]$. We show it as a fraction of the maximum entry in order to verify that the fluctuations are significant. The length of this segment is small enough that the starting image $x$ and the final image $x + \epsilon$ look the same to a human. Furthermore, each image along the path is correctly classified by the model. The partial derivatives with respect to the red, green, and blue components, however, change significantly.

Given these rapid fluctuations, the gradient of $S_c$ at any given point will be less meaningful than a local average of gradient values. This suggests a new way to create improved sensitivity maps: instead of basing a visualization directly on the gradient $\partial S_c$, we could base it on a smoothing of $\partial S_c$ with a Gaussian kernel.

Directly computing such a local average in a high-dimensional input space is intractable, but we can compute a simple stochastic approximation. In particular, we can take random samples in a neighborhood of an input $x$ and average the resulting sensitivity maps. Mathematically, this means calculating:

$$\hat{M}_c(x) = \frac{1}{n} \sum_{1}^{n} M_c\bigl(x + \mathcal{N}(0, \sigma^2)\bigr)$$

where $n$ is the number of samples, and $\mathcal{N}(0, \sigma^2)$ represents Gaussian noise with standard deviation $\sigma$. We refer to this method as SmoothGrad throughout this paper.
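Below is a minimal sketch of this estimator, assuming a hypothetical `gradient_fn(x, class_idx)` that returns the vanilla gradient map $M_c(x)$ (for instance, the sketch in Section 2); the default values for `n` and the noise scale simply mirror the ranges discussed in Section 3.2.

```python
import numpy as np

def smoothgrad(gradient_fn, image, class_idx, n=50, sigma_frac=0.15):
    """Average vanilla gradient maps over n noisy copies of `image`.

    `gradient_fn(x, class_idx)` is assumed to return M_c(x);
    `sigma_frac` sets sigma as a fraction of the image's value range.
    """
    sigma = sigma_frac * (image.max() - image.min())
    total = np.zeros_like(image, dtype=np.float64)
    for _ in range(n):
        noisy = image + np.random.normal(0.0, sigma, size=image.shape)
        total += gradient_fn(noisy, class_idx)
    return total / n
```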

3 Experiments

To evaluate SmoothGrad, we performed a series of experiments with a neural network for image classification (Szegedy et al., 2016; TensorFlow, 2017). The results suggest that the estimated smoothed gradient, $\hat{M}_c$, produces visually more coherent sensitivity maps than the unsmoothed gradient $M_c$, with the resulting visualizations aligning better, to the human eye, with meaningful features.

Our experiments were carried out using an Inception v3 model (Szegedy et al., 2016) that was trained on the ILSVRC-2013 dataset (Russakovsky et al., 2015) and a convolutional MNIST model based on the TensorFlow tutorial (TensorFlow, 2017).

3.1 Visualization methods and techniques

Sensitivity maps are typically visualized as heatmaps. Finding the right mapping from the channel values at a pixel to a particular color turns out to be surprisingly nuanced, and it can have a large effect on the resulting impression of the visualization. This section describes several visualization methods and lessons learned in the process of comparing various sensitivity map work. Some of these techniques may be universally useful regardless of the choice of sensitivity map method.

Absolute value of gradients

Sensitivity map algorithms typically produce signed values. There is considerable ambiguity in how to convert signed values to colors. A key choice is whether to represent positive and negative values differently, or to visualize the absolute value only. Whether taking the absolute value of the gradient is useful depends on the characteristics of the dataset of interest. For example, when the object of interest has the same color across the classes (e.g., digits are always white in MNIST digits (LeCun et al., 2010)), positive gradients indicate a positive signal for the class. On the other hand, for the ImageNet dataset, we found that taking the absolute value of the gradient produced clearer pictures. One possible explanation for this phenomenon is that the direction is context dependent: many image recognition tasks are invariant under color and illumination changes. For example, when recognizing a ball, a dark ball on a bright background would produce negative gradients, while a white ball on a darker background would produce positive gradients.

Capping outlying values

Another property of the gradients we observed is the presence of a few pixels whose gradients are much higher than the average. This is not a new discovery; this property was exploited to generate adversarial examples that are indistinguishable to humans (Szegedy et al., 2013). These outlying values have the potential to throw off color scales completely. Capping those extreme values at a relatively high value (we find the 99th percentile to be sufficient) leads to more visually coherent maps, as in (Sundararajan et al., 2017). Without this post-processing step, maps may end up almost entirely black.
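As a small sketch of this post-processing, the following helper (whose name, channel-summing step, and arguments are illustrative choices, not prescribed by the paper) combines the absolute-value and capping steps described above:

```python
import numpy as np

def grayscale_heatmap(grad_map, percentile=99):
    """Collapse an HxWxC gradient map into a [0, 1] grayscale heatmap.

    Takes absolute values, sums over color channels, then caps outlying
    values at the given percentile before normalizing.
    """
    flat = np.sum(np.abs(grad_map), axis=-1)          # absolute value, per pixel
    vmax = np.percentile(flat, percentile)            # cap extreme outliers
    return np.clip(flat, 0.0, vmax) / (vmax + 1e-12)  # normalize to [0, 1]
```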

Multiplying maps with the input images

Some techniques create a final sensitivity map by multiplying gradient-based values and actual pixel values (Shrikumar et al., 2017; Sundararajan et al., 2017). This multiplication does tend to produce visually simpler and sharper images, although it can be unclear how much of this can be attributed to sharpness in the original image itself. For example, a black/white edge in the input can lead to an edge-like structure on the final visualization even if the underlying sensitivity map has no edges.

However, this multiplication can have undesired side effects. Pixels with a value of 0 will never show up on the sensitivity map. For example, if we encode black as 0, a classifier that correctly predicts a black ball on a white background will never highlight the black ball in the image. On the other hand, multiplying gradients with the input image makes sense when we view the importance of a feature as its contribution to the total score $y$. For example, in a linear system $y = Wx$, it makes sense to consider $x_i w_i$ as the contribution of $x_i$ to the final score $y$. For these reasons, we show our results both with and without the image multiplication in Fig. 5.
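The multiplication itself is a single element-wise product; a tiny illustrative sketch (the names are ours, not from the paper):

```python
import numpy as np

def gradient_times_input(grad_map, image):
    """Element-wise product of a gradient map and the input image,
    analogous to x_i * w_i being the contribution of x_i in y = Wx."""
    return grad_map * image
```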

3.2 Effect of noise level and sample size

SmoothGrad has two hyperparameters: $\sigma$, the noise level (the standard deviation of the Gaussian perturbations), and $n$, the number of samples to average over.

Noise, $\sigma$

Fig. 3 shows the effect of the noise level on a selection of ImageNet examples.


The second column corresponds to the standard gradient (0% noise), which we refer to as the "Vanilla" method in this paper. Since quantitative evaluation of a map remains an unsolved problem, we again focus on qualitative evaluation. We observe that applying 10%-20% noise (middle columns) seems to balance the sharpness of the sensitivity map and maintain the structure of the original image. We also observe that while this range of noise gives generally good results for Inception, the ideal noise level depends on the input. See Fig. 10 for a similar experiment on the MNIST dataset.

Sample size, $n$

Fig. 4 shows the effect of the sample size $n$:

As expected, the estimated gradient becomes smoother as the sample size $n$ increases. We empirically found diminishing returns: there was little apparent change in the visualizations for $n > 50$.
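To make such a sweep concrete, the following sketch reuses the hypothetical `smoothgrad` and `grayscale_heatmap` helpers from earlier (together with an assumed `gradient_fn`, `image`, and `class_idx`) to generate one heatmap per hyperparameter combination:

```python
# Sweep SmoothGrad's two hyperparameters: the noise level and the sample size.
noise_levels = [0.0, 0.05, 0.10, 0.20, 0.30]  # sigma as a fraction of the value range
sample_sizes = [1, 10, 50, 100]

maps = {}
for sigma_frac in noise_levels:
    for n in sample_sizes:
        m = smoothgrad(gradient_fn, image, class_idx, n=n, sigma_frac=sigma_frac)
        maps[(sigma_frac, n)] = grayscale_heatmap(m)  # one heatmap per (sigma, n) pair
```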

3.3 Qualitative comparison to baseline methods

Since there is no ground truth to allow for quantitative evaluation of sensitivity maps, we follow prior work (Simonyan et al., 2013; Zeiler & Fergus, 2014; Springenberg et al., 2014; Selvaraju et al., 2016; Sundararajan et al., 2017) and focus on two aspects of qualitative evaluation. First, we inspect visual coherence (e.g., the highlights are only on the object of interest, not the background). Second, we test for discriminativity, where in an image with both a monkey and a spoon, one would expect an explanation for a monkey classification to be concentrated on the monkey rather than the spoon, and vice versa.
Regarding visual coherence, Fig. 5 shows a side-by-side comparison between our method and three gradient-based methods: Integrated Gradients (Sundararajan et al., 2017), Guided BackProp (Springenberg et al., 2014), and the vanilla gradient. Among a random sample of 200 images that we inspected, we found SmoothGrad to consistently provide more visually coherent maps than Integrated Gradients and the vanilla gradient. While Guided BackProp provides the sharpest maps (last three rows of Fig. 5), it is prone to failure (first three rows of Fig. 5), especially for images with uniform backgrounds. In contrast, our observation is that SmoothGrad has the highest impact when the object is surrounded by a uniform background color (first three rows of Fig. 5). Exploring this difference is an interesting area for investigation. It is possible that the smoothness of the class score function is related to the spatial statistics of the underlying image; noise may have a differential effect on the sensitivity to different textures.
Fig. 6 compares the discriminativity of our method to the other methods. Each image contains at least two objects of different classes that the network may recognize. To visually show discriminativity, we compute the sensitivity maps $M_1(x)$ and $M_2(x)$ for the two classes, scale each to $[0, 1]$, and calculate the difference $M_1(x) - M_2(x)$. We then plot the values on a diverging color map $[-1, 0, 1] \mapsto [\text{blue}, \text{gray}, \text{red}]$. For these images, SmoothGrad qualitatively shows better discriminativity than the other methods. It remains an open question which properties affect the discriminativity of a given method, e.g., understanding why Guided BackProp seems to show the weakest discriminativity.
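A sketch of this difference-map computation follows; collapsing the color channels with an absolute-value sum before scaling is our own illustrative choice, not something specified by the paper:

```python
import numpy as np

def discriminativity_map(map_class_1, map_class_2):
    """Difference of two per-class sensitivity maps, in [-1, 1].

    Each map is collapsed over channels, scaled to [0, 1] independently,
    and then subtracted, ready for a diverging blue/gray/red color map.
    """
    def to_unit_range(m):
        m = np.sum(np.abs(m), axis=-1)
        return (m - m.min()) / (m.max() - m.min() + 1e-12)
    return to_unit_range(map_class_1) - to_unit_range(map_class_2)
```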

3.4 Combining SmoothGrad with other methods

We can view SmoothGrad as smoothing the vanilla gradient method with a simple procedure: averaging the vanilla sensitivity maps of $n$ noisy images. With that in mind, the same smoothing procedure can be used to augment any gradient-based method. Fig. 7 shows the result of applying SmoothGrad to Integrated Gradients and to Guided BackProp:


We observe that this augmentation improves the visual coherence of sensitivity maps for both methods.
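Because the earlier `smoothgrad` sketch is parameterized by the function that produces the base map, augmenting another method amounts to swapping that function in. In the lines below, `integrated_gradients_fn` and `guided_backprop_fn` are assumed to exist with the same `(image, class_idx) -> map` signature; they are placeholders, not implementations from the paper.

```python
# Smooth three different base methods with the same averaging procedure.
smooth_vanilla    = smoothgrad(gradient_fn, image, class_idx)
smooth_integrated = smoothgrad(integrated_gradients_fn, image, class_idx)
smooth_guided     = smoothgrad(guided_backprop_fn, image, class_idx)
```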

3.5 Adding noise during training

SmoothGrad as discussed so far may be applied to classification networks as-is. In situations where there is a premium on legibility, however, it is natural to ask whether there is a similar way to modify the network weights so that its sensitivity maps are sharper. One idea that is parallel in some ways to SmoothGrad is the well-known regularization technique of adding noise to samples during training (Bishop, 1995). We find that the same method also improves the sharpness of the sensitivity map.
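A minimal sketch of this training-time augmentation using TensorFlow's `tf.data` pipeline; `train_ds` is assumed to be an existing dataset of `(image, label)` pairs with float-valued images, and the noise scale is an illustrative value rather than one from the paper:

```python
import tensorflow as tf

def add_gaussian_noise(image, label, sigma=0.15):
    """Training-time augmentation: perturb each example with Gaussian noise."""
    noisy = image + tf.random.normal(tf.shape(image), stddev=sigma)
    return noisy, label

# Apply the perturbation to every training example before fitting the model.
noisy_train_ds = train_ds.map(add_gaussian_noise)
```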

Fig. 8 and Fig. 9 show the effect of adding noise at training time and/or evaluation time for the MNIST and Inception models, respectively. Interestingly, adding noise at training time seems to also provide a de-noising effect on the sensitivity map. Lastly, the two techniques (training with noise, and inferring with noise) seem to have an additive effect; performing them together produces the most visually coherent map of the four combinations.
