XAI Fundamentals Series: Grad-CAM

Global Average Pooling (GAP)

References:
"Deep Learning Fundamentals Series (10) | Can Global Average Pooling Replace Fully Connected Layers?" and "Deep Learning | Global Average Pooling"
The description of GAP in Network In Network:
In this paper, we propose another strategy called global average pooling to replace the traditional fully connected layers in CNN. The idea is to generate one feature map for each corresponding category of the classification task in the last mlpconv layer. Instead of adding fully connected layers on top of the feature maps, we take the average of each feature map, and the resulting vector is fed directly into the softmax layer. One advantage of global average pooling over the fully connected layers is that it is more native to the convolution structure by enforcing correspondences between feature maps and categories. Thus the feature maps can be easily interpreted as categories confidence maps. Another advantage is that there is no parameter to optimize in the global average pooling thus overfitting is avoided at this layer. Futhermore, global average pooling sums out the spatial information, thus it is more robust to spatial translations of the input.
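Illustrated as a figure:

As the figure shows, GAP simply takes the mean of each feature map and uses that single value to represent the feature map when it is fed into softmax.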
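To make the operation concrete, here is a minimal PyTorch sketch of GAP on a toy tensor. The tensor shape and values are made up for illustration; the sketch only shows that averaging over the two spatial dimensions gives one number per feature map, and that the built-in adaptive pooling with a 1x1 output gives the same result.

```python
import torch
import torch.nn.functional as F

# Toy batch of feature maps: (batch, channels, height, width).
# Shape and values are illustrative only.
feature_maps = torch.arange(2 * 3 * 4 * 4, dtype=torch.float32).reshape(2, 3, 4, 4)

# GAP: average every feature map over its spatial dimensions.
gap_manual = feature_maps.mean(dim=(2, 3))  # shape (2, 3)

# The same result via adaptive average pooling with a 1x1 output.
gap_builtin = F.adaptive_avg_pool2d(feature_maps, 1).flatten(1)  # shape (2, 3)

# In the NIN setup this vector is fed directly into softmax.
probs = F.softmax(gap_manual, dim=1)
```

Note that GAP itself has no trainable parameters, which is the point the NIN quote makes about overfitting.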

Grad-CAM

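References:
"Deep Learning Paper Notes (Interpretability): CAM and Grad-CAM"
One point to establish before starting: the feature maps of a CNN's last convolutional layer carry the richest class-level semantic information.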

import torch
import torch.nn.functional as F


def find_vgg_layer(arch, target_layer_name):
    """Find the VGG layer used to calculate GradCAM and GradCAM++.

    Args:
        arch: a torchvision VGG model.
        target_layer_name (str): the name of the layer with its hierarchical
            information. Please refer to the usages below.
                target_layer_name = 'features'
                target_layer_name = 'features_42'
                target_layer_name = 'classifier'
                target_layer_name = 'classifier_0'

    Return:
        target_layer: the layer found. This layer will be hooked to get
            forward/backward pass information.
    """
    hierarchy = target_layer_name.split('_')
    # Resolve the top-level block by name ('features' or 'classifier'),
    # so that the 'classifier' usages in the docstring also work.
    target_layer = getattr(arch, hierarchy[0])
    if len(hierarchy) == 2:
        target_layer = target_layer[int(hierarchy[1])]
    return target_layer


class GradCAM(object):
    """Calculate a GradCAM saliency map.

    A simple example:

        # initialize a model, model_dict and gradcam
        vgg = torchvision.models.vgg16(pretrained=True)
        vgg.eval()
        model_dict = dict(model_type='vgg', arch=vgg, layer_name='features_29', input_size=(224, 224))
        gradcam = GradCAM(model_dict)

        # get an image and normalize with mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
        img = load_img()
        normed_img = normalizer(img)

        # get a GradCAM saliency map on the class index 10.
        mask, logit = gradcam(normed_img, class_idx=10)

        # make heatmap from mask and synthesize saliency map using heatmap and img
        heatmap, cam_result = visualize_cam(mask, img)

    Args:
        model_dict (dict): a dictionary that contains 'model_type', 'arch',
            'layer_name', 'input_size' (optional) as keys.
        verbose (bool): whether to print the output size of the saliency map
            given 'layer_name' and 'input_size' in model_dict.
    """

    def __init__(self, model_dict, verbose=False):
        # The docstring documents the key as 'model_type', so read that key.
        model_type = model_dict['model_type']
        layer_name = model_dict['layer_name']
        self.model_arch = model_dict['arch']

        self.gradients = dict()
        self.activations = dict()

        def backward_hook(module, grad_input, grad_output):
            # Gradient of the class score w.r.t. the target layer's output.
            self.gradients['value'] = grad_output[0]
            return None

        def forward_hook(module, input, output):
            # Activations (feature maps) produced by the target layer.
            self.activations['value'] = output
            return None

        if 'vgg' in model_type.lower():
            target_layer = find_vgg_layer(self.model_arch, layer_name)
        else:
            raise ValueError('this snippet only handles VGG models')

        target_layer.register_forward_hook(forward_hook)
        # register_backward_hook is deprecated in recent PyTorch releases in
        # favour of register_full_backward_hook. It is kept here because the
        # full hook can conflict with in-place modules such as ReLU(inplace=True).
        target_layer.register_backward_hook(backward_hook)

        if verbose:
            try:
                input_size = model_dict['input_size']
            except KeyError:
                print("please specify size of input image in model_dict. e.g. {'input_size':(224, 224)}")
            else:
                device = 'cuda' if next(self.model_arch.parameters()).is_cuda else 'cpu'
                self.model_arch(torch.zeros(1, 3, *(input_size), device=device))
                print('saliency_map size :', self.activations['value'].shape[2:])

    def forward(self, input, class_idx=None, retain_graph=False):
        """
        Args:
            input: input image with shape of (1, 3, H, W)
            class_idx (int): class index for calculating GradCAM.
                If not specified, the class index that makes the highest
                model prediction score will be used.

        Return:
            mask: saliency map of the same spatial dimension as input
            logit: model output
        """
        b, c, h, w = input.size()

        logit = self.model_arch(input)
        if class_idx is None:
            score = logit[:, logit.max(1)[-1]].squeeze()  # score of the top-predicted class
        else:
            score = logit[:, class_idx].squeeze()

        self.model_arch.zero_grad()
        score.backward(retain_graph=retain_graph)
        gradients = self.gradients['value']      # e.g. torch.Size([1, 512, 14, 14])
        activations = self.activations['value']  # e.g. torch.Size([1, 512, 14, 14])
        b, k, u, v = gradients.size()

        # GAP over the gradients gives one weight per feature map.
        alpha = gradients.view(b, k, -1).mean(2)  # torch.Size([1, 512])
        # alpha = F.relu(gradients.view(b, k, -1)).mean(2)
        weights = alpha.view(b, k, 1, 1)  # torch.Size([1, 512, 1, 1])

        # Weighted combination of the feature maps, followed by ReLU.
        saliency_map = (weights * activations).sum(1, keepdim=True)
        saliency_map = F.relu(saliency_map)

        # F.upsample is deprecated; F.interpolate accepts the same arguments.
        saliency_map = F.interpolate(saliency_map, size=(h, w), mode='bilinear', align_corners=False)

        # Min-max normalize to [0, 1].
        saliency_map_min, saliency_map_max = saliency_map.min(), saliency_map.max()
        saliency_map = (saliency_map - saliency_map_min).div(saliency_map_max - saliency_map_min).data

        return saliency_map, logit

    def __call__(self, input, class_idx=None, retain_graph=False):
        return self.forward(input, class_idx, retain_graph)
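The hook mechanism the class relies on is easier to see in isolation. The sketch below uses a tiny stand-in network (hypothetical, not VGG) with random weights, so nothing needs to be downloaded. It registers a forward hook and a full backward hook on one layer and checks what each one captures. It uses `register_full_backward_hook`, which assumes a reasonably recent PyTorch and a non in-place ReLU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in network (hypothetical): conv -> ReLU -> GAP -> linear.
model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1),
    nn.ReLU(),                  # not in-place, so the full backward hook is safe here
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(4, 5),
)

captured = {}

def forward_hook(module, inputs, output):
    # Called after the layer's forward pass; stores its output feature maps.
    captured["activations"] = output

def backward_hook(module, grad_input, grad_output):
    # Called during backward; grad_output[0] is d(score)/d(layer output).
    captured["gradients"] = grad_output[0]

target_layer = model[1]
target_layer.register_forward_hook(forward_hook)
target_layer.register_full_backward_hook(backward_hook)

x = torch.randn(1, 3, 8, 8)
logits = model(x)
logits[0, 2].backward()  # backpropagate the score of class index 2
```

After the backward pass, `captured["activations"]` and `captured["gradients"]` have the same shape, which is what lets Grad-CAM multiply them channel by channel. In this toy network the head is GAP followed by a linear layer, so each gradient map is constant over its spatial positions.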

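Code walkthrough (note: the symbols used here follow the original paper):
The inputs are a pretrained model and the layer to extract (here the relu() before the last MaxPool2d() in vgg16, i.e. features_29). Hooks capture the activations and gradients at features_29; the feature maps (activations) and their gradients both have size $1 \times 512 \times 14 \times 14$. Applying GAP to the gradients of each feature map gives the neuron importance weights $\alpha_k^c$, where $y^c$ is the score for class $c$, $A^k$ is the $k$-th feature map, and $Z$ is the number of spatial positions in a feature map. The corresponding formula and code: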
$$\alpha_k^c = \frac{1}{Z}\sum_i \sum_j \frac{\partial y^c}{\partial A^k_{ij}}$$

alpha = gradients.view(b, k, -1).mean(2)  # torch.Size([1, 512])
weights = alpha.view(b, k, 1, 1)  # torch.Size([1, 512, 1, 1])
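As a sanity check, the vectorised `view(...).mean(2)` can be compared with a literal double sum over $i$ and $j$. The gradient values below are made up; the sketch only assumes that $Z = u \times v$, the number of spatial positions in one feature map.

```python
import torch

# Toy gradient tensor with shape (b, k, u, v); values are illustrative only.
gradients = torch.tensor([[[[1.0, 2.0], [3.0, 4.0]],
                           [[-1.0, 0.0], [0.0, 1.0]]]])
b, k, u, v = gradients.size()

# Vectorised form used in the walkthrough.
alpha = gradients.view(b, k, -1).mean(2)  # shape (1, 2)

# Literal form of the formula: (1 / Z) * sum_i sum_j dy^c / dA^k_ij, with Z = u * v.
Z = u * v
alpha_loop = torch.zeros(b, k)
for n in range(b):
    for ch in range(k):
        total = 0.0
        for i in range(u):
            for j in range(v):
                total += gradients[n, ch, i, j].item()
        alpha_loop[n, ch] = total / Z
```

Both forms give 2.5 for the first map and 0.0 for the second, so a map whose positive and negative gradients cancel out receives zero weight.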

We perform a weighted combination of forward activation maps, and follow it by a ReLU to obtain:
$$L_{\text{Grad-CAM}}^c = \mathrm{ReLU}\left(\sum_k \alpha_k^c A^k\right)$$

saliency_map = (weights * activations).sum(1, keepdim=True)
saliency_map = F.relu(saliency_map)
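A small worked example shows what the ReLU does to the weighted sum. The two activation maps and their weights are made-up values; the second map is given a negative weight so that part of the combined map goes negative and is then clipped to zero.

```python
import torch
import torch.nn.functional as F

# Two 2x2 activation maps (k = 2); values are illustrative only.
activations = torch.tensor([[[[1.0, 0.0], [2.0, 0.0]],
                             [[0.0, 3.0], [1.0, 0.0]]]])  # shape (1, 2, 2, 2)

# One weight per map: the first supports the class, the second opposes it.
weights = torch.tensor([1.0, -1.0]).view(1, 2, 1, 1)

# Weighted sum over the channel dimension, then ReLU.
combined = (weights * activations).sum(1, keepdim=True)  # shape (1, 1, 2, 2)
saliency_map = F.relu(combined)
```

Here `combined` is `[[1, -3], [1, 0]]` and `saliency_map` is `[[1, 0], [1, 0]]`: the position dominated by the negatively weighted map is removed, keeping only the regions with a positive influence on the class score.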

For upsampling, see "pytorch torch.nn upsampling: nn.Upsample".
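The last two steps of the forward pass, upsampling the coarse map to the input resolution and scaling it to [0, 1], can be sketched on a toy tensor as follows. The sketch uses `F.interpolate`, the current replacement for the deprecated `F.upsample`. The small `eps` in the denominator is an addition of this sketch (not in the original code) to avoid dividing by zero when the map is constant.

```python
import torch
import torch.nn.functional as F

# Coarse 2x2 saliency map; values are illustrative only.
coarse = torch.tensor([[[[0.0, 2.0], [4.0, 8.0]]]])  # shape (1, 1, 2, 2)
h, w = 8, 8  # target size, standing in for the input image size

# Bilinear upsampling to the target resolution.
upsampled = F.interpolate(coarse, size=(h, w), mode="bilinear", align_corners=False)

# Min-max normalization to [0, 1]; eps is an assumption of this sketch.
eps = 1e-8
lo, hi = upsampled.min(), upsampled.max()
normalized = (upsampled - lo) / (hi - lo + eps)
```

Bilinear interpolation keeps the result within the range of the original values, so min-max scaling afterwards is enough to bring the map into a range suitable for display as a heatmap.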

Results: from left to right, the original image, the Grad-CAM heatmap, and the heatmap overlaid on the original image.
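The overlay step is not shown in the code above (the docstring refers to a `visualize_cam` function that is not defined in this post). As a rough stand-in, the sketch below blends a saliency mask onto an image using a crude two-colour map. `overlay_heatmap` and its `blend` parameter are hypothetical names for this illustration, not the function used to produce the results; a real visualization would typically use a proper colormap.

```python
import torch

def overlay_heatmap(mask, img, blend=0.5):
    """Hypothetical helper: blend a [0, 1] saliency mask onto an RGB image.

    mask: (1, 1, H, W) tensor in [0, 1]; img: (1, 3, H, W) tensor in [0, 1].
    Returns (heatmap, blended), both of shape (1, 3, H, W).
    """
    # Crude two-colour map: red grows with the mask, blue with its complement.
    heatmap = torch.cat([mask, torch.zeros_like(mask), 1.0 - mask], dim=1)
    # Linear blend of heatmap and image, clamped to the valid range.
    blended = (blend * heatmap + (1.0 - blend) * img).clamp(0.0, 1.0)
    return heatmap, blended

# Random stand-ins for a normalized saliency mask and an image.
mask = torch.rand(1, 1, 4, 4)
img = torch.rand(1, 3, 4, 4)
heatmap, cam_result = overlay_heatmap(mask, img)
```

With a real model, `mask` would be the first value returned by the `GradCAM` call and `img` the un-normalized input image scaled to [0, 1].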