[Paper Reading] Maintaining Discrimination and Fairness in Class Incremental Learning

Paper: https://openaccess.thecvf.com/content_CVPR_2020/html/Zhao_Maintaining_Discrimination_and_Fairness_in_Class_Incremental_Learning_CVPR_2020_paper.html
Published at: CVPR 2020

Abstract

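Deep neural networks (DNNs) have been applied to class-incremental learning, which aims to solve the common real-world problem of continually learning new classes. One drawback of standard DNNs is that they are prone to catastrophic forgetting. Knowledge distillation (KD) is a commonly used technique to alleviate this problem. In this paper, we show that KD can indeed help the model output more discriminative results among the old classes. However, it cannot alleviate the model's tendency to classify objects into new classes, so the positive effect of KD is hidden and limited. We observe that an important factor behind catastrophic forgetting is that, in class-incremental learning, the weights of the last fully connected (FC) layer are heavily biased. Motivated by this observation, we propose a simple and effective solution to catastrophic forgetting. First, we use KD to maintain discrimination among the old classes. Then, to further maintain fairness between old and new classes, we propose Weight Aligning (WA), which corrects the biased weights in the FC layer after the normal training process. Unlike previous work, WA requires neither extra parameters nor a validation set held out in advance, because it uses the information provided by the biased weights themselves. We evaluate the proposed method on ImageNet-1000, ImageNet-100, and CIFAR-100 under various settings. The experimental results show that the proposed method effectively alleviates catastrophic forgetting and clearly outperforms state-of-the-art methods.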

Method

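Neither the motivation nor the solution in this paper is particularly novel. It again targets the problem that, in class-incremental learning, the output (FC) layer tends to predict new classes, and it proposes a correction step to remove that bias. What differs is that the correction here is "unsupervised": it needs no auxiliary information such as statistics from the old model or an extra validation set.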

The proposed method is called Weight Aligning (WA). Concretely, it aligns the norms of the new-class weight vectors with those of the old classes. Formally, write the FC-layer weights of the old and new classes as:

$$
\mathbf{W}=\left(\mathbf{W}_{old}, \mathbf{W}_{new}\right)
$$

$$
\begin{aligned}
\mathbf{W}_{old} &=\left(\mathbf{w}_{1}, \mathbf{w}_{2}, \cdots, \mathbf{w}_{C_{old}^{b}}\right) \in \mathbb{R}^{d \times C_{old}^{b}} \\
\mathbf{W}_{new} &=\left(\mathbf{w}_{C_{old}^{b}+1}, \cdots, \mathbf{w}_{C_{old}^{b}+C^{b}}\right) \in \mathbb{R}^{d \times C^{b}}
\end{aligned}
$$

The norms of the corresponding weight vectors are then:

$$
\begin{aligned}
\mathrm{Norm}_{old} &=\left(\left\|\mathbf{w}_{1}\right\|, \cdots,\left\|\mathbf{w}_{C_{old}^{b}}\right\|\right) \\
\mathrm{Norm}_{new} &=\left(\left\|\mathbf{w}_{C_{old}^{b}+1}\right\|, \cdots,\left\|\mathbf{w}_{C_{old}^{b}+C^{b}}\right\|\right)
\end{aligned}
$$

The new-class weights are then rescaled so that their mean norm matches that of the old classes:

$$
\widehat{\mathbf{W}}_{new}=\gamma \cdot \mathbf{W}_{new}, \qquad
\gamma=\frac{\operatorname{Mean}\left(\mathrm{Norm}_{old}\right)}{\operatorname{Mean}\left(\mathrm{Norm}_{new}\right)}
$$

Now look back at how the network produces its output. It can be split into two steps: the feature $\phi(\mathbf{x})$ of the input $\mathbf{x}$ is extracted, and it is then multiplied by the fully connected layer of the classification head:

$$
\mathbf{o}(\mathbf{x})=\begin{pmatrix} \mathbf{o}_{old}(\mathbf{x}) \\ \mathbf{o}_{new}(\mathbf{x}) \end{pmatrix}=\begin{pmatrix} \mathbf{W}_{old}^{T} \phi(\mathbf{x}) \\ \mathbf{W}_{new}^{T} \phi(\mathbf{x}) \end{pmatrix}
$$

Because the weights have been aligned as above, the corrected output is:

$$
\mathbf{o}_{corrected}(\mathbf{x})=\begin{pmatrix} \mathbf{W}_{old}^{T} \phi(\mathbf{x}) \\ \widehat{\mathbf{W}}_{new}^{T} \phi(\mathbf{x}) \end{pmatrix}=\begin{pmatrix} \mathbf{W}_{old}^{T} \phi(\mathbf{x}) \\ \gamma \cdot \mathbf{W}_{new}^{T} \phi(\mathbf{x}) \end{pmatrix}=\begin{pmatrix} \mathbf{o}_{old}(\mathbf{x}) \\ \gamma \cdot \mathbf{o}_{new}(\mathbf{x}) \end{pmatrix}
$$

Overall the method really is extremely simple. However, the code does not appear to have been open-sourced yet, so it is still unclear how these operations are actually implemented.
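With no official code to refer to, the following is only a minimal sketch of how the formulas above could be applied to a trained FC layer; it is not the authors' implementation. It assumes the norm is the Euclidean (L2) norm and stores the weights as a plain list of per-class vectors (the columns of W). The names `weight_align` and `logits` are hypothetical helpers introduced for illustration.

```python
import math


def weight_align(class_weights, num_old):
    """Rescale new-class weight vectors so their mean norm matches the old classes.

    class_weights: list of per-class weight vectors (columns of W), old classes first.
    num_old: number of old classes.
    Returns (aligned_weights, gamma).
    Assumption: the norm is the L2 norm.
    """
    norms = [math.sqrt(sum(v * v for v in w)) for w in class_weights]
    mean_old = sum(norms[:num_old]) / num_old
    mean_new = sum(norms[num_old:]) / (len(norms) - num_old)
    gamma = mean_old / mean_new
    # Old-class weights are left untouched; new-class weights are scaled by gamma.
    aligned = [list(w) for w in class_weights[:num_old]]
    aligned += [[gamma * v for v in w] for w in class_weights[num_old:]]
    return aligned, gamma


def logits(class_weights, phi):
    """Compute W^T phi(x): one dot product per class."""
    return [sum(wi * pi for wi, pi in zip(w, phi)) for w in class_weights]


# Toy example: 2 old classes (norm 5 each) and 2 new classes (norm 10 each).
W = [[3.0, 4.0], [0.0, 5.0], [6.0, 8.0], [0.0, 10.0]]
W_aligned, gamma = weight_align(W, num_old=2)

phi = [1.0, 1.0]                      # stand-in for an extracted feature phi(x)
o = logits(W, phi)                    # biased logits: new classes dominate
o_corrected = logits(W_aligned, phi)  # new-class logits scaled by gamma
print(gamma, o, o_corrected)
```

In this toy case gamma is 0.5, so the new-class logits are halved while the old-class logits stay the same, which matches the last equation: correcting the weights is equivalent to scaling the new-class outputs by γ. In a real framework the same operation would presumably be done on the FC weight tensor once, after training on each incremental step.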