[Paper Summary]: Faster CNNs with Direct Sparse Convolutions and Guided Pruning

Table of Contents

Abstract
Introduction
Algorithm: Direct Sparse Convolution
An evaluation model
GUIDED SPARSITY LEARNING (GSL)
Experiments
Conclusion
Supplement: Sparse matrix storage formats
- COO: Coordinate
- CSR: Compressed Sparse Row
- CSC: Compressed Sparse Column

This paper was published as a conference paper at ICLR 2017. The authors, Jongsoo Park, Sheng Li, and colleagues, are from Intel Labs.

Abstract

The number of parameters needed in CNNs, however, are often large and undesirable. Consequently, various methods have been developed to prune a CNN once it is trained. Nevertheless, the resulting CNNs offer limited benefits. CNNs keep growing in importance, which makes this a real problem.
Earlier methods mostly prune the FC layers, while the conv layers, which affect speed the most, are left unpruned.
We present a method to realize simultaneously size economy and speed improvement while pruning CNNs. The authors therefore propose a pruning method aimed at conv layers.
Paramount to our success is an efficient general sparse-with-dense matrix multiplication implementation that is applicable to convolution of feature maps with kernels of arbitrary sparsity patterns. This is an efficient algorithm for multiplying a sparse matrix by a dense matrix, and it applies to any convolution between dense feature maps and sparse kernels.
Complementing this, we developed a performance model that predicts sweet spots of sparsity levels for different layers and on different computer architectures. In other words, the model predicts the best sparsity level for each layer and each architecture.
Together, these two allow us to demonstrate 3.1–7.3× convolution speedups over dense convolution in AlexNet.
Project repository: https://github.com/IntelLabs/SkimCaffe.

Introduction

Han et al. (2015; 2016b) apply post-processing such as sparsification and tensor decomposition, but their pruning mainly shrinks the network (most of the removed connections are in the fc layers) and does not address the amount of computation.
The crux of speed improvement thus lies in actual fast convolution of sparse kernels with feature maps. Put differently, speeding up a pruned conv layer hinges on an efficient convolution between the dense feature maps and the sparse kernels.
The performance of sparse matrix operations is typically memory bandwidth bound.
(Lebedev & Lempitsky, 2015; Wen et al., 2016) propose group-wise sparsity patterns.
Convolutions in CNNs involve multiple channels and thus offer much higher data reuse than typical sparse matrix operations in scientific computing. Unlike earlier papers, the authors look at conv-layer compression from this angle: convolution offers a great deal of data reuse.
Specifically, we present a highly efficient direct sparse convolution design formulated as sparse-matrix-dense-matrix multiplication with the dense matrix columns generated on-the-fly from a single column vector. The dense matrix is never stored; its columns are produced from one column vector as the sparse-dense multiplication proceeds.
We formulate a performance model to elucidate when and how best to use sparse convolutions on different computer architectures and at different CNN layers. This model predicts the best sparsity level.
The pruning algorithm is called Guided Sparsity Learning (GSL). Guided by GSL, the sparse convolution design and the performance model together make it possible to prune a CNN in a co-design fashion.
In some cases, particular layers are identified as best not pruned at all due to no potential speedups, leaving them unchanged gives other layers more room for gainful pruning in terms of size and speed. That is, some layers are predicted to be not worth pruning, which leaves other layers more room to benefit.

Contributions

A high performance sparse convolution design that takes advantage of arbitrary sparsity patterns and outperforms dense convolution even with a moderate sparsity.
A general performance model that (1) projects speedups over dense convolutions on varying level/types of sparsity and different computing platforms and (2) provides training guidelines for precisely targeting layers and sparsity ranges with potential to accelerate inference.
Guided Sparsity Learning (GSL), the first pruning algorithm fusing the awareness of speedup potential into sparsity learning; and its application to AlexNet and GoogLeNet. In particular, in GoogLeNet, we prune out more than 80% of parameters of all 5×5/3×3 conv layers and fc layers with no accuracy drop.
An optimized sparse convolution implementation (http://github.com/IntelLabs/SkimCaffe) that provides 7.3×, 3.4×, 3.1× speedups of convolution layers in AlexNet over dense methods on Intel Atom, Xeon, and Knights Landing processors, respectively, with no accuracy drop. In particular, this paper is one of the first evaluations of Xeon Phi processors on deep learning algorithms.

Algorithm: Direct Sparse Convolution

Sparse acceleration of conv layers

A sparse convolution for the all output positions across all output channels can be eventually considered as a virtual sparse-matrix-dense-matrix multiplication.
Suppose there are N filters, each of size R×S, and the input has C channels. The layer's weights then form an N×C×R×S tensor W, the input is a C×H(in)×W(in) tensor I, and the output is an N×H(out)×W(out) tensor O. Each output element is the dot product of two 3D tensors:

$$O(n,y,x)=\sum_{c=0}^{C-1} \sum_{r=0}^{R-1} \sum_{s=0}^{S-1} W(n,c,r,s)\, I(c,y+r,x+s)$$

This operation can be viewed as first vectorizing the 3D subtensor of W corresponding to the nth output channel, then vectorizing I (denoted as vec(I)), and finally stretching the first vector to match the dimension of two vectors. The output is then an ordinary dot product of the two vectors. When W is sparse, it becomes a sparse-vector-dense-vector dot product.

Consider flattening dimensions except the first one of W into a sparse matrix W(1) (i.e. mode-1 matricization of W as in Kolda & Bader (2009)), with its row vectors stretched to match the dimension of vec(I).
O(n, y, x) is then the dot-product between the nth row of W(1) and vec(I).
Subsequently, the values at the same given (y, x)th position of all N output channels can be computed collectively as a sparse-matrix-dense-vector multiplication (SpMV):

$$O_{1:N}(y,x) = W_{(1)} \cdot \mathrm{vec}(I_{y,x})$$

where I y,x denotes the tensor I with its last two dimensions shifted by (y, x).
we operate with a virtual dense matrix, I(virtual), where its columns are generated on the fly by adjusting indices through which we access vec(I). So the dense matrix is only virtual: each column is produced on demand by offsetting the indices into vec(I), and nothing is copied.
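The formulation above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the SkimCaffe implementation: it assumes stride 1 and no padding, and the function names (`dense_conv`, `to_stretched_csr`, `direct_sparse_conv`) as well as the tiny tensors are made up for this note. Each nonzero weight W(n,c,r,s) is stored with the "stretched" column index (c·H + r)·W(in) + s, so that adding y·W(in) + x to it addresses I(c, y+r, x+s) inside vec(I) directly.

```python
def dense_conv(W, I):
    """Reference: O(n,y,x) = sum over c,r,s of W[n][c][r][s] * I[c][y+r][x+s]."""
    N, C, R, S = len(W), len(W[0]), len(W[0][0]), len(W[0][0][0])
    H, Wd = len(I[0]), len(I[0][0])
    Ho, Wo = H - R + 1, Wd - S + 1
    return [[[sum(W[n][c][r][s] * I[c][y + r][x + s]
                  for c in range(C) for r in range(R) for s in range(S))
              for x in range(Wo)] for y in range(Ho)] for n in range(N)]


def to_stretched_csr(W, H, Wd):
    """Flatten W into CSR with one row per output channel.

    Column indices are positions in vec(I) for output position (0, 0)."""
    rowptr, colidx, values = [0], [], []
    for filt in W:
        for c, plane in enumerate(filt):
            for r, row in enumerate(plane):
                for s, w in enumerate(row):
                    if w != 0:
                        colidx.append((c * H + r) * Wd + s)
                        values.append(w)
        rowptr.append(len(values))
    return rowptr, colidx, values


def direct_sparse_conv(rowptr, colidx, values, I, R, S):
    """Sparse convolution that reads vec(I) through shifted indices (no im2col copy)."""
    H, Wd = len(I[0]), len(I[0][0])
    Ho, Wo = H - R + 1, Wd - S + 1
    vec_I = [v for plane in I for row in plane for v in row]
    N = len(rowptr) - 1
    O = [[[0] * Wo for _ in range(Ho)] for _ in range(N)]
    for n in range(N):
        for j in range(rowptr[n], rowptr[n + 1]):   # nonzeros of filter n
            off, w = colidx[j], values[j]
            for y in range(Ho):
                base = off + y * Wd                 # shift by (y, 0)
                for x in range(Wo):                 # shift by (0, x)
                    O[n][y][x] += w * vec_I[base + x]
    return O


# Hypothetical data: N=2 filters, C=2 channels, 2x2 kernels, 3x3 input.
W = [
    [[[1, 0], [0, 0]], [[0, 0], [0, 2]]],
    [[[0, 0], [3, 0]], [[0, 0], [0, 0]]],
]
I = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[9, 8, 7], [6, 5, 4], [3, 2, 1]],
]
rowptr, colidx, values = to_stretched_csr(W, H=3, Wd=3)
O_sparse = direct_sparse_conv(rowptr, colidx, values, I, R=2, S=2)
O_dense = dense_conv(W, I)
print(O_sparse == O_dense)
```

The work in the sparse version is proportional to the number of nonzero weights times the number of output positions, and each nonzero weight is reused across all output positions, which is the data reuse the paper relies on.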

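Sparse acceleration of fully connected layers

Although most of the computation is in the convolutional layers (hence the paper's focus), the paper also briefly discusses exploiting sparsity in fully connected layers. This is actually simpler than for convolutional layers, because fully connected layers are implemented as GEMM and can build on prior work on sparse-matrix-dense-matrix multiplication (SpMDM).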

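FLOP: floating-point operation; FLOP/s (FLOPS): floating-point operations per second
SGEMM: Single-precision General Matrix Multiply
SpMDM: Sparse-matrix-dense-matrix multiplication

As with sparse convolution, the arithmetic intensity of SpMDM decreases as sparsity increases, and its achieved FLOP/s is lower than that of GEMM (General Matrix Multiply).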

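The authors briefly discuss how sparse matrices arising from CNNs differ from those in scientific computing or graph analytics (Davis & Hu, 2011). Sparse matrices in scientific computing typically have banded non-zero patterns that lead to high temporal locality. The authors observe, however, that matrices in sparse CNNs do not exhibit such banded patterns, and that reordering algorithms such as reverse Cuthill-McKee (Cuthill & McKee, 1969) barely improve locality. Existing sparse linear algebra libraries, such as the routines in Intel MKL that are optimized for scientific matrices, therefore do not deliver the best performance, and a custom implementation like the authors' is needed. Graph analytics is another area that makes heavy use of sparse matrices. Those matrices are much larger and often follow a "power-law" distribution, which calls for different data structures such as doubly compressed sparse column (Buluç & Gilbert, 2008).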

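The authors observe that in the fc layers of AlexNet the input features are also highly sparse (up to 85%), which motivates using sparse-matrix-sparse-matrix multiplication (SpGEMM). Its measured performance, however, is lower than that of SpMDM. One challenge is that the size of the output matrix is not known a priori, so parallelization needs two passes over the input matrices: the first to count the non-zeros in each output row, the second to do the actual computation (Gustavson, 1978). There are workarounds, such as producing a dense output matrix or keeping a separate sparse output matrix per thread, but they come with overheads of their own. As a result, even a state-of-the-art SpGEMM implementation on a recent Xeon processor, which is significantly faster than its GPU counterparts, does not exceed 20 GFLOP/s (Patwary et al., 2015). In addition, the sparsity of fc-layer inputs depends heavily on the activation function. ReLU, with its zero slope on the negative side, is currently a popular choice and yields high sparsity, but there is no guarantee this will continue. It remains interesting to see whether these challenges can be overcome so that SpGEMM (generalized sparse matrix-matrix multiplication) can further improve fc-layer performance in sparse CNNs.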

An evaluation model

The performance of sparse convolution depends highly on the sparsity level of the weight tensor.
The authors develop a performance model to determine the appropriate target sparsity range for pruning and to project theoretical speedup for any given sparsity, using the roofline model.
Let C be the number of floating-point operations required by a convolution (in FLOP), S(A) the size of the input and output activation tensors (in bytes), and S(W) the size of the weight tensor (in bytes), all without considering sparsity.
We denote the density of non-zero in filters as x (the lower the x, the higher the sparsity of weight tensor), the compute capability of processor as F (in FLOP/s), and the memory bandwidth as B (in B/s). With t(dense) the time for dense convolution, t(sparse_compute) the time for sparse convolution bound by compute, and t(sparse_bw) the time for sparse convolution bound by bandwidth, the theoretical speedup can be modeled as:

$$\text{speedup} = \frac{t_{\text{dense}}}{\max\left(t_{\text{sparse\_compute}},\ t_{\text{sparse\_bw}}\right)} = \frac{C/F}{\max\left(\alpha x C/F,\ (S_A + \beta x S_W)/B\right)}$$

where α and β denote the compute and storage overheads of sparse representations, respectively.
We observe α ∼ 3 on a Xeon E5-2697 v4 processor, and β is typically 2 (in compressed sparse row representation, we need 4B column index for each 4B single-precision floating point non-zero value)
There is an upper bound of useful sparsity, and a sparsity higher than it does not provide additional speedup, while only making training more challenging to preserve accuracy.
This upper bound can be found by solving for x such that t(sparse_compute) = t(sparse_bw) (e.g., the upper bound sparsity for conv5 of AlexNet on the Xeon is x ∼ 0.02).
The speedup to sparsity relation also varies over layers. For example, since 1×1 convolutions in GoogLeNet have low arithmetic intensity to begin with, their performance quickly becomes bandwidth bound at lower sparsity (or higher x). Layers therefore differ in how sensitive they are to sparsity.
There is a lower bound of useful sparsity such that, with a sparsity lower than that, sparse convolution becomes slower than dense convolution, since t(sparse_compute) > t(dense) for x > 1/α.
The previous section described our sparse convolution implementation that achieves α=3 (since α is the compute overhead, lower is better) on the Xeon instead of α=100 as conjectured by Szegedy et al. (2015).
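To make the two bounds concrete, here is a small Python sketch of the roofline-style model, assuming t(dense) = C/F, t(sparse_compute) = αxC/F, and t(sparse_bw) = (S(A) + βxS(W))/B. The function names and every number in `layer` are hypothetical placeholders chosen for illustration; they are not measurements from the paper.

```python
def projected_speedup(x, C, S_A, S_W, F, B, alpha=3.0, beta=2.0):
    """Projected speedup of sparse over dense convolution at weight density x."""
    t_dense = C / F                            # dense convolution, compute bound
    t_sparse_compute = alpha * x * C / F       # sparse time if compute bound
    t_sparse_bw = (S_A + beta * x * S_W) / B   # sparse time if bandwidth bound
    return t_dense / max(t_sparse_compute, t_sparse_bw)


def useful_density_range(C, S_A, S_W, F, B, alpha=3.0, beta=2.0):
    """Return (x_min, x_max), the range of densities where more sparsity still helps.

    x_max = 1/alpha: at higher density, sparse is slower than dense.
    x_min solves t_sparse_compute == t_sparse_bw: at lower density the layer is
    bandwidth bound. None means it is bandwidth bound at every density."""
    x_max = 1.0 / alpha
    denom = alpha * C / F - beta * S_W / B
    x_min = (S_A / B) / denom if denom > 0 else None
    return x_min, x_max


# Hypothetical layer and machine (placeholder values, not from the paper).
layer = dict(C=2e9, S_A=4e6, S_W=8e6, F=1e12, B=1e11)
x_min, x_max = useful_density_range(**layer)
for x in (0.5, x_max, 0.1, x_min, x_min / 2):
    print(f"density {x:.4f}: projected speedup {projected_speedup(x, **layer):.2f}x")
```

In this made-up setting, halving the density below `x_min` barely changes the projected speedup, which is the sense in which sparsity beyond the upper bound is not useful.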

GUIDED SPARSITY LEARNING (GSL)

For example, layers like the first layer in AlexNet and GoogLeNet may not provide enough sparsity regardless of the amount of regularization applied as long as the original inference accuracy is to be preserved. Features from the shallow layers matter a great deal, so the first layer is usually not sparsified much.
When GSL is used with element-wise regularization for pruning, thus denoted as Guided Element-wise Sparsity Learning (GESL), it learns the element-wise sparsity of layers where the model predicts speedups.
GSL can also be used with regularization methods that are more complicated than basic ridge and lasso regularization.
Although GSL as described above aims primarily at inference speed, GSL can balance the implications of pruning on inference speed, accuracy, and model size. To do this, optional constraints can be given to GSL to prioritize pruning of different layers in the network. For example, by using different regularization strengths on conv and fc, we can tune the priorities on speed and model size.
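The "guided" part can be pictured as a per-layer decision made before training. The sketch below is only one possible reading of the description above and is not the paper's algorithm: the function `plan_regularization`, the layer names, the `projected_speedup` values, and the strength values are all hypothetical. Layers for which the performance model projects no speedup get no sparsity regularization, and conv and fc layers receive separate strengths to trade speed against model size.

```python
def plan_regularization(layers, conv_strength, fc_strength):
    """Assign a sparsity-regularization strength to each layer.

    layers maps a layer name to a dict with:
      kind              -- "conv" or "fc"
      projected_speedup -- best speedup the performance model projects for the layer
    """
    plan = {}
    for name, layer in layers.items():
        if layer["projected_speedup"] <= 1.0:
            plan[name] = 0.0            # no projected gain: leave the layer dense
        elif layer["kind"] == "conv":
            plan[name] = conv_strength  # priority on inference speed
        else:
            plan[name] = fc_strength    # priority on model size
    return plan


# Hypothetical projections for three layers (made-up numbers).
layers = {
    "conv1": {"kind": "conv", "projected_speedup": 0.8},
    "conv2": {"kind": "conv", "projected_speedup": 2.5},
    "fc6": {"kind": "fc", "projected_speedup": 1.6},
}
plan = plan_regularization(layers, conv_strength=1e-4, fc_strength=5e-5)
print(plan)
```

Leaving `conv1` unregularized mirrors the point made in the text: a layer that cannot be sped up is better left dense, which gives the remaining layers more room to be pruned without losing accuracy.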

Experiments

Conclusion

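Powerful and widely used as CNNs are, they are typically very computationally demanding.
Pruning as a post-processing step has been effective at drastically reducing model size while improving inference speed only modestly.
This paper aims to realize the potential performance benefit more fully by reducing the FLOP count through pruning of convolution kernels.
It combines a high-performance direct sparse convolution method with a performance model and proposes a guided approach for pruning CNNs in a co-design fashion, applicable to different computer architectures and to the different layers of the CNN in question.
In particular, it demonstrates 3.1-7.3× convolution speedups in AlexNet on a variety of platforms, all measured against extensively optimized dense linear algebra operations.
Looking ahead, since the paper shows that pruning can significantly boost inference speed in addition to reducing model size, further pruning techniques should be explored. Although the proposed direct sparse convolution is successful, the performance model also shows that sparse convolution cannot speed up every convolutional layer, as seen with the 1×1 convolutions in GoogLeNet.
The authors plan to extend their performance model to cover other FLOP-reduction methods such as FFT, Winograd, and tensor factorization, so that they can make informed decisions about the best-performing method for each layer and guide the training process accordingly.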

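Supplement: Sparse matrix storage formats

COO: Coordinate

COO stores the row index, column index, and value of every nonzero entry in the matrix.

COO advantage: it is easy to convert into other sparse formats (CSR and so on). Writing code that turns libsvm-format data into COO is straightforward, so COO works well as the intermediate step between libsvm and the other sparse storage formats.
COO drawback: it does not directly support arithmetic operations or slicing; convert to CSR or CSC first.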
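As a minimal sketch of the idea, the plain-Python helpers below (hypothetical names `dense_to_coo` and `coo_to_dense`, written for this note and not part of any library) build the three COO arrays for a small made-up matrix and rebuild the matrix from them. The example matrix has an all-zero last row and last column to show why the shape has to be stored alongside the triples.

```python
def dense_to_coo(dense):
    """Return (rows, cols, data, shape) for a dense matrix given as a list of lists."""
    rows, cols, data = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                rows.append(i)
                cols.append(j)
                data.append(v)
    return rows, cols, data, (len(dense), len(dense[0]))


def coo_to_dense(rows, cols, data, shape):
    """Rebuild the dense matrix; the shape cannot be recovered from the triples alone."""
    dense = [[0] * shape[1] for _ in range(shape[0])]
    for i, j, v in zip(rows, cols, data):
        dense[i][j] += v
    return dense


A = [
    [1, 0, 2, 0],
    [0, 0, 3, 0],
    [0, 0, 0, 0],
]
rows, cols, data, shape = dense_to_coo(A)
print(rows, cols, data, shape)

# Guessing the shape from the largest indices would drop the empty row and column.
guessed_shape = (max(rows) + 1, max(cols) + 1)
print(guessed_shape, coo_to_dense(rows, cols, data, shape) == A)
```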

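Reading libsvm-format data:

# Read libsvm-format data into a sparse matrix.
# Example line:
# 0 5:1 9:1 140858:1 445908:1 446177:1 446293:1 449140:1 490778:1 491626:1 491634:1 491641:1 491645:1 491648:1 491668:1 491700:1 491708:1
import numpy as np

# INPUT_DIM is not defined in the original snippet. It must be the number of
# feature columns, i.e. larger than the largest index in the data.
INPUT_DIM = 500000  # placeholder value (assumption)

def read_data(file_name):
    X = []  # column (feature) indices of each sample
    D = []  # feature values of each sample
    y = []  # labels
    with open(file_name) as fin:
        for line in fin:
            fields = line.strip().split()
            y_i = int(fields[0])
            X_i = [int(x.split(':')[0]) for x in fields[1:]]
            D_i = [int(x.split(':')[1]) for x in fields[1:]]
            y.append(y_i)
            X.append(X_i)
            D.append(D_i)
    y = np.reshape(np.array(y), [-1])
    # libsvm_2_coo is the conversion helper given in the next snippet.
    X = libsvm_2_coo(zip(X, D), (len(X), INPUT_DIM)).tocsr()
    return X, y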

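Converting libsvm data to COO:

import numpy as np
from scipy.sparse import coo_matrix

def libsvm_2_coo(libsvm_data, shape):
    coo_rows = []
    coo_cols = []
    coo_data = []
    n = 0  # index of the current sample (row)
    for x, d in libsvm_data:
        coo_rows.extend([n] * len(x))  # every entry of sample n lives in row n
        coo_cols.extend(x)
        coo_data.extend(d)
        n += 1
    coo_rows = np.array(coo_rows)
    coo_cols = np.array(coo_cols)
    coo_data = np.array(coo_data)
    return coo_matrix((coo_data, (coo_rows, coo_cols)), shape=shape)

Note: the final coo_matrix() call must be given shape. COO keeps only the coordinates of the entries that have values, so without shape the dimensions are inferred from the largest indices present and the original matrix size cannot be recovered.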

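CSR: Compressed Sparse Row

CSR is one of the more standard formats. It also needs three arrays: the values, the column indices, and the row offsets. Unlike COO, CSR is not a list of triples but an encoding of the matrix as a whole. The values and column indices are the same as in COO (each element together with its column), and each row offset gives the position in values where that row's first element starts. The row-offsets array has #rows + 1 entries. In the figure above, 0, 2, 4, 7 are the positions in values of the first nonzero element of rows one through four: element 1 of the first row is at offset 0, element 2 of the second row at offset 2, element 5 of the third row at offset 4, and element 6 of the fourth row at offset 7. The total number of nonzero elements, 9 in this example, is appended as the last row offset.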
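The figure is not reproduced in this text, so the sketch below assumes the 4×4 matrix implied by the row offsets quoted above together with the CSC arrays listed in the CSC section of this post. The helper names `dense_to_csr` and `csr_matvec` are hypothetical and written in plain Python for illustration; they show how the three arrays are built and how a matrix-vector product walks them row by row.

```python
# Matrix assumed for the missing figure (reconstructed, see the note above).
M = [
    [1, 7, 0, 0],
    [0, 2, 8, 0],
    [5, 0, 3, 9],
    [0, 6, 0, 4],
]


def dense_to_csr(dense):
    """Return (values, col_indices, row_offsets) of a dense list-of-lists matrix."""
    values, col_indices, row_offsets = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_indices.append(j)
        row_offsets.append(len(values))  # where the next row starts
    return values, col_indices, row_offsets


def csr_matvec(values, col_indices, row_offsets, vec):
    """Multiply a CSR matrix by a dense vector, one row at a time."""
    out = []
    for i in range(len(row_offsets) - 1):
        start, end = row_offsets[i], row_offsets[i + 1]
        out.append(sum(values[k] * vec[col_indices[k]] for k in range(start, end)))
    return out


values, col_indices, row_offsets = dense_to_csr(M)
y = csr_matvec(values, col_indices, row_offsets, [1, 1, 1, 1])
print(values, col_indices, row_offsets, y)
```

Row i occupies the contiguous range `row_offsets[i]:row_offsets[i + 1]`, which is why row slicing and matrix-vector products are cheap in this format.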

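CSR advantages: efficient arithmetic such as CSR + CSR and CSR * CSR, efficient row slicing, and fast matrix-vector products. Drawbacks: column slicing is slow (compared with CSC), and changes to the sparsity structure are expensive (compared with LIL or DOK). Among sparse storage formats, CSR has the most stable number of bytes per nonzero entry (about 8.5 for float and about 12.5 for double). CSR is commonly used for sparse matrix computation once the data has been read in.

Let B be a large matrix and S a small one.
In CSR format, S×B is the faster product, taking about half the time of B×S.
In CSC format, B×S is the faster product, taking about half the time of S×B.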

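CSC: Compressed Sparse Column

CSC is the column-wise counterpart of CSR: the matrix is compressed by columns.
For the matrix in the figure above:
Values: [1 5 7 2 6 8 3 9 4]
Row Indices: [0 2 0 1 3 1 2 2 3]
Column Offsets: [0 2 5 7 9]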
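These arrays can be checked with a few lines of plain Python. The matrix `M` below is the one the three arrays decode to (the figure itself is not reproduced here), and `dense_to_csc` and `csc_column` are hypothetical helper names used only for this illustration.

```python
# The 4x4 matrix that the Values / Row Indices / Column Offsets above decode to.
M = [
    [1, 7, 0, 0],
    [0, 2, 8, 0],
    [5, 0, 3, 9],
    [0, 6, 0, 4],
]


def dense_to_csc(dense):
    """Return (values, row_indices, col_offsets), scanning column by column."""
    n_rows, n_cols = len(dense), len(dense[0])
    values, row_indices, col_offsets = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            if dense[i][j] != 0:
                values.append(dense[i][j])
                row_indices.append(i)
        col_offsets.append(len(values))  # where the next column starts
    return values, row_indices, col_offsets


def csc_column(values, row_indices, col_offsets, j):
    """Return column j as (row, value) pairs; it is one contiguous slice."""
    start, end = col_offsets[j], col_offsets[j + 1]
    return list(zip(row_indices[start:end], values[start:end]))


values, row_indices, col_offsets = dense_to_csc(M)
column_1 = csc_column(values, row_indices, col_offsets, 1)
print(values, row_indices, col_offsets, column_1)
```

A column is a contiguous slice of the arrays, which is why column slicing is fast in CSC and slow in CSR.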