Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko

Google LLC

Abstract

The rising popularity of intelligent mobile devices and the daunting computational cost of deep learning-based models call for efficient and accurate on-device inference schemes. We propose a quantization scheme that allows inference to be carried out using integer-only arithmetic, which can be implemented more efficiently than floating point inference on commonly available integer-only hardware. We also co-design a training procedure to preserve end-to-end model accuracy post quantization. As a result, the proposed quantization scheme improves the tradeoff between accuracy and on-device latency. The improvements are significant even on MobileNets, a model family known for run-time efficiency, and are demonstrated in ImageNet classification and COCO detection on popular CPUs.


1. Introduction

Current state-of-the-art Convolutional Neural Networks (CNNs) are not well suited for use on mobile devices. Since the advent of AlexNet [20], modern CNNs have primarily been appraised according to classification / detection accuracy. Thus network architectures have evolved without regard to model complexity and computational efficiency. On the other hand, successful deployment of CNNs on mobile platforms such as smartphones, AR/VR devices (HoloLens, Daydream), and drones requires small model sizes to accommodate limited on-device memory, and low latency to maintain user engagement. This has led to a burgeoning field of research that focuses on reducing the model size and inference time of CNNs with minimal accuracy losses.


Approaches in this field roughly fall into two categories. The first category, exemplified by MobileNet [10], SqueezeNet [16], ShuffleNet [32], and DenseNet [11], designs novel network architectures that exploit computation / memory efficient operations. The second category quantizes the weights and / or activations of a CNN from 32 bit floating point into lower bit-depth representations. This methodology, embraced by approaches such as Ternary weight networks (TWN [22]), Binary Neural Networks (BNN [14]), XNOR-net [27], and more [8, 21, 26, 33, 34, 35], is the focus of our investigation. Despite their abundance, current quantization approaches are lacking in two respects when it comes to trading off latency with accuracy.


First, prior approaches have not been evaluated on a reasonable baseline architecture. The most common baseline architectures, AlexNet [20], VGG [28] and GoogleNet [29], are all over-parameterized by design in order to extract marginal accuracy improvements. Therefore, it is easy to obtain sizable compression of these architectures, reducing quantization experiments on these architectures to proof-of-concepts at best. Instead, a more meaningful challenge would be to quantize model architectures that are already efficient at trading off latency with accuracy, e.g. MobileNets.


Second, many quantization approaches do not deliver verifiable efficiency improvements on real hardware. Approaches that quantize only the weights ([2, 4, 8, 33]) are primarily concerned with on-device storage and less with computational efficiency. Notable exceptions are binary, ternary and bit-shift networks [14, 22, 27]. These latter approaches employ weights that are either 0 or powers of 2, which allow multiplication to be implemented by bit shifts. However, while bit-shifts can be efficient in custom hardware, they provide little benefit on existing hardware with multiply-add instructions that, when properly used (i.e. pipelined), are not more expensive than additions alone. Moreover, multiplications are only expensive if the operands are wide, and the need to avoid multiplications diminishes with bit depth once both weights and activations are quantized. Notably, these approaches rarely provide on-device measurements to verify the promised timing improvements. More runtime-friendly approaches quantize both the weights and the activations into 1 bit representations [14, 27, 34]. With these approaches, both multiplications and additions can be implemented by efficient bit-shift and bit-count operations, which are showcased in custom GPU kernels (BNN [14]). However, 1 bit quantization often leads to substantial performance degradation, and may be overly stringent on model representation.


In this paper we address the above issues by improving the latency-vs-accuracy tradeoffs of MobileNets on common mobile hardware. Our specific contributions are:

  • We provide a quantization scheme (section 2.1) that quantizes both weights and activations as 8-bit integers, and just a few parameters (bias vectors) as 32-bit integers.

  • We provide a quantized inference framework that is efficiently implementable on integer-arithmetic-only hardware such as the Qualcomm Hexagon (sections 2.2, 2.3), and we describe an efficient, accurate implementation on ARM NEON (Appendix B).

  • We provide a quantized training framework (section 3) co-designed with our quantized inference to minimize the loss of accuracy from quantization on real models.

  • We apply our frameworks to efficient classification and detection systems based on MobileNets and provide benchmark results on popular ARM CPUs (section 4) that show significant improvements in the latency-vs-accuracy tradeoffs for state-of-the-art MobileNet architectures, demonstrated in ImageNet classification [3], COCO object detection [23], and other tasks.

Our work draws inspiration from [7], which leverages low-precision fixed-point arithmetic to accelerate the training speed of CNNs, and from [31], which uses 8-bit fixed-point arithmetic to speed up inference on x86 CPUs. Our quantization scheme focuses instead on improving the inference speed vs accuracy tradeoff on mobile CPUs.

2. Quantized Inference

2.1. Quantization scheme

In this section, we describe our general quantization scheme$^{1,2}$, that is, the correspondence between the bit-representation of values (denoted $q$ below, for "quantized value") and their interpretation as mathematical real numbers (denoted $r$ below, for "real value"). Our quantization scheme is implemented using integer-only arithmetic during inference and floating-point arithmetic during training, with both implementations maintaining a high degree of correspondence with each other. We achieve this by first providing a mathematically rigorous definition of our quantization scheme, and separately adopting this scheme for both integer-arithmetic inference and floating-point training.

$^1$ The quantization scheme described here is the one adopted in TensorFlow Lite [5] and we will refer to specific parts of its code to illustrate aspects discussed below.
$^2$ We had earlier described this quantization scheme in the documentation of gemmlowp [18]. That page may still be useful as an alternate treatment of some of the topics developed in this section, and for its self-contained example code.


A basic requirement of our quantization scheme is that it permits efficient implementation of all arithmetic using only integer arithmetic operations on the quantized values (we eschew implementations requiring lookup tables because these tend to perform poorly compared to pure arithmetic on SIMD hardware). This is equivalent to requiring that the quantization scheme be an affine mapping of integers $q$ to real numbers $r$, i.e. of the form
$$r = S(q - Z) \tag{1}$$
for some constants $S$ and $Z$. Equation (1) is our quantization scheme and the constants $S$ and $Z$ are our quantization parameters. Our quantization scheme uses a single set of quantization parameters for all values within each activations array and within each weights array; separate arrays use separate quantization parameters.

For 8-bit quantization, $q$ is quantized as an 8-bit integer (for $B$-bit quantization, $q$ is quantized as a $B$-bit integer). Some arrays, typically bias vectors, are quantized as 32-bit integers, see section 2.4.

The constant $S$ (for "scale") is an arbitrary positive real number. It is typically represented in software as a floating-point quantity, like the real values $r$. Section 2.2 describes methods for avoiding the representation of such floating-point quantities in the inference workload.

The constant $Z$ (for "zero-point") is of the same type as quantized values $q$, and is in fact the quantized value $q$ corresponding to the real value 0. This allows us to automatically meet the requirement that the real value $r = 0$ be exactly representable by a quantized value. The motivation for this requirement is that efficient implementation of neural network operators often requires zero-padding of arrays around boundaries.

Our discussion so far is summarized in the following quantized buffer data structure$^3$, with one instance of such a buffer existing for each activations array and weights array in a neural network. We use C++ syntax because it allows the unambiguous conveyance of types.

$^3$ The actual data structures in the TensorFlow Lite [5] Converter are QuantizationParams and Array in this header file. As we discuss in the next subsection, this data structure, which still contains a floating-point quantity, does not appear in the actual quantized on-device inference code.

template<typename QType>  // e.g. QType=uint8
struct QuantizedBuffer {
  vector<QType> q;  // the quantized values
  float S;          // the scale
  QType Z;          // the zero-point
};
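
To make equation (1) concrete, here is a minimal sketch (our own illustrative helpers, not the TensorFlow Lite or gemmlowp API) of quantizing a real value to uint8 and recovering it from the (S, Z) parameters of such a buffer:

#include <algorithm>
#include <cmath>
#include <cstdint>

// Minimal sketch of the affine quantization of equation (1): r = S * (q - Z).
// Helper names are illustrative only.
std::uint8_t Quantize(float r, float S, std::uint8_t Z) {
  // q = Z + r / S, rounded to nearest and clamped to the uint8 range.
  const float q = std::round(static_cast<float>(Z) + r / S);
  return static_cast<std::uint8_t>(std::min(255.0f, std::max(0.0f, q)));
}

float Dequantize(std::uint8_t q, float S, std::uint8_t Z) {
  return S * (static_cast<int>(q) - static_cast<int>(Z));  // equation (1)
}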

2.2. Integer-arithmetic-only matrix multiplication

We now turn to the question of how to perform inference using only integer arithmetic, i.e. how to use Equation (1) to translate real-numbers computation into quantized-values computation, and how the latter can be designed to involve only integer arithmetic even though the scale values $S$ are not integers.

Consider the multiplication of two square $N \times N$ matrices of real numbers, $r_1$ and $r_2$, with their product represented by $r_3 = r_1 r_2$. We denote the entries of each of these matrices $r_\alpha$ ($\alpha = 1, 2$ or $3$) as $r_\alpha^{(i,j)}$ for $1 \leq i, j \leq N$, and the quantization parameters with which they are quantized as $(S_\alpha, Z_\alpha)$. We denote the quantized entries by $q_\alpha^{(i,j)}$. Equation (1) then becomes:

$$r_\alpha^{(i,j)} = S_\alpha (q_\alpha^{(i,j)} - Z_\alpha). \tag{2}$$

Derivation (how equation (4) below follows from the definition of matrix multiplication in equation (3)):

$$\begin{aligned}
S_3(q_3^{(i,k)} - Z_3) &= \sum_{j=1}^{N} S_1(q_1^{(i,j)} - Z_1)\, S_2(q_2^{(j,k)} - Z_2) \\
q_3^{(i,k)} &= Z_3 + \frac{\sum_{j=1}^{N} S_1(q_1^{(i,j)} - Z_1)\, S_2(q_2^{(j,k)} - Z_2)}{S_3} \\
q_3^{(i,k)} &= Z_3 + \frac{S_1 S_2}{S_3} \sum_{j=1}^{N} (q_1^{(i,j)} - Z_1)(q_2^{(j,k)} - Z_2)
\end{aligned}$$

From the definition of matrix multiplication, we have
$$S_3(q_3^{(i,k)} - Z_3) = \sum_{j=1}^{N} S_1(q_1^{(i,j)} - Z_1)\, S_2(q_2^{(j,k)} - Z_2), \tag{3}$$
which can be rewritten as
$$q_3^{(i,k)} = Z_3 + M \sum_{j=1}^{N} (q_1^{(i,j)} - Z_1)(q_2^{(j,k)} - Z_2), \tag{4}$$
where the multiplier $M$ is defined as
$$M := \frac{S_1 S_2}{S_3}. \tag{5}$$

In Equation (4), the only non-integer is the multiplier $M$. As a constant depending only on the quantization scales $S_1$, $S_2$, $S_3$, it can be computed offline. We empirically find it to always be in the interval $(0, 1)$, and can therefore express it in the normalized form
$$M = 2^{-n} M_0 = \frac{M_0}{2^n} \tag{6}$$
where $M_0$ is in the interval $[0.5, 1)$ and $n$ is a non-negative integer.


The normalized multiplier $M_0$ now lends itself well to being expressed as a fixed-point multiplier (e.g. int16 or int32 depending on hardware capability). For example, if int32 is used, the integer representing $M_0$ is the int32 value nearest to $2^{31} M_0$. Since $M_0 \geq 0.5$, this value is always at least $2^{30}$ and will therefore always have at least 30 bits of relative accuracy. Multiplication by $M_0$ can thus be implemented as a fixed-point multiplication$^4$. Meanwhile, multiplication by $2^{-n}$ can be implemented with an efficient bit-shift, albeit one that needs to have correct round-to-nearest behavior, an issue that we return to in Appendix B.
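
The decomposition of $M$ into $M_0$ and $n$ can be sketched as follows (a helper of our own, mirroring the idea rather than the exact TensorFlow Lite / gemmlowp routine):

#include <cmath>
#include <cstdint>

// Sketch: decompose M = S1*S2/S3, empirically in (0, 1), into M = 2^{-n} * M0
// with M0 in [0.5, 1) stored as a Q31 fixed-point int32 (nearest to 2^31 * M0).
void QuantizeMultiplier(double M, std::int32_t* M0_fixed, int* n) {
  int exponent;
  const double M0 = std::frexp(M, &exponent);  // M = M0 * 2^exponent, M0 in [0.5, 1)
  *n = -exponent;                              // n >= 0 since M < 1
  std::int64_t q = static_cast<std::int64_t>(std::round(M0 * (1ll << 31)));
  if (q == (1ll << 31)) {  // handle M0 rounding up to 1.0
    q /= 2;
    --*n;
  }
  *M0_fixed = static_cast<std::int32_t>(q);
}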

$^4$ The computation discussed in this section is implemented in TensorFlow Lite [5] reference code for a fully-connected layer.


2.3. Efficient handling of zero-points

In order to efficiently implement the evaluation of Equation (4) without having to perform $2N^3$ subtractions and without having to expand the operands of the multiplication into 16-bit integers, we first notice that by distributing the multiplication in Equation (4), we can rewrite it as
$$q_3^{(i,k)} = Z_3 + M \left( N Z_1 Z_2 - Z_1 a_2^{(k)} - Z_2 \overline{a}_1^{(i)} + \sum_{j=1}^{N} q_1^{(i,j)} q_2^{(j,k)} \right) \tag{7}$$

Derivation (expanding the products in equation (4) to obtain (7)):

$$\begin{aligned}
q_3^{(i,k)} &= Z_3 + \frac{S_1 S_2}{S_3} \sum_{j=1}^{N} (q_1^{(i,j)} - Z_1)(q_2^{(j,k)} - Z_2) \\
&= Z_3 + \frac{S_1 S_2}{S_3} \sum_{j=1}^{N} \left( Z_1 Z_2 - Z_1 q_2^{(j,k)} - Z_2 q_1^{(i,j)} + q_1^{(i,j)} q_2^{(j,k)} \right) \\
&= Z_3 + \frac{S_1 S_2}{S_3} \left( N Z_1 Z_2 - Z_1 \sum_{j=1}^{N} q_2^{(j,k)} - Z_2 \sum_{j=1}^{N} q_1^{(i,j)} + \sum_{j=1}^{N} q_1^{(i,j)} q_2^{(j,k)} \right)
\end{aligned}$$

where
$$a_2^{(k)} := \sum_{j=1}^{N} q_2^{(j,k)}, \qquad \overline{a}_1^{(i)} := \sum_{j=1}^{N} q_1^{(i,j)} \tag{8}$$

Each $a_2^{(k)}$ or $\overline{a}_1^{(i)}$ takes only $N$ additions to compute, so they collectively take only $2N^2$ additions. The rest of the cost of the evaluation of (7) is almost entirely concentrated in the core integer matrix multiplication accumulation
$$\sum_{j=1}^{N} q_1^{(i,j)} q_2^{(j,k)} \tag{9}$$
which takes $2N^3$ arithmetic operations; indeed, everything else involved in (7) is $O(N^2)$ with a small constant in the $O$. Thus, the expansion into the form (7) and the factored-out computation of $a_2^{(k)}$ and $\overline{a}_1^{(i)}$ enable low-overhead handling of arbitrary zero-points for anything but the smallest values of $N$, reducing the problem to the same core integer matrix multiplication accumulation (9) as we would have to compute in any other zero-points-free quantization scheme.

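As an illustration (a scalar sketch of our own, not the optimized gemmlowp kernels), the evaluation of (7) can be organized as: precompute the row sums $\overline{a}_1^{(i)}$ and column sums $a_2^{(k)}$ of equation (8), run the core accumulation (9) in int32, and fold in the zero-point terms, leaving the multiplication by $M$ and the addition of $Z_3$ to the output stage of section 2.4:

#include <cstdint>
#include <vector>

// Scalar sketch of equation (7) for square N x N row-major matrices.
// Returns the int32 accumulators; rescaling by M and adding Z3 are done later.
std::vector<std::int32_t> QuantizedMatMulAccumulators(
    const std::vector<std::uint8_t>& q1, std::int32_t Z1,   // weights
    const std::vector<std::uint8_t>& q2, std::int32_t Z2,   // activations
    int N) {
  std::vector<std::int32_t> a1_bar(N, 0);  // row sums of q1, equation (8)
  std::vector<std::int32_t> a2(N, 0);      // column sums of q2, equation (8)
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
      a1_bar[i] += q1[i * N + j];
      a2[j] += q2[i * N + j];
    }
  }
  std::vector<std::int32_t> acc(N * N);
  for (int i = 0; i < N; ++i) {
    for (int k = 0; k < N; ++k) {
      std::int32_t sum = 0;
      for (int j = 0; j < N; ++j) {
        sum += static_cast<std::int32_t>(q1[i * N + j]) * q2[j * N + k];  // (9)
      }
      acc[i * N + k] = N * Z1 * Z2 - Z1 * a2[k] - Z2 * a1_bar[i] + sum;
    }
  }
  return acc;
}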

2.4. Implementation of a typical fused layer

We continue the discussion of section 2.3, but now explicitly define the data types of all quantities involved, and modify the quantized matrix multiplication (7) to merge the bias-addition and activation function evaluation directly into it. This fusing of whole layers into a single operation is not only an optimization. As we must reproduce in inference code the same arithmetic that is used in training, the granularity of fused operators in inference code (taking an 8-bit quantized input and producing an 8-bit quantized output) must match the placement of “fake quantization” operators in the training graph (section 3).

For our implementation on ARM and x86 CPU architectures, we use the gemmlowp library [18], whose GemmWithOutputPipeline entry point supports the fused operations that we now describe$^5$.

$^5$ The discussion in this section is implemented in TensorFlow Lite [5] for e.g. a Convolutional operator (reference code is self-contained, optimized code calls into gemmlowp [18]).


We take the $q_1$ matrix to be the weights, and the $q_2$ matrix to be the activations. Both the weights and activations are of type uint8 (we could have equivalently chosen int8, with suitably modified zero-points). Accumulating products of uint8 values requires a 32-bit accumulator, and we choose a signed type for the accumulator for a reason that will soon become clear. The sum in (9) is thus of the form:
$$\text{int32} \ += \text{uint8} * \text{uint8}. \tag{10}$$

In order to have the quantized bias-addition be the addition of an int32 bias into this int32 accumulator, the bias-vector is quantized such that: it uses int32 as its quantized data type; it uses 0 as its quantization zero-point $Z_{\text{bias}}$; and its quantization scale $S_{\text{bias}}$ is the same as that of the accumulators, which is the product of the scales of the weights and of the input activations. In the notation of section 2.3,
$$S_{\text{bias}} = S_1 S_2, \qquad Z_{\text{bias}} = 0. \tag{11}$$


Although the bias-vectors are quantized as 32-bit values, they account for only a tiny fraction of the parameters in a neural network. Furthermore, the use of higher precision for bias vectors meets a real need: as each bias-vector entry is added to many output activations, any quantization error in the bias-vector tends to act as an overall bias (i.e. an error term with nonzero mean), which must be avoided in order to preserve good end-to-end neural network accuracy$^6$.

$^6$ The quantization of bias-vectors discussed here is implemented here in the TensorFlow Lite [5] Converter.

With the final value of the int32 accumulator, there remain three things left to do: scale down to the final scale used by the 8-bit output activations, cast down to uint8 and apply the activation function to yield the final 8-bit output activation.

The down-scaling corresponds to multiplication by the multiplier $M$ in equation (7). As explained in section 2.2, it is implemented as a fixed-point multiplication by a normalized multiplier $M_0$ and a rounding bit-shift. Afterwards, we perform a saturating cast to uint8, saturating to the range [0, 255].

We focus on activation functions that are mere clamps, e.g. ReLU, ReLU6. Mathematical functions are discussed in appendix A.1 and we do not currently fuse them into such layers. Thus, the only thing that our fused activation functions need to do is to further clamp the uint8 value to some sub-interval of [0, 255] before storing the final uint8 output activation. In practice, the quantized training process (section 3) tends to learn to make use of the whole output uint8 [0, 255] interval so that the activation function no longer does anything, its effect being subsumed in the clamping to [0, 255] implied in the saturating cast to uint8.
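
Putting sections 2.2-2.4 together, the per-output-value pipeline can be sketched as follows (our own scalar code, not gemmlowp's GemmWithOutputPipeline; the round-to-nearest shift is simplified here, see Appendix B for the exact rounding):

#include <algorithm>
#include <cstdint>

// Sketch of the fused output stage: int32 bias addition (equation (11)),
// down-scaling by M = 2^{-n} * M0 with M0 as a Q31 fixed-point int32,
// addition of the output zero-point Z3, and a saturating cast to uint8
// fused with the activation clamp (e.g. [0, 255]).
std::uint8_t OutputStage(std::int32_t acc, std::int32_t bias,
                         std::int32_t M0_fixed, int n, std::int32_t Z3,
                         std::int32_t clamp_min, std::int32_t clamp_max) {
  acc += bias;  // S_bias = S1*S2, Z_bias = 0
  // Fixed-point multiplication by M0: take the high 32 bits of the Q31 product,
  // rounding half up (the exact SQRDMULH-style rounding is discussed in Appendix B).
  const std::int64_t prod = static_cast<std::int64_t>(acc) * M0_fixed;
  std::int32_t scaled = static_cast<std::int32_t>((prod + (1ll << 30)) >> 31);
  if (n > 0) scaled = (scaled + (1 << (n - 1))) >> n;  // simplified rounding shift
  scaled += Z3;
  scaled = std::min(clamp_max, std::max(clamp_min, scaled));
  return static_cast<std::uint8_t>(scaled);
}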


3. Training with simulated quantization

A common approach to training quantized networks is to train in floating point and then quantize the resulting weights (sometimes with additional post-quantization training for fine-tuning). We found that this approach works sufficiently well for large models with considerable representational capacity, but leads to significant accuracy drops for small models. Common failure modes for simple post-training quantization include: 1) large differences (more than 100×) in ranges of weights for different output channels (section 2 mandates that all channels of the same layer be quantized to the same resolution, which causes weights in channels with smaller ranges to have much higher relative error) and 2) outlier weight values that make all remaining weights less precise after quantization.

We propose an approach that simulates quantization effects in the forward pass of training. Backpropagation still happens as usual, and all weights and biases are stored in floating point so that they can be easily nudged by small amounts. The forward propagation pass however simulates quantized inference as it will happen in the inference engine, by implementing in floating-point arithmetic the rounding behavior of the quantization scheme that we introduced in section 2:

  • Weights are quantized before they are convolved with the input. If batch normalization (see [17]) is used for the layer, the batch normalization parameters are “folded into” the weights before quantization, see section 3.2.

  • Activations are quantized at points where they would be during inference, e.g. after the activation function is applied to a convolutional or fully connected layer’s output, or after a bypass connection adds or concatenates the outputs of several layers together such as in ResNets.


For each layer, quantization is parameterized by the number of quantization levels and clamping range, and is performed by applying point-wise the quantization function $q$ defined as follows:

$$\begin{aligned}
\text{clamp}(r; a, b) &:= \min(\max(r, a), b) \\
s(a, b, n) &:= \frac{b - a}{n - 1} \\
q(r; a, b, n) &:= \left\lfloor \frac{\text{clamp}(r; a, b) - a}{s(a, b, n)} \right\rceil s(a, b, n) + a,
\end{aligned} \tag{12}$$

where $r$ is a real-valued number to be quantized, $[a; b]$ is the quantization range, $n$ is the number of quantization levels, and $\lfloor \cdot \rceil$ denotes rounding to the nearest integer. $n$ is fixed for all layers in our experiments, e.g. $n = 2^8 = 256$ for 8-bit quantization.
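
A minimal sketch of this simulated (fake) quantization, written as a standalone C++ function for illustration (the actual implementation is the TensorFlow fake-quantization op released in [19]); only the forward pass is shown, not the straight-through gradient:

#include <algorithm>
#include <cmath>

// Equation (12): clamp r to [a, b], snap it to one of n evenly spaced levels,
// and return the de-quantized (still floating-point) value.
float FakeQuantize(float r, float a, float b, int n) {
  const float s = (b - a) / (n - 1);                  // s(a, b, n)
  const float clamped = std::min(std::max(r, a), b);  // clamp(r; a, b)
  return std::round((clamped - a) / s) * s + a;       // q(r; a, b, n)
}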

3.1. Learning quantization ranges

Quantization ranges are treated differently for weight quantization vs. activation quantization:

  • For weights, the basic idea is simply to set $a := \min w$, $b := \max w$. We apply a minor tweak to this so that the weights, once quantized as int8 values, only range in $[-127, 127]$ and never take the value $-128$, as this enables a substantial optimization opportunity (for more details, see Appendix B).


  • For activations, ranges depend on the inputs to the network. To estimate the ranges, we collect $[a; b]$ ranges seen on activations during training and then aggregate them via exponential moving averages (EMA) with the smoothing parameter being close to 1 so that observed ranges are smoothed across thousands of training steps. Given significant delay in the EMA updating activation ranges when the ranges shift rapidly, we found it useful to completely disable activation quantization at the start of training (say, for 50 thousand to 2 million steps). This allows the network to enter a more stable state where activation quantization ranges do not exclude a significant fraction of values (a minimal sketch of such an EMA update follows after this list).
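
A minimal sketch of such an EMA range update (our own illustration; the smoothing constant below is an assumption, the text only states it is close to 1):

// Tracks an activation range [a, b] as an exponential moving average of the
// per-batch minima and maxima observed during training.
struct EmaRange {
  float a = 0.0f, b = 0.0f;
  bool initialized = false;
  void Update(float batch_min, float batch_max, float decay = 0.999f) {
    if (!initialized) { a = batch_min; b = batch_max; initialized = true; return; }
    a = decay * a + (1.0f - decay) * batch_min;
    b = decay * b + (1.0f - decay) * batch_max;
  }
};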

In both cases, the boundaries $[a; b]$ are nudged so that value 0.0 is exactly representable as an integer $z(a, b, n)$ after quantization. As a result, the learned quantization parameters map to the scale $S$ and zero-point $Z$ in equation 1:
$$S = s(a, b, n), \qquad Z = z(a, b, n) \tag{13}$$
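
A sketch of this nudging (our own helper names): derive $S$ from the learned range and round the ideal zero-point to an integer level, so that the real value 0 maps exactly to $Z$:

#include <algorithm>
#include <cmath>
#include <cstdint>

// From a learned range [a, b] (typically a <= 0 <= b) and n levels (n = 256
// for 8 bits), compute S = s(a, b, n) and an integer zero-point Z = z(a, b, n).
// The boundaries are implicitly nudged to a' = -Z * S and b' = a' + (n - 1) * S.
void NudgeRange(float a, float b, int n, float* S, std::int32_t* Z) {
  *S = (b - a) / (n - 1);
  const float Z_real = -a / *S;  // ideal (real-valued) zero-point
  *Z = static_cast<std::int32_t>(
      std::round(std::min(std::max(Z_real, 0.0f), static_cast<float>(n - 1))));
}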

Below we depict simulated quantization assuming that the computations of a neural network are captured as a TensorFlow graph [1]. A typical workflow is described in Algorithm 1. Optimization of the inference graph by fusing and removing operations is outside the scope of this paper. Source code for graph modifications (inserting fake quantization operations, creating and optimizing the inference graph) and a low bit inference engine has been open-sourced with TensorFlow contributions in [19].


Algorithm 1 Quantized graph training and inference

1: Create a training graph of the floating-point model.
2: Insert fake quantization TensorFlow operations in locations where tensors will be downcasted to fewer bits during inference according to equation 12.
3: Train in simulated quantized mode until convergence.
4: Create and optimize the inference graph for running in a low bit inference engine.
5: Run inference using the quantized inference graph.



Figure 1.1: Integer-arithmetic-only quantization. a) Integer-arithmetic-only inference of a convolution layer. The input and output are represented as 8-bit integers according to equation 1. The convolution involves 8-bit integer operands and a 32-bit integer accumulator. The bias addition involves only 32-bit integers (section 2.4). The ReLU6 nonlinearity only involves 8-bit integer arithmetic. b) Training with simulated quantization of the convolution layer. All variables and computations are carried out using 32-bit floating-point arithmetic. Weight quantization (“wt quant”) and activation quantization (“act quant”) nodes are injected into the computation graph to simulate the effects of quantization of the variables (section 3). The resultant graph approximates the integer-arithmetic-only computation graph in panel a), while being trainable using conventional optimization algorithms for floating point models. c) Our quantization scheme benefits from the fast integer-arithmetic circuits in common CPUs to deliver an improved latency-vs-accuracy tradeoff (section 4). The figure compares integer quantized MobileNets [10] against floating point baselines on ImageNet [3] using Qualcomm Snapdragon 835 LITTLE cores.


Figure 1.1 a and b illustrate TensorFlow graphs before and after quantization for a simple convolutional layer. Illustrations of the more complex convolution with a bypass connection in figure C.3 can be found in figure C.4.

Note that the biases are not quantized because they are represented as 32-bit integers in the inference process, with a much higher range and precision compared to the 8 bit weights and activations. Furthermore, quantization parameters used for biases are inferred from the quantization parameters of the weights and activations. See section 2.4.

Typical TensorFlow code illustrating use of [19] follows:

from tf.contrib.quantize import quantize_graph as qg

g = tf.Graph()
with g.as_default():
  output = ...
  total_loss = ...
  optimizer = ...
  train_tensor = ...

if is_training:
  quantized_graph = qg.create_training_graph(g)
else:
  quantized_graph = qg.create_eval_graph(g)

# Train or evaluate quantized_graph.

3.2. Batch normalization folding

For models that use batch normalization (see [17]), there is additional complexity: the training graph contains batch normalization as a separate block of operations, whereas the inference graph has batch normalization parameters “folded” into the convolutional or fully connected layer’s weights and biases, for efficiency. To accurately simulate quantization effects, we need to simulate this folding, and quantize weights after they have been scaled by the batch normalization parameters. We do so with the following:
$$w_{\text{fold}} := \frac{\gamma w}{\sqrt{\text{EMA}(\sigma_B^2) + \varepsilon}} \tag{14}$$
Here $\gamma$ is the batch normalization's scale parameter, $\text{EMA}(\sigma_B^2)$ is the moving average estimate of the variance of convolution results across the batch, and $\varepsilon$ is just a small constant for numerical stability.

After folding, the batch-normalized convolutional layer reduces to the simple convolutional layer depicted in figure 1.1a with the folded weights wfoldw_{\text{fold}}wfold​ and the corresponding folded biases. Therefore the same recipe in figure 1.1b applies. See the appendix for the training graph (figure C.5) for a batch-normalized convolutional layer, the corresponding inference graph (figure C.6), the training graph after batch-norm folding (figure C.7) and the training graph after both folding and quantization (figure C.8).
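
A per-channel sketch of the weight part of this folding (the helper name and the epsilon default are our own; the corresponding bias folding is analogous and omitted):

#include <cmath>
#include <vector>

// Fold batch normalization into the convolution weights of one output channel:
// w_fold = gamma * w / sqrt(EMA(sigma_B^2) + eps), equation (14).
void FoldBatchNormIntoWeights(std::vector<float>* channel_weights, float gamma,
                              float ema_variance, float epsilon = 1e-3f) {
  const float scale = gamma / std::sqrt(ema_variance + epsilon);
  for (float& w : *channel_weights) w *= scale;
}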

4. Experiments

We conducted two sets of experiments, one showcasing the effectiveness of quantized training (Section 4.1), and the other illustrating the improved latency-vs-accuracy tradeoff of quantized models on common hardware (Section 4.2). The most performance-critical part of the inference workload on the neural networks being benchmarked is matrix multiplication (GEMM). We use the gemmlowp library [18] for 8-bit quantized GEMM inference, and the Eigen library [6] for 32-bit floating-point GEMM inference.

4.1. Quantized training of Large Networks

We apply quantized training to ResNets [9] and InceptionV3 [30] on the ImageNet dataset. These popular networks are too computationally intensive to be deployed on mobile devices, but are included for comparison purposes. Training protocols are discussed in Appendix D.1 and D.2.

4.1.1 ResNets

We compare floating-point vs integer-quantized ResNets for various depths in table 4.1. Accuracies of integer-only quantized networks are within 2% of their floating-point counterparts.


Table 4.1: ResNet on ImageNet: Floating-point vs quantized network accuracy for various network depths.

We also list ResNet50 accuracies under different quantization schemes in table 4.2. As expected, integer-only quantization outperforms FGQ [26], which uses 2 bits for weight quantization. INQ [33] (5-bit weight floating-point activation) achieves a similar accuracy as ours, but we provide additional run-time improvements (see section 4.2).


Table 4.2: ResNet on ImageNet: Accuracy under various quantization schemes, including binary weight networks (BWN [21, 15]), ternary weight networks (TWN [21, 22]), incremental network quantization (INQ [33]) and fine-grained quantization (FGQ [26])


4.1.2 Inception v3 on ImageNet

We compare the Inception v3 model quantized into 8 and 7 bits, respectively. 7-bit quantization is obtained by setting the number of quantization levels in equation 12 to $n = 2^7$. We additionally probe the sensitivity of activation quantization by comparing networks with two activation nonlinearities, ReLU6 and ReLU. The training protocol is in Appendix D.2.

Table 4.3 shows that 7-bit quantized training produces model accuracies close to that of 8-bit quantized training, and quantized models with ReLU6 have less accuracy degradation. The latter can be explained by noticing that ReLU6 introduces the interval [0, 6] as a natural range for activations, while ReLU allows activations to take values from a possibly larger interval, with different ranges in different channels. Values in a fixed range are easier to quantize with high precision.



Table 4.3: Inception v3 on ImageNet: Accuracy and recall 5 comparison of floating point and quantized models.

4.2. Quantization of MobileNets

MobileNets are a family of architectures that achieve a state-of-the-art tradeoff between on-device latency and ImageNet classification accuracy. In this section we demonstrate how integer-only quantization can further improve the tradeoff on common hardware.

4.2.1 ImageNet

We benchmarked the MobileNet architecture with varying depth-multipliers (DM) and resolutions on ImageNet on three types of Qualcomm cores, which represent three different micro-architectures: 1) Snapdragon 835 LITTLE core (figure 1.1c), a power-efficient processor found in Google Pixel 2; 2) Snapdragon 835 big core (figure 4.1), a high-performance core employed by Google Pixel 2; and 3) Snapdragon 821 big core (figure 4.2), a high-performance core used in Google Pixel 1.


Figure 4.1: ImageNet classifier on Qualcomm Snapdragon 835 big cores: Latency-vs-accuracy tradeoff of floating-point and integer-only MobileNets.


Figure 4.2: ImageNet classifier on Qualcomm Snapdragon 821: Latency-vs-accuracy tradeoff of floating-point and integer-only MobileNets.

Integer-only quantized MobileNets achieve higher accuracies than floating-point MobileNets given the same run-time budget. The accuracy gap is quite substantial (~10%) for Snapdragon 835 LITTLE cores at the 33ms latency needed for real-time (30 fps) operation. While most of the quantization literature focuses on minimizing accuracy loss for a given architecture, we advocate for a more comprehensive latency-vs-accuracy tradeoff as a better measure. Note that this tradeoff depends critically on the relative speed of floating-point vs integer-only arithmetic in hardware. Floating-point computation is better optimized in the Snapdragon 821, for example, resulting in a less noticeable reduction in latency for quantized models.


4.2.2 COCO

We evaluated quantization in the context of mobile real time object detection, comparing the performance of quantized 8-bit and float models of MobileNet SSD [10, 25] on the COCO dataset [24]. We replaced all the regular convolutions in the SSD prediction layers with separable convolutions (depthwise followed by $1 \times 1$ projection). This modification is consistent with the overall design of MobileNets and makes them more computationally efficient. We utilized the Open Source TensorFlow Object Detection API [12] to train and evaluate our models. The training protocol is described in Appendix D.3. We also delayed quantization for 500 thousand steps (see section 3.1), finding that it significantly decreases the time to convergence.


Table 4.4 shows the latency-vs-accuracy tradeoff between floating-point and integer-quantized models. Latency was measured on a single thread using Snapdragon 835 cores (big and LITTLE). Quantized training and inference results in up to a 50% reduction in running time, with a minimal loss in accuracy (-1.8% relative).


Table 4.4: Object detection speed and accuracy on COCO dataset of floating point and integer-only quantized models.
Latency (ms) is measured on Qualcomm Snapdragon 835 big and LITTLE cores.

4.2.3 Face detection

To better examine quantized MobileNet SSD on a smaller scale, we benchmarked face detection on the face attribute classification dataset (a Flickr-based dataset used in [10]). We contacted the authors of [10] to evaluate our quantized MobileNets on detection and face attributes following the same protocols (detailed in Appendix D.4).

As indicated by tables 4.5 and 4.6, quantization provides close to a $2\times$ latency reduction with a Qualcomm Snapdragon 835 big or LITTLE core at the cost of a ~2% drop in the average precision. Notably, quantization allows the 25% face detector to run in real-time (1K/28 $\approx$ 36 fps) on a single big core, whereas the floating-point model remains slower than real-time (1K/44 $\approx$ 23 fps).


Table 4.5: Face detection accuracy of floating point and integer-only quantized models. The reported precision / recall is averaged over different precision / recall values where an IOU of $x$ between the groundtruth and predicted windows is considered a correct detection, for $x$ in $\{0.5, 0.55, \ldots, 0.95\}$.


Table 4.6: Face detection: latency of floating point and quantized models on Qualcomm Snapdragon 835 cores.

We additionally examine the effect of multi-threading on the latency of quantized models. Table 4.6 shows a $1.5\times$ to $2.2\times$ speedup when using 4 cores. The speedup ratios are comparable between the two cores, and are higher for larger models where the overhead of multi-threading occupies a smaller fraction of the total computation.

4.2.4 Face attributes

Figure 4.3 shows the latency-vs-accuracy tradeoff of face attribute classification on the Qualcomm Snapdragon 821. Since quantized training results in little accuracy degradation, we see an improved tradeoff even though the Qualcomm Snapdragon 821 is highly optimized for floating point arithmetic (see Figure 4.2 for comparison).


Figure 4.3: Face attribute classifier on Qualcomm Snapdragon 821: Latency-vs-accuracy tradeoff of floating-point and integer-only MobileNets.

Ablation study To understand performance sensitivity to the quantization scheme, we further evaluate quantized training with varying weight and activation quantization bit depths. The degradation in average precision for binary attributes and age precision relative to the floating-point baseline are shown in Tables 4.7 and 4.8, respectively. The tables suggest that 1) weights are more sensitive to reduced quantization bit depth than activations, 2) 8 and 7-bit quantized models perform similarly to floating point models, and 3) when the total bit-depths are equal, it is better to keep weight and activation bit depths the same.



Table 4.7: Face attributes: relative average category precision of integer-quantized MobileNets (varying weight and activation bit depths) compared with floating point.


Table 4.8: Face attributes: Age precision at difference of 5 years for quantized model (varying weight and activation bit depths) compared with floating point.

5. Discussion

We propose a quantization scheme that relies only on integer arithmetic to approximate the floating-point computations in a neural network. Training that simulates the effect of quantization helps to restore model accuracy to near-identical levels as the original. In addition to the $4\times$ reduction of model size, inference efficiency is improved via ARM NEON-based implementations. The improvement advances the state-of-the-art tradeoff between latency on common ARM CPUs and the accuracy of popular computer vision models. The synergy between our quantization scheme and efficient architecture design suggests that integer-arithmetic-only inference could be a key enabler that propels visual recognition technologies into the real-time and low-end phone market.


References

A. Appendix: Layer-specific details

A.1. Mathematical functions

Math functions such as hyperbolic tangent, the logistic function, and softmax often appear in neural networks. No lookup tables are needed since these functions are implemented in pure fixed-point arithmetic, similarly to how they would be implemented in floating-point arithmetic$^7$.

$^7$ Pure-arithmetic, SIMD-ready, branch-free, fixed-point implementations of at least tanh and the logistic functions are given in gemmlowp [18]’s fixedpoint directory, with specializations for NEON and SSE instruction sets. One can see in TensorFlow Lite [5] how these are called.


A.2. Addition

Some neural networks use a plain Addition layer type, that simply adds two activation arrays together. Such Addition layers are more expensive in quantized inference compared to floating-point because rescaling is needed: one input needs to be rescaled onto the other's scale using a fixed-point multiplication by the multiplier $M = S_1 / S_2$ similar to what we have seen earlier (end of section 2.2), before the actual addition can be performed as a simple integer addition; finally, the result must be rescaled again to fit the output array's scale$^8$.

$^8$ See the TensorFlow Lite [5] implementation.

A.3. Concatenation

Fully general support for concatenation layers poses the same rescaling problem as Addition layers. Because such rescaling of uint8 values would be a lossy operation, and as it seems that concatenation ought to be a lossless operation, we prefer to handle this problem differently: instead of implementing lossy rescaling, we introduce a requirement that all the input activations and the output activations in a Concatenation layer have the same quantization parameters. This removes the need for rescaling and concatenations are thus lossless and free of any arithmetic$^9$.

$^9$ This is implemented in this part of the TensorFlow Lite [5] Converter.

B. Appendix: ARM NEON details

This section assumes familiarity with assembly programming on the ARM NEON instruction set. The instruction mnemonics below refer to the 64-bit ARM instruction set, but the discussion applies equally to 32-bit ARM instructions.

The fixed-point multiplications referenced throughout this article map exactly to the SQRDMULH instruction. It is very important to use the correctly-rounding instruction SQRDMULH and not SQDMULH$^{10}$.

The rounding-to-nearest right-shifts referenced in section 2.2 do not map exactly to any ARM NEON instruction.

$^{10}$ The fixed-point math function implementations in gemmlowp [18] use such fixed-point multiplications, and ordinary (non-saturating) integer additions. We have no use for general saturated arithmetic.


The problem is that the "rounding right shift" instruction, RSHL with variable negative offset, breaks ties by rounding upward, instead of rounding them away from zero. For example, if we use RSHL to implement the division $-12/2^3$, the result will be -1 whereas it should be -2 with "round to nearest". This is problematic as it results in an overall upward bias, which has been observed to cause significant loss of end-to-end accuracy in neural network inference. A correct round-to-nearest right-shift can still be implemented using RSHL but with suitable fix-up arithmetic around it$^{11}$.

$^{11}$ It is implemented here in gemmlowp [18].
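
For reference, here is a portable C++ sketch of the intended round-to-nearest (ties away from zero) right shift, i.e. the behavior the fix-up arithmetic around RSHL is meant to achieve (our own scalar code, not the NEON implementation):

#include <cstdint>

// Divide x by 2^n, rounding to nearest with ties away from zero.
// A plain rounding shift that rounds ties upward would bias the result.
std::int32_t RoundingRightShift(std::int32_t x, int n) {
  if (n == 0) return x;
  const std::int32_t mask = (std::int32_t{1} << n) - 1;
  const std::int32_t remainder = x & mask;
  std::int32_t threshold = mask >> 1;
  if (x < 0) threshold += 1;     // ties round away from zero for negative x
  std::int32_t result = x >> n;  // arithmetic shift = floor division
  if (remainder > threshold) result += 1;
  return result;
}
// Example: RoundingRightShift(-12, 3) == -2, whereas RSHL-style rounding gives -1.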


For efficient NEON implementation of the matrix multiplication’s core accumulation, we use the following trick. In the multiply-add operation in (10), we first change the operands’ type from uint8 to int8 (which can be done by subtracting 128 from the quantized values and zero-points). Thus the core multiply-add becomes
$$\text{int32} \ += \text{int8} * \text{int8}. \tag{B.1}$$

As mentioned in section 3, with a minor tweak of the quantized training process, we can ensure that the weights, once quantized as int8 values, never take the value $-128$. Hence, the product in (B.1) is never $-128 * -128$, and is therefore always less than $2^{14}$ in absolute value. Hence, (B.1) can accumulate two products on a local int16 accumulator before that needs to be accumulated into the true int32 accumulator. This allows the use of an 8-way SIMD multiplication (SMULL on int8 operands), followed by an 8-way SIMD multiply-add (SMLAL on int8 operands), followed by a pairwise-add-and-accumulate into the int32 accumulators (SADALP)$^{12}$.

$^{12}$ This technique is implemented in the optimized NEON kernel in gemmlowp [18], which is in particular what TensorFlow Lite uses (see the choice of L8R8WithLhsNonzeroBitDepthParams at this line).
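
A scalar illustration (our own sketch in plain C++, not NEON code) of why the local int16 accumulation in (B.1) is safe once the weights avoid the value $-128$:

#include <cstdint>

// Because quantized weights never take the value -128, each int8 * int8
// product is at most 127 * 128 = 16256 < 2^14 in absolute value, so the sum
// of two products fits in an int16 before being added to the int32 accumulator.
std::int32_t AccumulateTwoProducts(std::int8_t w0, std::int8_t a0,
                                   std::int8_t w1, std::int8_t a1,
                                   std::int32_t acc) {
  const std::int16_t local =
      static_cast<std::int16_t>(static_cast<std::int16_t>(w0) * a0 +
                                static_cast<std::int16_t>(w1) * a1);
  return acc + local;
}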


C. Appendix: Graph diagrams

D. Experimental protocols

D.1. ResNet protocol

Preprocessing. All images from ImageNet [3] are resized preserving aspect ratio so that the smallest side of the image is 256. Then the center $224 \times 224$ patch is cropped and the means are subtracted for each of the RGB channels.

Optimization. We use the momentum optimizer from TensorFlow [1] with momentum 0.9 and a batch size of 32. The learning rate starts from $10^{-5}$ and decays in a staircase fashion by 0.1 for every 30 epochs. Activation quantization is delayed for 500,000 steps for reasons discussed in section 3. Training uses 50 workers asynchronously, and stops after validation accuracy plateaus, normally after 100 epochs.

momentum [məʊˈmentəm]:n. 动量,势头,动力,推进力
staircase [ˈsteə(r)ˌkeɪs]:n. 楼梯
plateau [ˈplætoʊ]:n. 高原,台地,平台期,高原期 v. 进入停滞期,达到平稳状态
epoch [ˈiːpɒk]:n. 时期,新纪元,新时代,阶段
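
Read literally, this schedule amounts to the following small helper functions (illustrative names, not the actual training code); the 500,000-step delay simply gates when fake quantization of activations is switched on.

```python
def resnet_learning_rate(epoch: int) -> float:
    """Staircase decay: start at 1e-5, multiply by 0.1 every 30 epochs."""
    return 1e-5 * (0.1 ** (epoch // 30))

def quantize_activations(global_step: int, delay_steps: int = 500_000) -> bool:
    """Activation quantization is only enabled after the delay."""
    return global_step >= delay_steps

assert resnet_learning_rate(29) == 1e-5
assert abs(resnet_learning_rate(30) - 1e-6) < 1e-12
assert not quantize_activations(100_000) and quantize_activations(600_000)
```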

D.2. Inception protocol

All results in table 4.3 were obtained after training for approximately 10 million steps, with batches of 32 samples, using 50 distributed workers, asynchronously. Training data were ImageNet 2012 $299 \times 299$ images with labels. Image augmentation consisted of: random crops, random horizontal flips, and random color distortion. The optimizer used was RMSProp with learning rate starting at 0.045 and decaying exponentially and stepwise with factor 0.94 after every 2 epochs. Other RMSProp parameters were: 0.9 momentum, 0.9 decay, 1.0 epsilon term. Trained parameters were EMA averaged with decay 0.9999.
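
As with the ResNet protocol, the decay schedule and the EMA over trained parameters can be written down directly (a sketch with illustrative names, not the actual training code):

```python
def inception_learning_rate(epoch: int) -> float:
    """Exponential stepwise decay: start at 0.045, multiply by 0.94 every 2 epochs."""
    return 0.045 * (0.94 ** (epoch // 2))

def ema_update(shadow: float, param: float, decay: float = 0.9999) -> float:
    """One step of the exponential moving average kept over trained parameters."""
    return decay * shadow + (1.0 - decay) * param
```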

D.3. COCO detection protocol

Preprocessing. During training, all images are randomly cropped and resized to $320 \times 320$. During evaluation, all images are directly resized to $320 \times 320$. All input values are normalized to $[-1, 1]$.

Optimization. We used the RMSprop optimizer from TensorFlow [1] with a batch size of 32. The learning rate starts from $4 \times 10^{-3}$ and decays in a staircase fashion by a factor of 0.1 for every 100 epochs. Activation quantization is delayed for 500,000 steps for reasons discussed in section 3. Training uses 20 workers asynchronously, and stops after validation accuracy plateaus, normally after approximately 6 million steps.

Metrics. Evaluation results are reported with the COCO primary challenge metric: AP at $\text{IoU}=.50:.05:.95$. We follow the same train/eval split as in [13].

D.4. Face detection and face attribute classification protocol

Preprocessing. Random 1:1 crops are taken from images in the Flickr-based dataset used in [10] and resized to $320 \times 320$ pixels for face detection and $128 \times 128$ pixels for face attribute classification. The resulting crops are flipped horizontally with a 50% probability. The values for each of the RGB channels are renormalized to be in the range $[-1, 1]$.

Face Detection Optimization. We used the RMSprop optimizer from TensorFlow [1] with a batch size of 32. The learning rate starts from $4 \times 10^{-3}$ and decays in a staircase fashion by a factor of 0.1 for every 100 epochs. Activation quantization is delayed for 500,000 steps for reasons discussed in section 3. Training uses 20 workers asynchronously, and stops after validation accuracy plateaus, normally after approximately 3 million steps.

Face Attribute Classification Optimization. We followed the optimization protocol in [10]. We used the Adagrad optimizer from TensorFlow [1] with a batch size of 32 and a constant learning rate of 0.1. Training uses 12 workers asynchronously, and stops at 20 million steps.

Latency Measurements. We created a binary that runs the face detection and face attributes classification models repeatedly on random inputs for 100 seconds. We pushed this binary to Pixel and Pixel 2 phones using the adb push command, and executed it on 1, 2, and 4 LITTLE cores, and 1, 2, and 4 big cores using the adb shell command with the appropriate taskset specified. We reported the average runtime of the face detector model on $320 \times 320$ inputs, and of the face attributes classifier model on $128 \times 128$ inputs.
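
A rough Python sketch of how such a measurement could be driven from a host machine is given below. It relies only on the adb push / adb shell commands named above; the benchmark binary name, its on-device path, and the CPU affinity masks are assumptions (the masks presume the common layout with LITTLE cores 0-3 and big cores 4-7, which varies by phone).

```python
import subprocess

BENCH = "/data/local/tmp/quantized_benchmark"   # hypothetical on-device path
CORE_MASKS = {                                  # hypothetical hex affinity masks
    ("LITTLE", 1): "1", ("LITTLE", 2): "3", ("LITTLE", 4): "f",
    ("big", 1): "10", ("big", 2): "30", ("big", 4): "f0",
}

def run_benchmark(cluster: str, num_cores: int) -> str:
    """Push the benchmark and run it pinned to the requested cores via taskset."""
    subprocess.run(["adb", "push", "quantized_benchmark", BENCH], check=True)
    mask = CORE_MASKS[(cluster, num_cores)]
    result = subprocess.run(["adb", "shell", f"taskset {mask} {BENCH}"],
                            check=True, capture_output=True, text=True)
    return result.stdout   # the binary's own timing report
```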

