Paper Reading <2>: Dynamic Routing Between Capsules

Abstract
1 Introduction
2 How the vector inputs and outputs of a capsule are computed
3 Margin loss for digit existence
4 CapsNet architecture
- 4.1 Reconstruction as a regularization method
5 Capsules on MNIST
- 5.1 What the individual dimensions of a capsule represent
- 5.2 Robustness to Affine Transformations
6 Segmenting highly overlapping digits
7 Other datasets
8 Discussion and previous work
A How many routing iterations to use?
Reference
Code

Abstract

A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. Active capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher-level capsule becomes active. We show that a discriminatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional net at recognizing highly overlapping digits. To achieve these results we use an iterative routing-by-agreement mechanism: a lower-level capsule prefers to send its output to higher-level capsules whose activity vectors have a big scalar product with the prediction coming from the lower-level capsule.

1 Introduction

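A parse tree is generally constructed on the fly by dynamically allocating memory. Following Hinton et al., however, we shall assume that, for a single fixation, a parse tree is carved out of a fixed multilayer neural network like a sculpture is carved from a rock. Each layer will be divided into many small groups of neurons called “capsules” (Hinton et al.) and each node in the parse tree will correspond to an active capsule. Using an iterative routing process, each active capsule will choose a capsule in the layer above to be its parent in the tree. For the higher levels of a visual system, this iterative process will be solving the problem of assigning parts to wholes.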

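The activities of the neurons within an active capsule represent the various properties of a particular entity that is present in the image. These properties can include many different types of instantiation parameter such as pose (position, size, orientation), deformation, velocity, albedo, hue, texture, etc. One very special property is the existence of the instantiated entity in the image. An obvious way to represent existence is by using a separate logistic unit whose output is the probability that the entity exists. In this paper we explore an interesting alternative which is to use the overall length of the vector of instantiation parameters to represent the existence of the entity and to force the orientation of the vector to represent the properties of the entity. We ensure that the length of the vector output of a capsule cannot exceed 1 by applying a non-linearity that leaves the orientation of the vector unchanged but scales down its magnitude.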

The fact that the output of a capsule is a vector makes it possible to use a powerful dynamic routing mechanism to ensure that the output of the capsule gets sent to an appropriate parent in the layer above. Initially, the output is routed to all possible parents but is scaled down by coupling coefficients that sum to 1. For each possible parent, the capsule computes a “prediction vector” by multiplying its own output by a weight matrix. If this prediction vector has a large scalar product with the output of a possible parent, there is top-down feedback which increases the coupling coefficient for that parent and decreases it for the other parents. This increases the contribution that the capsule makes to that parent, thus further increasing the scalar product of the capsule's prediction with the parent's output. This type of “routing-by-agreement” should be far more effective than the very primitive form of routing implemented by max-pooling, which allows neurons in one layer to ignore all but the most active feature detector in a local pool in the layer below. We demonstrate that our dynamic routing mechanism is an effective way to implement the “explaining away” that is needed for segmenting highly overlapping objects.

Convolutional neural networks (CNNs) use translated replicas of learned feature detectors. This allows them to translate knowledge about good weight values acquired at one position in an image to other positions. This has proven extremely helpful in image interpretation. Even though we are replacing the scalar-output feature detectors of CNNs with vector-output capsules and max-pooling with routing-by-agreement, we would still like to replicate learned knowledge across space. To achieve this, we make all but the last layer of capsules be convolutional. As with CNNs, we make higher-level capsules cover larger regions of the image. Unlike max-pooling however, we do not throw away information about the precise position of the entity within the region. For low-level capsules, location information is “place-coded” by which capsule is active. As we ascend the hierarchy, more and more of the positional information is “rate-coded” in the real-valued components of the output vector of a capsule. This shift from place-coding to rate-coding, combined with the fact that higher-level capsules represent more complex entities with more degrees of freedom, suggests that the dimensionality of capsules should increase as we ascend the hierarchy.

2 How the vector inputs and outputs of a capsule are computed

There are many possible ways to implement the general idea of capsules. The aim of this paper is not to explore this whole space but simply to show that one fairly straightforward implementation works well and that dynamic routing helps.

We want the length of the output vector of a capsule to represent the probability that the entity represented by the capsule is present in the current input. We therefore use a non-linear “squashing” function to ensure that short vectors get shrunk to almost zero length and long vectors get shrunk to a length slightly below 1. We leave it to discriminative learning to make good use of this non-linearity.

v_j = (||s_j||^2 / (1 + ||s_j||^2)) · (s_j / ||s_j||)    (1)

where v_j is the vector output of capsule j and s_j is its total input.
For all but the first layer of capsules, the total input to a capsule s_j is a weighted sum over all “prediction vectors” û_j|i from the capsules in the layer below, and each prediction vector is produced by multiplying the output u_i of a capsule in the layer below by a weight matrix W_ij:

s_j = Σ_i c_ij û_j|i ,    û_j|i = W_ij u_i    (2)

where the c_ij are coupling coefficients that are determined by the iterative dynamic routing process. The coupling coefficients between capsule i and all the capsules in the layer above sum to 1 and are determined by a “routing softmax” whose initial logits b_ij are the log prior probabilities that capsule i should be coupled to capsule j:

c_ij = exp(b_ij) / Σ_k exp(b_ik)    (3)
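The squashing non-linearity and the weighted sum over prediction vectors described above can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the authors' TensorFlow code: the helper names (`length`, `squash`, `predict`, `total_input`) and the small `eps` guard against division by zero are introduced here for illustration only.

```python
import math

def length(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def squash(s, eps=1e-9):
    """Squashing non-linearity: keep the direction of s, map its length into [0, 1).

    eps is an assumption added here to avoid dividing by zero when s = 0.
    """
    sq_norm = sum(x * x for x in s)
    scale = sq_norm / (1.0 + sq_norm) / (math.sqrt(sq_norm) + eps)
    return [scale * x for x in s]

def predict(W_ij, u_i):
    """Prediction vector u_hat_j|i = W_ij u_i (W_ij given as a list of rows)."""
    return [sum(w * u for w, u in zip(row, u_i)) for row in W_ij]

def total_input(c_j, u_hat_j):
    """Total input s_j = sum_i c_ij * u_hat_j|i for one parent capsule j."""
    dim = len(u_hat_j[0])
    return [sum(c * u_hat[d] for c, u_hat in zip(c_j, u_hat_j)) for d in range(dim)]

short = squash([0.01, 0.0])    # short vector: length shrinks to almost zero
long_ = squash([30.0, 40.0])   # long vector: length ends up slightly below 1
print(length(short), length(long_))
```

The two printed lengths illustrate the behaviour the text asks for: the short input collapses towards zero, while the long input keeps its direction and ends up with a length just under 1.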

The log priors can be learned discriminatively at the same time as all the other weights. They depend on the location and type of the two capsules but not on the current input image. The initial coupling coefficients are then iteratively refined by measuring the agreement between the current output v_j of each capsule j in the layer above and the prediction û_j|i made by capsule i.

The agreement is simply the scalar product a_ij = v_j · û_j|i. This agreement is treated as if it were a log likelihood and is added to the initial logit, b_ij, before computing the new values for all the coupling coefficients linking capsule i to higher-level capsules.
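The routing loop described above (softmax over logits, weighted sum, squash, agreement update) can be sketched as follows. This is a hedged, pure-Python illustration rather than the paper's implementation: the function names, the nested-list layout `u_hat[i][j]`, the toy prediction vectors, and the choice to start all logits at zero (instead of learned log priors) are assumptions made for this example.

```python
import math

def squash(s, eps=1e-9):
    """Keep the direction of s, map its length into [0, 1)."""
    sq_norm = sum(x * x for x in s)
    scale = sq_norm / (1.0 + sq_norm) / (math.sqrt(sq_norm) + eps)
    return [scale * x for x in s]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(b - m) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def route(u_hat, num_iters=3):
    """Routing-by-agreement; u_hat[i][j] is capsule i's prediction for parent j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]       # assumed zero initial logits
    for _ in range(num_iters):
        c = [softmax(row) for row in b]            # couplings of capsule i sum to 1
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += dot(v[j], u_hat[i][j])  # agreement a_ij added to logit
    return v, c

# Toy input: all three lower capsules roughly agree on parent 0,
# while their predictions for parent 1 point in different directions.
u_hat = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, -1.0]],
    [[0.9, 0.1], [1.0, 0.0]],
]
v, c = route(u_hat)
print(c[0], dot(v[0], v[0]) ** 0.5, dot(v[1], v[1]) ** 0.5)
```

In this toy case the coupling coefficients drift towards parent 0, whose predictions agree, so its output vector ends up longer than that of parent 1.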

In convolutional capsule layers, each capsule outputs a local grid of vectors to each type of capsule in the layer above using different transformation matrices for each member of the grid as well as for each type of capsule.

3 Margin loss for digit existence

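We are using the length of the instantiation vector to represent the probability that a capsule's entity exists. We would like the top-level capsule for digit class k to have a long instantiation vector if and only if that digit is present in the image. To allow for multiple digits, we use a separate margin loss, L_k, for each digit capsule k:

L_k = T_k max(0, m+ − ||v_k||)^2 + λ (1 − T_k) max(0, ||v_k|| − m−)^2    (4)

where T_k = 1 iff a digit of class k is present, m+ = 0.9 and m− = 0.1. The λ down-weighting of the loss for absent digit classes stops the initial learning from shrinking the lengths of the activity vectors of all the digit capsules. We use λ = 0.5. The total loss is simply the sum of the losses of all digit capsules.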
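As a small worked example of the margin loss, the sketch below sums the per-class loss over all digit capsules, using m+ = 0.9, m− = 0.1 and a down-weighting factor of 0.5 for absent classes as given in the text. The function name `margin_loss`, its argument layout, and the toy capsule lengths are assumptions made for illustration.

```python
def margin_loss(lengths, present, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum over classes k of
    T_k * max(0, m_plus - |v_k|)^2 + lam * (1 - T_k) * max(0, |v_k| - m_minus)^2.

    lengths: list of capsule output lengths |v_k|, one per class.
    present: set of class indices that appear in the image (several are allowed).
    """
    total = 0.0
    for k, v_len in enumerate(lengths):
        t_k = 1.0 if k in present else 0.0
        present_term = max(0.0, m_plus - v_len) ** 2
        absent_term = max(0.0, v_len - m_minus) ** 2
        total += t_k * present_term + lam * (1.0 - t_k) * absent_term
    return total

# Class 0 is present and confidently detected; class 2 is absent but too long.
loss_a = margin_loss([0.95, 0.05, 0.3], present={0})
# Class 0 is present but its capsule is too short.
loss_b = margin_loss([0.5, 0.05, 0.05], present={0})
print(loss_a, loss_b)
```

Here `loss_a` comes only from the absent class 2, giving 0.5 × (0.3 − 0.1)² = 0.02, and `loss_b` comes only from the present class 0, giving (0.9 − 0.5)² = 0.16.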

4 CapsNet architecture

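A simple CapsNet architecture is shown in Fig. 1. The architecture is shallow, with only two convolutional layers and one fully connected layer. Conv1 has 256 convolution kernels of size 9 × 9 with a stride of 1 and ReLU activation. This layer converts pixel intensities to the activities of local feature detectors that are then used as inputs to the primary capsules.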

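Figure 1: A simple CapsNet with 3 layers. This model gives results comparable to deep convolutional networks (such as Chang and Chen [2015]). The length of the activity vector of each capsule in the DigitCaps layer indicates the presence of an instance of each class and is used to calculate the classification loss. W_ij is a weight matrix between each u_i, i ∈ (1, 32 × 6 × 6), in PrimaryCapsules and v_j, j ∈ (1, 10).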

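The primary capsules are the lowest level of multi-dimensional entities and, from an inverse graphics perspective, activating the primary capsules corresponds to inverting the rendering process. This is a very different type of computation from piecing instantiated parts together to make familiar wholes, which is what capsules are designed to be good at.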

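The second layer (PrimaryCapsules) is a convolutional capsule layer with 32 channels of convolutional 8D capsules (i.e. each primary capsule contains 8 convolutional units with a 9 × 9 kernel and a stride of 2). Each primary capsule output sees the outputs of all 256 × 81 Conv1 units whose receptive fields overlap with the location of the center of the capsule. In total PrimaryCapsules has [32 × 6 × 6] capsule outputs (each output is an 8D vector), and the capsules in each [6 × 6] grid share their weights with each other. One can see PrimaryCapsules as a convolution layer with Eq. 1 as its block non-linearity. The final layer (DigitCaps) has one 16D capsule per digit class, and each of these capsules receives input from all the capsules in the layer below.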
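The layer sizes quoted above can be checked with a little arithmetic. The sketch below assumes a 28 × 28 MNIST input and convolutions without padding; neither is stated in these notes, but both are consistent with the 6 × 6 grid. The helper `conv_out` and the variable names are made up for this example.

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a convolution without padding (an assumption here)."""
    return (size - kernel) // stride + 1

image_size = 28                                  # assumed MNIST input: 28 x 28 pixels
conv1_size = conv_out(image_size, kernel=9, stride=1)
primary_size = conv_out(conv1_size, kernel=9, stride=2)

num_primary_caps = 32 * primary_size * primary_size   # 32 channels of 8D capsules
inputs_per_capsule = 256 * 9 * 9                      # Conv1 units seen by one capsule

print(conv1_size, primary_size, num_primary_caps, inputs_per_capsule)
```

Under these assumptions Conv1 produces a 20 × 20 map, PrimaryCapsules a 6 × 6 grid, and the layer holds 32 × 6 × 6 = 1152 capsules, each looking at 256 × 81 = 20736 Conv1 outputs.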

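We have routing only between two consecutive capsule layers (i.e. PrimaryCapsules and DigitCaps). Since the Conv1 output is 1D, there is no orientation in its space to agree on. Therefore, no routing is used between Conv1 and PrimaryCapsules. All the routing logits (b_ij) are initialized to zero, so initially a capsule output (u_i) is sent to all parent capsules (v_0 ... v_9) with equal probability (c_ij). Our implementation is in TensorFlow (Abadi et al. [2016]) and we use the Adam optimizer (Kingma and Ba [2014]) with its TensorFlow default parameters, including the exponentially decaying learning rate, to minimize the sum of the margin losses in Eq. 4.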

4.1 Reconstruction as a regularization method

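We use an additional reconstruction loss to encourage the digit capsules to encode the instantiation parameters of the input digit. During training, we mask out all but the activity vector of the correct digit capsule. Then we use this activity vector to reconstruct the input image. The output of the digit capsule is fed into a decoder consisting of 3 fully connected layers that model the pixel intensities, as shown in Fig. 2. We minimize the sum of squared differences between the outputs of the logistic units and the pixel intensities. We scale down this reconstruction loss by 0.0005 so that it does not dominate the margin loss during training. As illustrated in Fig. 3, the reconstructions from the 16D output of the CapsNet are robust while keeping only the important details.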
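The masking step and the scaled reconstruction term can be illustrated as follows. This is a simplified sketch: the decoder itself is omitted, and the function names `mask_digit_caps` and `total_loss`, as well as the tiny toy vectors, are assumptions made here. Only the 0.0005 weight and the sum-of-squared-differences form come from the text.

```python
def mask_digit_caps(digit_caps, true_class):
    """Zero every capsule except the correct one and flatten into the decoder input."""
    flat = []
    for k, v in enumerate(digit_caps):
        flat.extend(v if k == true_class else [0.0] * len(v))
    return flat

def total_loss(margin, reconstruction, pixels, recon_weight=0.0005):
    """Margin loss plus the scaled sum of squared pixel differences."""
    sse = sum((r - p) ** 2 for r, p in zip(reconstruction, pixels))
    return margin + recon_weight * sse

# Three toy 2D "digit capsules"; class 1 is the correct label.
decoder_input = mask_digit_caps([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], true_class=1)

# Toy 2-pixel image and a hypothetical decoder output for it.
loss = total_loss(margin=0.02, reconstruction=[0.5, 0.5], pixels=[1.0, 0.0])
print(decoder_input, loss)
```

With the small weight, the squared error of 0.5 in this toy case adds only 0.00025 to the margin loss, which shows why the reconstruction term does not dominate training.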

5 Capsules on MNIST

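Table 1 reports the test error rate on MNIST for different CapsNet setups and shows the importance of routing and of the reconstruction regularizer. Adding the reconstruction regularizer boosts the routing performance by enforcing the pose encoding in the capsule vector.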

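The baseline is a standard CNN with three convolutional layers of 256, 256 and 128 channels. Each has 5 × 5 kernels and a stride of 1. The last convolutional layer is followed by two fully connected layers of size 328 and 192. The last fully connected layer is connected with dropout to a 10-class softmax layer with cross-entropy loss. The baseline is also trained on 2-pixel shifted MNIST with the Adam optimizer. The baseline is designed to achieve the best performance on MNIST while keeping the computation cost as close as possible to CapsNet. In terms of number of parameters, the baseline has 35.4M while CapsNet has 8.2M, or 6.8M without the reconstruction subnetwork.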
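The CapsNet parameter counts can be roughly reproduced from the layer descriptions. The sketch below is a back-of-the-envelope check, not an official count, and rests on several assumptions: a single-channel input, one bias per convolutional output channel, an 8 × 16 matrix W_ij for every (primary capsule, digit capsule) pair, routing logits not counted as parameters, and decoder layer sizes of 512, 1024 and 784 units. The decoder sizes are taken from the paper's Fig. 2, which is not reproduced in these notes.

```python
def conv_params(kernel, in_ch, out_ch):
    """Weights plus one bias per output channel for a square convolution."""
    return kernel * kernel * in_ch * out_ch + out_ch

def dense_params(n_in, n_out):
    """Weights plus biases for a fully connected layer."""
    return n_in * n_out + n_out

conv1 = conv_params(9, 1, 256)                  # Conv1: 256 kernels of 9 x 9
primary = conv_params(9, 256, 32 * 8)           # 32 channels of 8D capsules
digit_w = (32 * 6 * 6) * 10 * 8 * 16            # one 8 x 16 matrix W_ij per pair
capsnet = conv1 + primary + digit_w

decoder_sizes = [16 * 10, 512, 1024, 784]       # assumed decoder layout
decoder = sum(dense_params(a, b)
              for a, b in zip(decoder_sizes, decoder_sizes[1:]))

print(round(capsnet / 1e6, 1), round((capsnet + decoder) / 1e6, 1))
```

Under these assumptions the totals come out at about 6.8M without the decoder and about 8.2M with it, in line with the figures quoted above.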

5.1 What the individual dimensions of a capsule represent

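Since we pass on the encoding of only one digit and zero out the other digits, the dimensions of a digit capsule should learn to span the space of variations in the way digits of that class are instantiated. These variations include stroke thickness, skew and width.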

5.2 Robustness to Affine Transformations

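Our model was never trained with affine transformations other than translation and the natural transformations seen in standard MNIST. An under-trained CapsNet with early stopping that achieved 99.23% accuracy on the expanded MNIST test set achieved 79% accuracy on the affNIST test set. A traditional convolutional model with a similar number of parameters that achieved similar accuracy (99.22%) on the expanded MNIST test set only achieved 66% on the affNIST test set.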

6 Segmenting highly overlapping digits

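The reconstructions in Fig. 5 show that CapsNet is able to segment the image into the two original digits. Since this segmentation is not at pixel level, we observe that the model is able to deal correctly with the overlaps (a pixel that is on in both digits) while accounting for all the pixels. The position and the style of each digit are encoded in DigitCaps. The decoder has learned to reconstruct a digit given the encoding. The fact that it is able to reconstruct digits regardless of the overlap shows that each digit capsule can pick up the style and position from the votes it receives from the PrimaryCapsules layer.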

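We decode the two most active DigitCaps capsules one at a time and get two images. Then, by assigning any pixel with non-zero intensity to each digit, we get the segmentation results for each digit.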

The model does not assign a pixel to both digits if one of them has no other support for it.
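The pixel-assignment step can be sketched as a simple thresholding of the two reconstructions. This is only an illustration of the "non-zero intensity" rule described above: the function name `segment`, the nested-list image format and the toy 2 × 3 reconstructions are assumptions, and no real decoder output is shown.

```python
def segment(recon_a, recon_b):
    """Assign every pixel with non-zero intensity to the digit that produced it.

    Returns one boolean mask per digit plus a mask of pixels claimed by both.
    """
    mask_a = [[p > 0.0 for p in row] for row in recon_a]
    mask_b = [[p > 0.0 for p in row] for row in recon_b]
    overlap = [[a and b for a, b in zip(row_a, row_b)]
               for row_a, row_b in zip(mask_a, mask_b)]
    return mask_a, mask_b, overlap

# Two hypothetical 2 x 3 reconstructions, one per decoded digit capsule.
recon_a = [[0.0, 0.8, 0.6],
           [0.0, 0.9, 0.0]]
recon_b = [[0.0, 0.0, 0.7],
           [0.5, 0.4, 0.0]]

mask_a, mask_b, overlap = segment(recon_a, recon_b)
print(overlap)
```

Pixels that both reconstructions light up are the overlapping strokes; every other pixel belongs to at most one digit.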

7 Other datasets

One drawback of Capsules, which it shares with generative models, is that it likes to account for everything in the image, so it does better when it can model the clutter than when it just uses an additional “orphan” category in the dynamic routing. In CIFAR-10, the backgrounds are much too varied to model in a reasonably sized net, which helps to account for the poorer performance.

8 Discussion and previous work

Now that convolutional neural networks have become the dominant approach to object recognition, it makes sense to ask whether there are any exponential inefficiencies that may lead to their demise. A good candidate is the difficulty that convolutional nets have in generalizing to novel viewpoints. The ability to deal with translation is built in, but for the other dimensions of an affine transformation we have to choose between replicating feature detectors on a grid that grows exponentially with the number of dimensions, or increasing the size of the labelled training set in a similarly exponential way. Capsules (Hinton et al. [2011]) avoid these exponential inefficiencies by converting pixel intensities into vectors of instantiation parameters of recognized fragments and then applying transformation matrices to the fragments to predict the instantiation parameters of larger fragments. Transformation matrices that learn to encode the intrinsic spatial relationship between a part and a whole constitute viewpoint-invariant knowledge that automatically generalizes to novel viewpoints. Hinton et al. [2011] proposed transforming autoencoders to generate the instantiation parameters of the PrimaryCapsule layer and their system required transformation matrices to be supplied externally. We propose a complete system that also answers “how larger and more complex visual entities can be recognized by using agreements of the poses predicted by active, lower-level capsules”.

Capsules make a very strong representational assumption: at each location in the image, there is at most one instance of the type of entity that a capsule represents. This assumption, which was motivated by the perceptual phenomenon called “crowding” (Pelli et al. [2004]), eliminates the binding problem (Hinton [1981a]) and allows a capsule to use a distributed representation (its activity vector) to encode the instantiation parameters of the entity of that type at a given location. This distributed representation is exponentially more efficient than encoding the instantiation parameters by activating a point on a high-dimensional grid, and with the right distributed representation, capsules can then take full advantage of the fact that spatial relationships can be modelled by matrix multiplies.

Capsules use neural activities that vary as viewpoint varies rather than trying to eliminate viewpoint variation from the activities. This gives them an advantage over “normalization” methods like spatial transformer networks (Jaderberg et al. [2015]): they can deal with multiple different affine transformations of different objects or object parts at the same time.

A How many routing iterations to use?

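We observed that, in general, more routing iterations increase the network capacity and tend to overfit to the training dataset. Fig. A.2 compares the training loss of capsules on Cifar10 when trained with 1 iteration of routing versus 3 iterations of routing. Motivated by Fig. A.2 and Fig. A.1, we suggest 3 iterations of routing for all experiments.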

Reference

[1]: Sabour S, Frosst N, Hinton G E. Dynamic routing between capsules. Advances in Neural Information Processing Systems, Long Beach, USA: MIT Press, 2017. 3856−3866
[2]: Dynamic Routing Between Capsules (translation)
[3]: What do you make of Hinton's paper "Dynamic Routing Between Capsules"?

Code

[1]: naturomics/CapsNet-Tensorflow
[2]: XifengGuo/CapsNet-Keras
