Understanding Capsule Networks — AI’s Alluring New Architecture

by Nick Bourdakos


Convolutional neural networks have done an amazing job, but are rooted in problems. It’s time we started thinking about new solutions or improvements — and now, enter capsules.


Previously, I briefly discussed how capsule networks combat some of these traditional problems. For the past few months, I’ve been submerging myself in all things capsules. I think it’s time we all try to get a deeper understanding of how capsules actually work.


In order to make it easier to follow along, I have built a visualization tool that allows you to see what is happening at each layer. This is paired with a simple implementation of the network. All of it can be found on GitHub here.


This is the CapsNet architecture. Don’t worry if you don’t understand what any of it means yet. I’ll be going through it layer by layer, with as much detail as I can possibly conjure up.


Part 0: The Input

The input into CapsNet is the actual image supplied to the neural net. In this example the input image is 28 pixels high and 28 pixels wide. But images actually have 3 dimensions, and the 3rd dimension contains the color channels.


The image in our example only has one color channel, because it’s black and white. Most images you are familiar with have 3 or 4 channels, for Red-Green-Blue and possibly an additional channel for Alpha, or transparency.


Each one of these pixels is represented as a value from 0 to 255 and stored in a 28x28x1 matrix [28, 28, 1]. The brighter the pixel, the larger the value.

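In NumPy, for instance (a sketch of my own, not code from the article’s repo), that input looks like this:

import numpy as np

# A stand-in 28x28 grayscale image with pixel values from 0 to 255.
image = np.random.randint(0, 256, size=(28, 28, 1), dtype=np.uint8)
print(image.shape)  # (28, 28, 1): height, width, color channels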

Part 1a: Convolutions

The first part of CapsNet is a traditional convolutional layer. What is a convolutional layer, how does it work, and what is its purpose?


The goal is to extract some extremely basic features from the input image, like edges or curves.


How can we do this?


Let’s think about an edge:


If we look at a few points on the image, we can start to pick up a pattern. Focus on the colors to the left and right of the point we are looking at:


You might notice that they have a larger difference if the point is an edge:


255 - 114 = 141
114 - 153 = -39
153 - 153 = 0
255 - 255 = 0

What if we went through each pixel in the image and replaced its value with the value of the difference of the pixels to the left and right of it? In theory, the image should become all black except for the edges.


We could do this by looping through every pixel in the image:


for pixel in image {
  result[pixel] = image[pixel - 1] - image[pixel + 1]
}

But this isn’t very efficient. We can instead use something called a “convolution.” Technically speaking, it’s a “cross-correlation,” but everyone likes to call them convolutions.


A convolution is essentially doing the same thing as our loop, but it takes advantage of matrix math.


A convolution is done by lining up a small “window” in the corner of the image that only lets us see the pixels in that area. We then slide the window across all the pixels in the image, multiplying each pixel by a set of weights and then adding up all the values that are in that window.


This window is a matrix of weights, called a “kernel.”


We only care about 2 pixels, but when we wrap the window around them it will encapsulate the pixel between them.


Window:
┌─────────────────────────────────────┐
│ left_pixel middle_pixel right_pixel │
└─────────────────────────────────────┘

Can you think of a set of weights that we can multiply these pixels by so that their sum adds up to the value we are looking for?


Window:
┌─────────────────────────────────────┐
│ left_pixel middle_pixel right_pixel │
└─────────────────────────────────────┘
(w1 * 255) + (w2 * 255) + (w3 * 114) = 141

Spoilers below!



We can do something like this:


Window:
┌─────────────────────────────────────┐
│ left_pixel middle_pixel right_pixel │
└─────────────────────────────────────┘
(1 * 255) + (0 * 255) + (-1 * 114) = 141

With these weights, our kernel will look like this:


kernel = [1  0 -1]

However, kernels are generally square — so we can pad it with more zeros to look like this:


kernel = [[ 0  0  0]
          [ 1  0 -1]
          [ 0  0  0]]

Here’s a nice gif to see a convolution in action:

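In case the gif doesn’t come through, here is a rough plain-NumPy sketch of the same sliding-window operation (my own illustration, not code from the article’s repo; stride 1, no padding):

import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel window across the image, multiply each pixel
    # by its weight, and sum up the values inside the window.
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    output = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            window = image[y:y + k_h, x:x + k_w]
            output[y, x] = np.sum(window * kernel)
    return output

kernel = np.array([[0, 0,  0],
                   [1, 0, -1],
                   [0, 0,  0]])

image = np.random.rand(28, 28)      # stand-in for our 28x28 input
edges = convolve2d(image, kernel)
print(edges.shape)                  # (26, 26)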

Note: The dimension of the output is the dimension of the input minus the size of the kernel, plus 1. For example: (7 - 3) + 1 = 5 (more on this in the next section)


Here’s what the original image looks like after doing a convolution with the kernel we crafted:


You might notice that a couple edges are missing. Specifically, the horizontal ones. In order to highlight those, we would need another kernel that looks at pixels above and below. Like this:


kernel = [[ 0  1  0]
          [ 0  0  0]
          [ 0 -1  0]]

Also, both of these kernels won’t work well with edges of other angles or edges that are blurred. For that reason, we use many kernels (in our CapsNet implementation, we use 256 kernels). And the kernels are normally larger to allow for more wiggle room (our kernels will be 9x9).


This is what one of the kernels looked like after training the model. It’s not very obvious, but this is just a larger version of our edge detector that is more robust and only finds edges that go from bright to dark.


kernel = [[ 0.02 -0.01  0.01 -0.05 -0.08 -0.14 -0.16 -0.22 -0.02]
          [ 0.01  0.02  0.03  0.02  0.00 -0.06 -0.14 -0.28  0.03]
          [ 0.03  0.01  0.02  0.01  0.03  0.01 -0.11 -0.22 -0.08]
          [ 0.03 -0.01 -0.02  0.01  0.04  0.07 -0.11 -0.24 -0.05]
          [-0.01 -0.02 -0.02  0.01  0.06  0.12 -0.13 -0.31  0.04]
          [-0.05 -0.02  0.00  0.05  0.08  0.14 -0.17 -0.29  0.08]
          [-0.06  0.02  0.00  0.07  0.07  0.04 -0.18 -0.10  0.05]
          [-0.06  0.01  0.04  0.05  0.03 -0.01 -0.10 -0.07  0.00]
          [-0.04  0.00  0.04  0.05  0.02 -0.04 -0.02 -0.05  0.04]]

Note: I’ve rounded the values because they are quite long, for example 0.01783941


Luckily, we don’t have to hand-pick this collection of kernels. That is what training does. The kernels all start off empty (or in a random state) and keep getting tweaked in the direction that makes the output closer to what we want.


This is what the 256 kernels ended up looking like (I colored them as pixels so it’s easier to digest). The more negative the numbers, the bluer they are. 0 is green and positive is yellow:


After we filter the image with all of these kernels, we end up with a fat stack of 256 output images.


Part 1b: ReLU

ReLU (formally known as Rectified Linear Unit) may sound complicated, but it’s actually quite simple. ReLU is an activation function that takes in a value. If it’s negative it becomes zero, and if it’s positive it stays the same.


In code:


x = max(0, x)

And as a graph:


We apply this function to all of the outputs of our convolutions.

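In NumPy that is a single elementwise operation (a sketch; the tensor name is my own):

import numpy as np

conv_outputs = np.random.randn(20, 20, 256)  # stand-in for the 256 Conv1 feature maps
relu_outputs = np.maximum(0, conv_outputs)   # negatives become 0, positives pass through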

Why do we do this? If we don’t apply some sort of activation function to the output of our layers, then the entire neural net could be described as a linear function. This would mean that all this stuff we are doing is kind of pointless.


Adding a non-linearity allows us to describe all kinds of functions. There are many different types of function we could apply, but ReLU is the most popular because it’s very cheap to perform.


Here are the outputs of ReLU Conv1 layer:


Part 2a: PrimaryCaps

The PrimaryCaps layer starts off as a normal convolution layer, but this time we are convolving over the stack of 256 outputs from the previous convolutions. So instead of having a 9x9 kernel, we have a 9x9x256 kernel.


So what exactly are we looking for?


In the first layer of convolutions we were looking for simple edges and curves. Now we are looking for slightly more complex shapes from the edges we found earlier.


This time our “stride” is 2. That means instead of moving 1 pixel at a time, we take steps of 2. A larger stride is chosen so that we can reduce the size of our input more rapidly:


Note: The dimension of the output would normally be 12, but we divide it by 2, because of the stride. For example: ((20 - 9) + 1) / 2 = 6

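This is the same arithmetic as the earlier note, just with a stride term. A tiny helper (my own sketch) makes it concrete:

def conv_output_size(input_size, kernel_size, stride=1):
    # (input - kernel) // stride + 1
    return (input_size - kernel_size) // stride + 1

print(conv_output_size(28, 9))             # 20: the Conv1 output
print(conv_output_size(20, 9, stride=2))   # 6: the PrimaryCaps output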

We will convolve over the outputs another 256 times. So we will end up with a stack of 256 6x6 outputs.


But this time we aren’t satisfied with just some lousy plain old numbers.


We’re going to cut the stack up into 32 decks with 8 cards each.


We can call this deck a “capsule layer.”


Each capsule layer has 36 “capsules.”


If you’re keeping up (and are a math wiz), that means each capsule has an array of 8 values. This is what we can call a “vector.”


Here’s what I’m talking about:

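In code, the cutting is just a reshape. A NumPy sketch (the tensor names are my own):

import numpy as np

primary_caps = np.random.randn(6, 6, 256)     # stand-in for the stack of 256 6x6 outputs
capsules = primary_caps.reshape(6, 6, 32, 8)  # 32 capsule layers, 8 values per capsule
capsules = capsules.reshape(-1, 8)            # flatten: 6 * 6 * 32 = 1,152 capsules
print(capsules.shape)                         # (1152, 8)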

These “capsules” are our new pixel.


With a single pixel, we could only store the confidence of whether or not we found an edge in that spot. The higher the number, the higher the confidence.


With a capsule we can store 8 values per location! That gives us the opportunity to store more information than just whether or not we found a shape in that spot. But what other kinds of information would we want to store?


When looking at the shape below, what can you tell me about it? If you had to tell someone else how to redraw it, and they couldn’t look at it, what would you say?


This image is extremely basic, so there are only a few details we need to describe the shape:


  • Type of shape
  • Position
  • Rotation
  • Color
  • Size

We can call these “instantiation parameters.” With more complex images we will end up needing more details. They can include pose (position, size, orientation), deformation, velocity, albedo, hue, texture, and so on.


You might remember that when we made a kernel for edge detection, it only worked on a specific angle. We needed a kernel for each angle. We could get away with it when dealing with edges because there are very few ways to describe an edge. Once we get up to the level of shapes, we don’t want to have a kernel for every angle of rectangles, ovals, triangles, and so on. It would get unwieldy, and would become even worse when dealing with more complicated shapes that have 3 dimensional rotations and features like lighting.


That’s one of the reasons why traditional neural nets don’t handle unseen rotations very well:


As we go from edges to shapes and from shapes to objects, it would be nice if we had more room to store this extra useful information.


Here is a simplified comparison of 2 capsule layers (one for rectangles and the other for triangles) vs 2 traditional pixel outputs:


Like a traditional 2D or 3D vector, this vector has an angle and a length. The length describes the probability, and the angle describes the instantiation parameters. In the example above, the angle actually matches the angle of the shape, but that’s not normally the case.


In reality it’s not really feasible (or at least easy) to visualize the vectors like above, because these vectors are 8 dimensional.


Since we have all this extra information in a capsule, the idea is that we should be able to recreate the image from them.


Sounds great, but how do we coax the network into actually wanting to learn these things?


When training a traditional CNN, we only care about whether or not the model predicts the right classification. With a capsule network, we have something called a “reconstruction.” A reconstruction takes the vector we created and tries to recreate the original input image, given only this vector. We then grade the model based on how close the reconstruction matches the original image.


I will go into more detail on this in the coming sections, but here is a simple example:


Part 2b: Squashing

After we have our capsules, we are going to perform another non-linearity function on it (like ReLU), but this time the equation is a bit more involved. The function scales the values of the vector so that only the length of the vector changes, not the angle. This way we can make the vector between 0 and 1 so it’s an actual probability.

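This function is known as “squash” in the paper. Here is a NumPy sketch of it (assuming capsule vectors along the last axis, with a small epsilon added for numerical safety):

import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Scale each vector so its length lands between 0 and 1
    # while leaving its direction (angle) unchanged.
    squared_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    norm = np.sqrt(squared_norm + eps)
    return (squared_norm / (1.0 + squared_norm)) * (s / norm)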

This is what lengths of the capsule vectors look like after squashing. At this point it’s almost impossible to guess what each capsule is looking for.


Part 3: Routing by Agreement

The next step is to decide what information to send to the next level. In traditional networks, we would probably do something like “max pooling.” Max pooling is a way to reduce size by only passing on the highest activated pixel in the region to the next layer.


However, with capsule networks we are going to do something called routing by agreement. The best example of this is the boat and house example illustrated by Aurélien Géron in this excellent video. Each capsule tries to predict the next layer’s activations based on itself:


Looking at these predictions, which object would you choose to pass on to the next layer (not knowing the input)? Probably the boat, right? Both the rectangle capsule and the triangle capsule agree on what the boat would look like. But they don’t agree on how the house would look, so it’s not very likely that the object is a house.


With routing by agreement, we only pass on the useful information and throw away the data that would just add noise to the results. This gives us a much smarter selection than just choosing the largest number, like in max pooling.


With traditional networks, misplaced features don’t faze them:


With capsule networks, the features wouldn’t agree with each other:


Hopefully, that works intuitively. However, how does the math work?


We have 10 different digit classes that we are predicting:


0, 1, 2, 3, 4, 5, 6, 7, 8, 9

Note: In the boat and house example we were predicting 2 objects, but now we are predicting 10.


Unlike in the boat and the house example, the predictions aren’t actually images. Instead, we are trying to predict the vector that describes the image.


The capsule’s predictions for each class are made by multiplying its vector by a matrix of weights for each class that we are trying to predict.


Remember that we have 32 capsule layers, and each capsule layer has 36 capsules. That means we have a total of 1,152 capsules.


cap_1 * weight_for_0 = prediction
cap_1 * weight_for_1 = prediction
cap_1 * weight_for_2 = prediction
cap_1 * ...
cap_1 * weight_for_9 = prediction

cap_2 * weight_for_0 = prediction
cap_2 * weight_for_1 = prediction
cap_2 * weight_for_2 = prediction
cap_2 * ...
cap_2 * weight_for_9 = prediction

...

cap_1152 * weight_for_0 = prediction
cap_1152 * weight_for_1 = prediction
cap_1152 * weight_for_2 = prediction
cap_1152 * ...
cap_1152 * weight_for_9 = prediction

You will end up with a list of 11,520 predictions.


Each weight is actually a 16x8 matrix, so each prediction is a matrix multiplication between the capsule vector and this weight matrix:

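As a NumPy sketch (the variable names are my own):

import numpy as np

capsule = np.random.randn(8)      # one 8 dimensional capsule vector
weight = np.random.randn(16, 8)   # the 16x8 weight matrix for one class

prediction = weight @ capsule     # a 16 dimensional prediction vector
print(prediction.shape)           # (16,)

# 1,152 capsules * 10 classes = 11,520 predictions in total.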

As you can see, our prediction is a 16 dimensional vector.


Where does the 16 come from? It’s an arbitrary choice, just like 8 was for our original capsules.


But it should be noted that we want to increase the number of dimensions of our capsules the deeper we get into the network. This should make sense intuitively, because the deeper we go the more complex our features become and the more parameters we need to recreate them. For example, you will need more information to describe an entire face than just a person’s eye.


The next step is to figure out which of these 11,520 predictions agree with each other the most.


It can be difficult to visualize a solution to this when we think in terms of high dimensional vectors. For the sake of sanity, let’s start off by pretending our vectors are just points in 2 dimensional space:


We start off by calculating the mean of all of the points. Each point starts out with equal importance:


We can then measure the distance of each point from the mean. The further a point is from the mean, the less important it becomes:


We then recalculate the mean, this time taking into account the point’s importance:


We end up going through this cycle 3 times:

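Here is a toy NumPy sketch of that cycle (my own simplification on 2D points; the paper’s actual routing weighs agreement with a softmax over dot products, but the spirit is the same):

import numpy as np

points = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [4.0, 3.5]])
weights = np.ones(len(points))         # every point starts equally important

for _ in range(3):                     # 3 rounds of agreement
    mean = np.average(points, axis=0, weights=weights)
    distances = np.linalg.norm(points - mean, axis=1)
    weights = 1.0 / (1.0 + distances)  # farther from the mean -> less important

print(mean)     # settles near the cluster of agreeing points
print(weights)  # the outlier's weight has shrunk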

As you can see, as we go through this cycle, the points that don’t agree with the others start to disappear. The highest agreeing points end up getting passed on to the next layer with the highest activations.


Part 4: DigitCaps

After agreement, we end up with ten 16 dimensional vectors, one vector for each digit. This matrix is our final prediction. The length of the vector is the confidence of the digit being found — the longer the better. The vector can also be used to generate a reconstruction of the input image.

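Reading off the prediction is then just a matter of vector lengths. A sketch (the tensor name is my own):

import numpy as np

digit_caps = np.random.randn(10, 16)           # ten 16 dimensional vectors
lengths = np.linalg.norm(digit_caps, axis=-1)  # confidence for each digit, 0-9
print(lengths.argmax())                        # the predicted digit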

This is what the lengths of the vectors look like with the input of 4:


The fifth block is the brightest, which means high confidence. Remember that 0 is the first class, meaning 4 is our predicted class.


Part 5: Reconstruction

The reconstruction portion of the implementation isn’t very interesting. It’s just a few fully connected layers. But the reconstruction itself is very cool and fun to play around with.

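For reference, the decoder described in the original paper is three fully connected layers: 512 and 1024 ReLU units, then 784 sigmoid outputs reshaped into a 28x28 image. A Keras-style sketch under that assumption (not the article’s exact code):

from tensorflow import keras

decoder = keras.Sequential([
    keras.layers.Dense(512, activation="relu", input_shape=(16,)),
    keras.layers.Dense(1024, activation="relu"),
    keras.layers.Dense(784, activation="sigmoid"),  # 28 * 28 = 784 pixels
    keras.layers.Reshape((28, 28)),
])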

If we reconstruct our 4 input from its vector, this is what we get:


If we manipulate the sliders (the vector), we can see how each dimension affects the 4:


I recommend cloning the visualization repo to play around with different inputs and see how the sliders affect the reconstruction:


git clone https://github.com/bourdakos1/CapsNet-Visualization.git
cd CapsNet-Visualization
pip install -r requirements.txt

Run the tool:


python run_visualization.py

Then point your browser to: http://localhost:5000


Final Thoughts

I think that the reconstructions from capsule networks are stunning. Even though the current model is only trained on simple digits, it makes my mind run with the possibilities that a matured architecture trained on a larger dataset could achieve.


I’m very curious to see how manipulating the reconstruction vectors of a more complicated image would affect it. For that reason, my next project is to get capsule networks to work with the CIFAR and smallNORB datasets.


Thanks for reading! If you have any questions, feel free to reach out at bourdakos1@gmail.com, connect with me on LinkedIn, or follow me on Medium and Twitter.


If you found this article helpful, it would mean a lot if you gave it some applause and shared it to help others find it! And feel free to leave a comment below.


Originally published at https://www.freecodecamp.org/news/understanding-capsule-networks-ais-alluring-new-architecture-bdb228173ddc/
