AlexNet Architecture and Implementation

In my last blog, I gave a detailed explanation of the LeNet-5 architecture. In this blog, we'll explore its enhanced successor, AlexNet.

AlexNet was the winner of the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2012), submitted by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, and it beat its nearest contender by more than 10 percentage points in top-5 error rate. These visual recognition challenges help researchers monitor the progress of computer vision research across the globe.

Before we proceed further, let's discuss the data present in the ImageNet dataset. It contains images of dogs, horses, cars, and so on, organized into 1000 classes, with each class containing thousands of images. In total there are approximately 1.2 million high-resolution images in this dataset, which researchers use for training, validating, and testing the models they design.

Let’s dive into the AlexNet Architecture


Photo by Dylan Nolte on Unsplash

The AlexNet neural network architecture consists of 8 learned layers: 5 convolution layers and 3 fully connected layers, with max-pooling layers interleaved between some of them. The output layer is a 1000-way softmax layer. The pooling used here is max pooling.

Why is a 1000-way softmax output layer used?

This is because the ImageNet dataset contains 1000 different classes of images, so the final output layer has one node for each of these 1000 categories, and a softmax is applied over these nodes.

The basic architecture of AlexNet is as shown below:


Image from Anh H Reynolds Blog

The input to the AlexNet network is a 227 x 227 RGB image, so it has 3 different channels: red, green, and blue.

Then we have the First Convolution Layer in AlexNet, which has 96 different kernels, each of size 11 x 11, applied with a stride of 4. The output of the first convolution layer therefore has 96 different channels or feature maps (one per kernel), and each feature map is of size 55 x 55.

Calculations:

  • Size of Input: N = 227 x 227
  • Size of Convolution Kernels: f = 11 x 11
  • No. of Kernels: 96
  • Strides: S = 4
  • Padding: P = 0

Size of each feature map = [(N - f + 2P)/S] + 1

  • Size of each feature map = (227 - 11 + 0)/4 + 1 = 55

So every feature map after the first convolution layer is of the size 55 x 55.

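The same formula can be checked quickly in code. Below is a minimal Python sketch (the helper name and the stage list are illustrative, not from the original post) that applies the output-size formula to every convolution and pooling stage discussed in this article:

```python
def output_size(n, f, p, s):
    """Spatial size of a conv/pool output: floor((n - f + 2p) / s) + 1."""
    return (n - f + 2 * p) // s + 1

# (name, input size, kernel/window size, padding, stride) for each stage
stages = [
    ("conv1 11x11/4", 227, 11, 0, 4),   # -> 55
    ("pool1 3x3/2",    55,  3, 0, 2),   # -> 27
    ("conv2 5x5 pad2", 27,  5, 2, 1),   # -> 27
    ("pool2 3x3/2",    27,  3, 0, 2),   # -> 13
    ("conv3 3x3 pad1", 13,  3, 1, 1),   # -> 13
    ("conv4 3x3 pad1", 13,  3, 1, 1),   # -> 13
    ("conv5 3x3 pad1", 13,  3, 1, 1),   # -> 13
    ("pool5 3x3/2",    13,  3, 0, 2),   # -> 6
]

for name, n, f, p, s in stages:
    print(name, "->", output_size(n, f, p, s))
```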

After this convolution, we have an Overlapping Max Pool Layer, where max-pooling is done over a 3 x 3 window with a stride of 2. Since the pooling window (3 x 3) is larger than the stride (2), the pooling is performed over overlapping windows. After this pooling, the size of each feature map is reduced to 27 x 27 and the number of feature channels remains 96.

Calculations:

  • Size of Input: N = 55 x 55
  • Size of Pooling Window: f = 3 x 3
  • Strides: S = 2
  • Padding: P = 0

  • Size of each feature map = (55 - 3 + 0)/2 + 1 = 27

So every feature map after this pooling is of the size 27 x 27.


Then we have the Second Convolution Layer, where the kernel size is 5 x 5. In this case we use a padding of 2, so that the output of the convolution layer keeps the same spatial size as its input. The number of kernels used here is 256, so the output of this layer consists of 256 different channels or feature maps, each of size 27 x 27.

Calculations:

  • Size of Input: N = 27 x 27
  • Size of Convolution Kernels: f = 5 x 5
  • No. of Kernels: 256
  • Strides: S = 1
  • Padding: P = 2

  • Size of each feature map = (27 - 5 + 4)/1 + 1 = 27

So every feature map after the second convolution layer is of the size 27 x 27.


Now we again have an Overlapping Max Pool Layer, where max-pooling is done over a 3 x 3 window with a stride of 2, which again means the pooling is done over overlapping windows. The output of this layer is a set of 13 x 13 feature maps, and the number of channels remains 256.

Calculations:

  • Size of Input: N = 27 x 27
  • Size of Pooling Window: f = 3 x 3
  • Strides: S = 2
  • Padding: P = 0

  • Size of each feature map = (27 - 3 + 0)/2 + 1 = 13

So every feature map after this pooling is of the size 13 x 13.

Then we have Three Consecutive Convolution Layers. The first of these has 384 kernels of size 3 x 3 with padding equal to 1, giving 384 feature maps of size 13 x 13, which are passed on to the next convolution layer.

Calculations:

  • Size of Input: N = 13 x 13
  • Size of Convolution Kernels: f = 3 x 3
  • No. of Kernels: 384
  • Strides: S = 1
  • Padding: P = 1

  • Size of each feature map = (13 - 3 + 2)/1 + 1 = 13

In the second of these convolutions, the kernel size is 3 x 3 with padding equal to 1 and there are 384 kernels, which means the output of this convolution layer again has 384 channels or feature maps, each of size 13 x 13. Because the padding is 1 for a 3 x 3 kernel, the feature maps at the output of this layer keep the same size as the feature maps fed into it.

Calculations:

  • Size of Input: N = 13 x 13
  • Size of Convolution Kernels: f = 3 x 3
  • No. of Kernels: 384
  • Strides: S = 1
  • Padding: P = 1

  • Size of each feature map = (13 - 3 + 2)/1 + 1 = 13

The output of this second convolution is passed through another convolution layer, where the kernel size is again 3 x 3 and the padding equals 1, so this layer also generates feature maps of size 13 x 13. In this case, however, AlexNet uses 256 kernels, so the 384 input channels are converted to 256 output channels; in other words, 256 feature maps are generated at the end of this convolution, each of size 13 x 13.

Calculations:

  • Size of Input: N = 13 x 13
  • Size of Convolution Kernels: f = 3 x 3
  • No. of Kernels: 256
  • Strides: S = 1
  • Padding: P = 1

  • Size of each feature map = (13 - 3 + 2)/1 + 1 = 13

This is followed by the next Overlapping Max Pool Layer, where max-pooling is again done over a 3 x 3 window with a stride of 2. The number of channels remains 256, and the size of each feature map becomes 6 x 6.

Calculations:

  • Size of Input: N = 13 x 13
  • Size of Pooling Window: f = 3 x 3
  • Strides: S = 2
  • Padding: P = 0

  • Size of each feature map = (13 - 3 + 0)/2 + 1 = 6

Now we have the fully connected layers, which work like a multi-layer perceptron. The first two fully connected layers have 4096 nodes each. After the last max-pooling described above, we have a total of 6*6*256 = 9216 nodes or features, and each of these nodes is connected to every node in the first fully connected layer, so the number of connections here is 9216*4096. Every node of this first fully connected layer then feeds every node in the second fully connected layer, which also has 4096 nodes, giving 4096*4096 connections.

And then, in the end, we have an output layer with 1000 softmax channels. Thus the number of connections between the second fully connected layer and the output layer is 4096*1000.


Training on multiple GPUs


Original Image published in [AlexNet-2012]

As we can see from the figure, AlexNet was implemented as two parallel pipelines, because the network, trained on 1.2 million examples, was too big to fit on a single GPU. Half of the network's kernels are placed on one GPU and the other half on a second GPU, which made it possible to train the network on two GPU cards in parallel. The GPUs used were GTX 580 3GB cards, and the network took between five and six days to train. Cross-GPU parallelization (i.e., one GPU communicating with the other) happens only at certain points: for example, the kernels of layer 3 take input from all kernel maps in layer 2, whereas the kernels in layer 4 take input only from those kernel maps in layer 3 that reside on the same GPU.

Vanishing Gradient Problem

If we use a saturating non-linear activation function such as the sigmoid or tan hyperbolic (tanh) function, we run the risk of vanishing gradients: when training the network with gradient descent, the gradient of the error function may become so small that the update to the network parameters computed from it is almost negligible. That is the vanishing gradient problem.

Why does the Vanishing Gradient Problem arise?

As we know from the graph of the sigmoid function, if the input value is very high the output saturates to 1, and if the input value is very low the output saturates to 0. So when we take the gradient at these points, the gradient is almost 0. The same is true if our non-linear activation function is tanh.
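
A quick numerical check makes this saturation visible. The snippet below is purely illustrative (not from the original post) and just evaluates the sigmoid derivative at a few points:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # derivative of the sigmoid

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    # ~0.25 near x = 0, but practically 0 for large |x|
    print(f"x = {x:6.1f}   sigmoid'(x) = {sigmoid_grad(x):.6f}")
```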

How to prevent the Vanishing Gradient Problem?

To prevent the vanishing gradient problem we use the ReLU (Rectified Linear Unit) activation function. Since ReLU is max(x, 0), for x > 0 the gradient is a constant 1, which is the advantage of using ReLU as the non-linear activation function. Training with gradient descent is also much slower with saturating nonlinearities like tanh and sigmoid than with a non-saturating nonlinearity like ReLU. This can be seen in the following diagram, where a four-layer convolutional network with ReLUs (solid line) reached a 25% training error rate on the CIFAR-10 dataset six times faster than the same network run with tanh (dashed line) as the activation function. Thus, a network with ReLU learns roughly six times faster than one with saturating activation functions.

Original Image published in [AlexNet-2012]

Problem with using ReLU as an activation function

Unlike the sigmoid and tanh activation functions, whose outputs are bounded, the output of ReLU is unbounded: as x increases, the output of ReLU keeps increasing. To keep these responses in check, AlexNet normalizes the output of certain convolution layers through a process known as Local Response Normalization (or LRN).

Local Response Normalization

Local Response Normalization is a type of normalization in which strongly excited neurons are emphasized while the surrounding neurons in a local neighborhood are dampened at the same time. This operation is inspired by a phenomenon known as lateral inhibition in neurobiology, which refers to the capacity of an excited neuron to reduce the activity of its neighbors. The convolution responses are normalized so that the values produced by the unbounded ReLU activation stay limited. It is a non-trainable layer in the network.

So this is how the unbounded-output problem and the vanishing gradient problem are handled.

Local Response Normalization can be done across channels or within a channel. When the normalization is done across channels it is known as Inter-channel normalization, and when it occurs between features of the same channel it is known as Intra-channel normalization. Inter-channel normalization is the variant used in the AlexNet network. Based on the neighborhood, the two types of LRN can be seen in the figure below:

Image from the blog by Aqeel Anwar

  • Inter-channel normalization: Here the normalization is performed across the channels, so the neighborhood extends along the channel (depth) dimension. The normalized output at position (x,y) is given by the following formula:

Original Image published in [AlexNet-2012]

Here, b[i, (x,y)] is the normalized output at location (x,y) in the i-th channel and a[i, (x,y)] is the original value at location (x,y) in the i-th channel. We normalize a[i, (x,y)] by a factor given by k plus alpha times the sum of squares of a[j, (x,y)], where j runs over the neighboring channels: j varies from max(0, i - n/2) to min(N - 1, i + n/2), i.e. up to n/2 channels before and n/2 channels after channel i. The 0 and N - 1 bounds take care of the first and last channels. After this normalization the output is bounded, as shown in the subsequent figure.
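
For reference, the inter-channel formula shown in that figure can be written out as follows (N is the total number of channels; k, n, alpha, and beta are hyperparameters, set to k = 2, n = 5, alpha = 1e-4, beta = 0.75 in the paper):

\[ b^{i}_{x,y} = \frac{a^{i}_{x,y}}{\left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a^{j}_{x,y} \right)^{2} \right)^{\beta}} \]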

Image from the blog by Aqeel Anwar

  • Intra-channel normalization: Here the normalization occurs between neighboring neurons across the spatial surface of the same channel.

Image from the blog by Aqeel Anwar

Here, b[k, (x,y)] is the normalized output at location (x,y) in the k-th channel and a[k, (x,y)] is the original value at location (x,y) in the k-th channel. We normalize a[k, (x,y)] by a factor given by k plus alpha times the sum of squares of the feature values within the spatial neighborhood of (x,y). The min and max bounds take care of features at the boundary of the feature map. After this normalization the output is bounded, as shown in the subsequent figure.
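
Based on the description above, the intra-channel version can be written analogously (this is only a sketch, with W and H denoting the width and height of the feature map):

\[ b^{k}_{x,y} = \frac{a^{k}_{x,y}}{\left( k + \alpha \sum_{i=\max(0,\, x-n/2)}^{\min(W-1,\, x+n/2)} \; \sum_{j=\max(0,\, y-n/2)}^{\min(H-1,\, y+n/2)} \left( a^{k}_{i,j} \right)^{2} \right)^{\beta}} \]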

Image from the blog by Aqeel Anwar

NOTE: 'k' is used in the normalization factor in both cases to avoid division by zero; here 'k', 'alpha', 'beta', and the neighborhood size 'n' are hyperparameters.

Problem of Overfitting

As there are 60 million parameters to be trained, the network is prone to overfitting, which means it can learn or memorize the training data very well but fails to capture the general properties or features of the input data, so its performance on unseen test data may not be acceptable. To reduce this overfitting, additional augmented data was generated from the existing data; the augmentation (i.e., the generation of new images from the original images by making variations such as horizontal flipping, vertical flipping, zooming, etc.) was done by mirroring and by taking random crops from the input data. Another method by which the problem of overfitting was addressed is dropout regularization.
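
As a rough illustration of this kind of augmentation (mirroring plus random crops), here is a minimal TensorFlow sketch; the crop size and the use of tf.image here are illustrative assumptions, not the exact pipeline from the original AlexNet training code:

```python
import tensorflow as tf

def augment(image):
    """Randomly mirror the image and take a random crop from it."""
    image = tf.image.random_flip_left_right(image)            # horizontal mirroring
    image = tf.image.random_crop(image, size=[227, 227, 3])   # random crop
    return image

# Example: augment a single dummy 256 x 256 RGB image
dummy = tf.random.uniform([256, 256, 3])
print(augment(dummy).shape)   # (227, 227, 3)
```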

What is Dropout Regularization?

Image from Packt Subscription

In dropout, randomly selected neurons or nodes, each selected with a probability of 0.5, are temporarily dropped from the network, so the probability that a node is removed and the probability that it is retained are both 0.5. A dropped node does not pass its output to the subsequent nodes in the downstream layers, and during backpropagation no update takes place for it, since removing the node also removes its connections. This increases the number of iterations required to train the model, but it makes the model less vulnerable to overfitting and thus generalizes it better.
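
A minimal sketch of what dropout does at training time, assuming the standard formulation with a drop probability of 0.5 (illustrative only):

```python
import numpy as np

def dropout(activations, p_drop=0.5, seed=0):
    """Zero out each activation with probability p_drop (training time only)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(activations.shape) >= p_drop   # True = keep, False = drop
    return activations * mask

layer_output = np.array([0.7, 1.2, 0.3, 2.1, 0.9, 1.5])
print(dropout(layer_output))   # roughly half of the values become 0
```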

Facts and figures regarding the AlexNet architecture

  1. The model was trained with 1.2 million images.
  2. It has around 60 million parameters and 650,000 neurons.
  3. Even with 2 GPUs, the network took approximately a week to train.
  4. AlexNet takes 3 input channels, i.e. red, green, and blue, so grayscale images, which have only one channel, need to be replicated into the 3 channels R, G, and B before they can be fed to AlexNet.
  5. Most of the computation is carried out in the earlier (convolutional) stages of the network, while most of the parameters sit at the end of the network in the fully connected layers (4096*4096).
  6. For training, it uses Stochastic Gradient Descent with a momentum of 0.9, a batch size of 128 examples, and a weight decay of 0.0005.
  7. Three of the convolution layers (1, 2, and 5) are followed by a max-pooling layer.
  8. The kernels of the second, fourth, and fifth convolution layers are connected only to those kernel maps of the previous layer that reside on the same GPU, while the kernels of the third convolution layer and the fully connected layers are connected to all neurons of the previous layer on both GPUs.
  9. The weight (w) update rule was as follows:

Original Image published in [AlexNet-2012]
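
The update rule shown in that figure is the momentum SGD rule from the AlexNet paper and can be written as follows, where v is the momentum variable, epsilon the learning rate, and the last term is the average gradient over batch D_i:

\[ v_{i+1} = 0.9 \, v_{i} \;-\; 0.0005 \, \epsilon \, w_{i} \;-\; \epsilon \left\langle \frac{\partial L}{\partial w} \Big|_{w_{i}} \right\rangle_{D_{i}}, \qquad w_{i+1} = w_{i} + v_{i+1} \]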

  10. The initial weights of all layers were drawn from a Gaussian distribution with mean 0 and standard deviation 0.01. The initial bias was set to 1 for the second, fourth, and fifth convolution layers and for all the fully connected layers, and to 0 for all other layers.

AlexNet code using Keras

Here we are going to use the oxflower17 dataset prepared by Oxford, which is available through the tflearn library.

Also, one important thing to note here is that the images loaded through the tflearn library have a size of 224 x 224, so we'll be using 224 x 224 images instead of the 227 x 227 used in the original architecture.

Importing the required libraries
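
A minimal sketch of the imports this section relies on (tensorflow.keras for the model, tflearn for the dataset); the exact import list is an assumption:

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, Flatten,
                                     Dense, Dropout, BatchNormalization)
import tflearn.datasets.oxflower17 as oxflower17
```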

Loading the dataset
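
A sketch of loading oxflower17 through tflearn; load_data resizes the images to 224 x 224, and one_hot=True encodes the 17 labels as one-hot vectors:

```python
# Downloads the 17-category Oxford flowers dataset on first use
x, y = oxflower17.load_data(one_hot=True)
```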

Checking the shape of X and Y
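
Checking the shapes; oxflower17 has 80 images for each of its 17 classes, so the expected output is shown in the comments:

```python
print(x.shape)   # (1360, 224, 224, 3)
print(y.shape)   # (1360, 17)
```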

Creating the model
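
Below is a sketch of an AlexNet-style model in Keras, following the layer sizes described earlier. BatchNormalization is used here as a common stand-in for the original Local Response Normalization, and the final Dense layer has 17 outputs because oxflower17 has 17 classes instead of ImageNet's 1000:

```python
model = Sequential([
    # Conv 1: 96 kernels of 11 x 11, stride 4
    Conv2D(96, (11, 11), strides=4, activation='relu',
           input_shape=(224, 224, 3)),
    BatchNormalization(),                        # stand-in for LRN
    MaxPooling2D(pool_size=(3, 3), strides=2),   # overlapping max pool

    # Conv 2: 256 kernels of 5 x 5, 'same' padding keeps the spatial size
    Conv2D(256, (5, 5), padding='same', activation='relu'),
    BatchNormalization(),
    MaxPooling2D(pool_size=(3, 3), strides=2),

    # Conv 3-5: 384, 384, 256 kernels of 3 x 3
    Conv2D(384, (3, 3), padding='same', activation='relu'),
    Conv2D(384, (3, 3), padding='same', activation='relu'),
    Conv2D(256, (3, 3), padding='same', activation='relu'),
    MaxPooling2D(pool_size=(3, 3), strides=2),

    # Fully connected layers with dropout
    Flatten(),
    Dense(4096, activation='relu'),
    Dropout(0.5),
    Dense(4096, activation='relu'),
    Dropout(0.5),
    Dense(17, activation='softmax'),             # 17 flower classes
])
```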

Summary of the model
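
Printing the summary to inspect the layer output shapes and the parameter count:

```python
model.summary()
```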

  • There are approx 46 million trainable parameters here, as can be seen from the subsequent image.
Image by Author

Compiling the model
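
A sketch of a typical compilation step; categorical cross-entropy matches the one-hot labels, and Adam is chosen here for convenience (the original AlexNet was trained with momentum SGD):

```python
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```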

Training the model
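
Finally, a sketch of the training call; the batch size, epoch count, and validation split below are illustrative choices:

```python
model.fit(x, y,
          batch_size=64,
          epochs=50,
          validation_split=0.2,
          shuffle=True)
```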
