SWAP: Softmax-Weighted Average Pooling

Blake Elias is a Researcher at the New England Complex Systems Institute. Shawn Jain is an AI Resident at Microsoft Research.

Our method, softmax-weighted average pooling (SWAP), applies average-pooling, but re-weights the inputs by the softmax of each window.

We present a pooling method for convolutional neural networks as an alternative to max-pooling or average pooling. Our method, softmax-weighted average pooling (SWAP), applies average-pooling, but re-weights the inputs by the softmax of each window. While the forward-pass values are nearly identical to those of max-pooling, SWAP’s backward pass has the property that all elements in the window receive a gradient update, rather than just the maximum one. We hypothesize that these richer, more accurate gradients can improve the learning dynamics. Here, we instantiate this idea and investigate learning behavior on the CIFAR-10 dataset. We find that SWAP neither allows us to increase learning rate nor yields improved model performance.

Origins

While watching James Martens’ lecture on optimization, from DeepMind / UCL’s Deep Learning course, we noted his point that as learning progresses, you must either lower the learning rate or increase batch size to ensure convergence. Either of these techniques results in a more accurate estimate of the gradient. This got us thinking about the need for accurate gradients. Separately, we had been doing an in-depth review of how backpropagation computes gradients for all types of layers. In doing this exercise for convolution and pooling, we noted that max-pooling only computes a gradient with respect to the maximum value in a window. This discards information — how can we make this better? Could we get a more accurate estimate of the gradient by using all the information?

Max-pooling discards gradient information — how can we make this better?

Further Background

Max-Pooling is typically used in CNNs for vision tasks as a downsampling method. For example, AlexNet used 3x3 Max-Pooling. [cite]

In vision applications, max-pooling takes a feature map as input, and outputs a smaller feature map. If the input image is 4x4, a 2x2 max-pooling operator with a stride of 2 (no overlap) will output a 2x2 feature map. The 2x2 kernel of the max-pooling operator has 2x2 non-overlapping ‘positions’ on the input feature map. For each position, the maximum value in the 2x2 window is selected as the value in the output feature map. The other values are discarded.
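The 4x4 example above can be sketched in a few lines of NumPy. This is only an illustration: the function name `max_pool_2x2` and the input values are made up here, and the sketch assumes a single 2-D feature map with even height and width.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max-pooling with stride 2 on a 2-D feature map (hypothetical helper)."""
    h, w = x.shape
    # Split the map into non-overlapping 2x2 windows: shape (h/2, 2, w/2, 2).
    windows = x.reshape(h // 2, 2, w // 2, 2)
    # Keep only the largest value of each window; the rest are discarded.
    return windows.max(axis=(1, 3))

# Made-up 4x4 input feature map.
x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 8., 1., 0.],
              [7., 6., 3., 2.]])

pooled = max_pool_2x2(x)
print(pooled.shape)  # (2, 2)
print(pooled)        # one maximum per 2x2 window
```

Each of the four output values comes from exactly one input position, which is the property the rest of the post is concerned with.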

The implicit assumption is that “bigger values are better,” i.e. larger values are more important to the final output. This modelling decision is motivated by intuition, although it may not be strictly correct. [Ed.: Maybe the other values matter as well! In a near-tie situation, propagating gradients to the second-largest value could make it the largest value. This may change the trajectory the model takes as it learns. Updating the second-largest value as well could be the better learning trajectory to follow.]

You might be wondering, is this differentiable? After all, deep learning requires that all operations in the model be differentiable, in order to compute gradients. In the purely mathematical sense, this is not a differentiable operation. In practice, in the backward pass, all positions corresponding to the maximum simply copy the inbound gradients; all the non-maximum positions simply set their gradients to zero. PyTorch implements this as a custom CUDA kernel (this function invokes this function).
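The backward rule described above (maximum positions copy the inbound gradient, everything else gets zero) can be sketched as follows. This is a NumPy illustration of the rule as stated in the text, not PyTorch's CUDA kernel; the helper name and the example values are assumptions.

```python
import numpy as np

def max_pool_2x2_backward(x, grad_out):
    """Route each inbound gradient to the max position(s) of its 2x2 window."""
    h, w = x.shape
    windows = x.reshape(h // 2, 2, w // 2, 2)
    window_max = windows.max(axis=(1, 3), keepdims=True)
    # Boolean mask: True where an input equals its window's maximum.
    mask = windows == window_max
    # Copy the inbound gradient to max positions; zero everywhere else.
    grad_windows = mask * grad_out[:, None, :, None]
    return grad_windows.reshape(h, w)

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 8., 1., 0.],
              [7., 6., 3., 2.]])
grad_out = np.array([[10., 20.],
                     [30., 40.]])  # made-up inbound gradient

grad_in = max_pool_2x2_backward(x, grad_out)
print(grad_in)
print(np.count_nonzero(grad_in), "of", grad_in.size, "entries are non-zero")
```

With 2x2 windows and no ties, only one entry in four receives a gradient.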

In other words, Max-Pooling generates sparse gradients. And it works! From AlexNet [cite] to ResNet [cite] to Reinforcement Learning [cite cite], it’s widely used.

Many variants have been developed. Average-Pooling outputs the average, instead of the max, over the window. Dilated Max-Pooling makes the window non-contiguous: it samples the input in a checkerboard-like pattern.

[Image source: arXiv, via StackOverflow.]

Controversially, Geoff Hinton doesn’t like Max-Pooling:

The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.

If the pools do not overlap, pooling loses valuable information about where things are. We need this information to detect precise relationships between the parts of an object. Its [sic] true that if the pools overlap enough, the positions of features will be accurately preserved by “coarse coding” (see my paper on “distributed representations” in 1986 for an explanation of this effect). But I no longer believe that coarse coding is the best way to represent the poses of objects relative to the viewer (by pose I mean position, orientation, and scale).

[Source: Geoff Hinton on Reddit.]

Motivation

Max-Pooling generates sparse gradients. With better gradient estimates, could we take larger steps by increasing learning rate, and therefore converge faster?

Sparse gradients discard too much information. With better gradient estimates, could we take larger steps by increasing learning rate, and therefore converge faster?

Although the outbound gradients generated by Max-Pool are sparse, this operation is typically used in a Conv → Max-Pool chain of operations. Notice that the trainable parameters (i.e., the filter values, F) are all in the Conv operator. Note also, that:

dL/dF = Conv(X, dL/dO), where:

dL/dF are the gradients with respect to the convolutional filter,
dL/dO is the outbound gradient from Max-Pool, and
X is the input to Conv (forward).

As a result, all positions in the convolutional filter F get gradients. However, those gradients are computed from a sparse matrix dL/dO instead of a dense matrix. (The degree of sparsity depends on the Max-Pool window size.)

Figure 3: Max pooling generates sparse gradients (forward and backward passes). (Authors’ image)

Note also that dL/dF is not sparse, as each sparse entry of dL/dO sends a gradient value back to all entries dL/dF.
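To make the relation dL/dF = Conv(X, dL/dO) concrete, here is a small NumPy sketch. It assumes a single-channel input, a 2x2 filter, stride 1 and no padding, and it takes "Conv" to mean the cross-correlation used by deep learning frameworks. The helper `correlate2d_valid` and all numbers are made up for illustration.

```python
import numpy as np

def correlate2d_valid(x, k):
    """Plain 'valid' 2-D cross-correlation (what deep learning calls convolution)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

# Made-up 5x5 input; a 2x2 filter would give a 4x4 conv output O.
X = np.arange(1.0, 26.0).reshape(5, 5)

# Sparse outbound gradient from a 2x2 max-pool: one non-zero per window.
dL_dO = np.zeros((4, 4))
dL_dO[1, 1] = 1.0
dL_dO[0, 3] = 2.0
dL_dO[2, 0] = 3.0
dL_dO[3, 2] = 4.0

# Filter gradient: correlate the input with the outbound gradient.
dL_dF = correlate2d_valid(X, dL_dO)

print(np.count_nonzero(dL_dO), "of", dL_dO.size, "entries of dL/dO are non-zero")
print(dL_dF)  # 2x2, every entry non-zero
```

Every filter entry receives a gradient, but each one is a sum over only the four non-zero positions of dL/dO rather than all sixteen.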

But this raises a question. While dL/dF is not sparse itself, its entries are calculated based on an averaging of sparse inputs. If its inputs (dL/dO, the outbound gradient of Max-Pool) were dense, could dL/dF be a better estimate of the true gradient? How can we make dL/dO dense while still retaining the “bigger values are better” assumption of Max-Pool?

One solution is Average-Pooling. There, all activations pass a gradient backwards, rather than just the max in each window. However, it violates MaxPool’s assumption that “bigger values are better.”
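For contrast, a sketch of the backward pass of 2x2 average-pooling with stride 2: every input in a window receives the same share of the inbound gradient, regardless of its value. The helper name and the numbers are illustrative assumptions.

```python
import numpy as np

def avg_pool_2x2_backward(grad_out):
    """Spread each inbound gradient evenly over its 2x2 window."""
    # Each of the 4 inputs contributes 1/4 to the average, so each gets grad / 4.
    return np.repeat(np.repeat(grad_out, 2, axis=0), 2, axis=1) / 4.0

grad_out = np.array([[10., 20.],
                     [30., 40.]])  # made-up inbound gradient

grad_in = avg_pool_2x2_backward(grad_out)
print(grad_in)  # dense, but uniform within each window
```

The gradient is dense, but it carries no preference for larger activations.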

Enter Softmax-Weighted Average-Pooling (SWAP). The forward pass is best explained as pseudo-code:

average_pool(O, weights=softmax_per_window(O))

Figure 4: SWAP produces a value almost the same as max-pooling — but passes gradients back to all entries in the window. (Authors’ image)

The softmax operator normalizes the values into a probability distribution; however, it heavily favors large values. This gives it a max-pool-like effect.
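One way to read the pseudo-code is sketched below in NumPy. It assumes 2x2 windows with stride 2 on a single 2-D map, and it interprets the weighted average as a sum with softmax weights (which already sum to 1 per window). The function name `swap_pool_2x2` is hypothetical, and this is not the authors' released code.

```python
import numpy as np

def swap_pool_2x2(x):
    """Softmax-weighted average pooling over non-overlapping 2x2 windows (sketch)."""
    h, w = x.shape
    windows = x.reshape(h // 2, 2, w // 2, 2)
    # Softmax within each window; subtracting the max keeps exp() stable.
    shifted = windows - windows.max(axis=(1, 3), keepdims=True)
    exp = np.exp(shifted)
    weights = exp / exp.sum(axis=(1, 3), keepdims=True)
    # The weights sum to 1 per window, so the weighted sum is a weighted average.
    return (weights * windows).sum(axis=(1, 3))

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 8., 1., 0.],
              [7., 6., 3., 2.]])

windows = x.reshape(2, 2, 2, 2)
max_out = windows.max(axis=(1, 3))
mean_out = windows.mean(axis=(1, 3))
swap_out = swap_pool_2x2(x)
swap_out_scaled = swap_pool_2x2(10.0 * x)  # same map with larger gaps

print(swap_out)         # lies between the window mean and the window max
print(swap_out_scaled)  # much closer to 10 * max_out
```

In this sketch the output always lies between the window's mean and its max, and it approaches the max as the gaps between the values in a window grow.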

On the backward pass, dL/dO is dense, because each outbound activation in A depends on all activations in its window — not just the max value. Non-max values in O now receive relatively small, but non-zero, gradients. Bingo!
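The dense gradient can be checked for a single window. Writing the window's output as y = Σ sᵢ xᵢ with s = softmax(x), differentiating gives ∂y/∂xⱼ = sⱼ (1 + xⱼ − y). The sketch below, with hypothetical helper names and made-up values, compares that closed form against a finite-difference estimate.

```python
import numpy as np

def swap_window(x):
    """SWAP output for one flattened window, plus the softmax weights."""
    s = np.exp(x - x.max())
    s /= s.sum()
    return float(np.dot(s, x)), s

def swap_window_grad(x):
    """Closed-form gradient of the SWAP output w.r.t. each input in the window."""
    y, s = swap_window(x)
    return s * (1.0 + x - y)

x = np.array([1.0, 2.0, 3.0, 4.0])  # made-up window values
grad = swap_window_grad(x)

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.array([
    (swap_window(x + eps * e)[0] - swap_window(x - eps * e)[0]) / (2 * eps)
    for e in np.eye(len(x))
])

print(grad)     # every entry is non-zero; the max position gets the largest share
print(numeric)  # should agree with the closed form
```

Note that in this sketch the gradient for a small value can be negative: pushing a small value up shifts weight away from the larger ones.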

Experimental Setup

We conducted our experiments on CIFAR10. Our code is available here. We fixed the architecture of the network to:

We tested three different variants of the “Pool” layer: two baselines (Max-Pool and Average-Pool), in addition to SWAP. Models were trained for 100 epochs using SGD, LR=1e-3 (unless otherwise mentioned).

We also trained SWAP with a {25, 50, 400}% increase in LR. This was to test the idea that, with more accurate gradients we could take larger steps, and with larger steps the model would converge faster.
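For reference, the learning rates implied by those increases can be worked out as below, assuming each percentage is applied to the base LR of 1e-3 stated above. The variable names are illustrative.

```python
base_lr = 1e-3                 # base learning rate from the setup above
increases_pct = [25, 50, 400]  # percentage increases tested with SWAP

# A p% increase multiplies the base rate by (1 + p / 100).
lrs = [base_lr * (1 + p / 100) for p in increases_pct]

for p, lr in zip(increases_pct, lrs):
    print(f"+{p}% -> LR = {lr:.5f}")
```

Under that assumption, the largest setting is five times the base learning rate.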

Results

Discussion

SWAP performs worse than both baselines, and we do not understand why. An increase in LR provided no benefit; generally, performance relative to baseline got worse as LR increased. The run with a 400% LR increase did outperform the run with a 50% increase, which we attribute to randomness: we tested with only a single random seed and report only a single trial. Another possible explanation is simply the ability to “cover more ground” with a higher LR.

An increase in LR provided no benefit; generally, worse performance vs baseline was observed as LR increased.

Future Work and Conclusion

While SWAP did not show improvement, we still want to try several experiments:

Overlapping pool windows. One possibility is to use overlapping pool windows (i.e. stride = 1), rather than the disjoint windows we used here (with stride = 2). Convolutional architectures like AlexNet and ResNet both use overlapping pool windows. So, for a fair comparison, it would be sensible to compare against something closer to the state of the art, rather than the architecture we used here for simplicity. Indeed, Hinton’s critique of max-pooling is most stringent in the case of non-overlapping pool windows, with the reasoning that this throws out spatial information.

Histogram of activations. We would like to try Max-Pool & SWAP with the exact same initialization, train both, and compare the distributions of gradients. Investigating the difference in gradients may offer a better understanding of the stark contrast in training behavior.

Improving gradient accuracy is still an exciting area. How else can we modify the model or the gradient computation to improve gradient accuracy?

Translated from: https://towardsdatascience.com/swap-softmax-weighted-average-pooling-70977a69791b
