This series only adds personal study notes and supplementary derivations to some knowledge points of the original course. If there are any mistakes, corrections and feedback are welcome. After studying Andrew Ng's course, I organized it into text to make review and look-up easier. Since I have been studying English, this series is primarily in English, and readers are also encouraged to rely mainly on the English with the Chinese as a supplement, to lay the groundwork for reading academic papers in related fields later on. - ZJ

Coursera course | deeplearning.ai | NetEase Cloud Classroom


Please credit the author and source when reposting: ZJ, WeChat public account 「SelfImprovementLab」

知乎:https://zhuanlan.zhihu.com/c_147249273

CSDN:http://blog.csdn.net/junjun_zhao/article/details/79071033


1.7 Understanding Dropout (理解 Dropout)

(Subtitle source: NetEase Cloud Classroom)

Dropout does this seemingly crazy thing of randomly knocking out units in your network. Why does it work so well as a regularizer? Let's gain some better intuition. In the previous video, I gave this intuition that dropout randomly knocks out units in your network. So it's as if on every iteration, you're working with a smaller neural network, and so using a smaller neural network seems like it should have a regularizing effect.

Dropout 可以随机删除网络中的神经单元,做法有点疯狂,它为什么可以通过正则化发挥这么大作用呢,我们来更直观地理解一下,上节课,我们已经对 dropout 随机删除网络中的神经单元有了一个直观了解,好像每次迭代之后 神经网络都会变得比以前更小,因此采用一个较小神经网络好像和使用正则化的效果是一样的。

Here's a second intuition, which is, let's look at it from the perspective of a single unit. Let's say this one. Now, for this unit to do its job, it has four inputs and it needs to generate some meaningful output. Now with dropout, the inputs can get randomly eliminated. Sometimes those two units will get eliminated, sometimes a different unit will get eliminated. So, what this means is that this unit, which I'm circling in purple, can't rely on any one feature, because any one feature could go away at random, or any one of its own inputs could go away at random. So this particular unit would be reluctant to put all of its bets on, say, just this one input, right? We'd be reluctant to put too much weight on any one input, because it can go away. So this unit will be more motivated to spread out its weights and give a little bit of weight to each of the four inputs to this unit. And by spreading out the weights, this will tend to have the effect of shrinking the squared norm of the weights. And so, similar to what we saw with L2 regularization, the effect of implementing dropout is that it shrinks the weights and does some of that outer regularization that helps prevent over-fitting. But it turns out that dropout can formally be shown to be an adaptive form of L2 regularization, where the L2 penalty on different weights is different, depending on the size of the activations being multiplied by those weights.

第二个直观认识是,我们从单个神经元入手,如图,这个单元的工作就是接收输入并生成一些有意义的输出,使用 dropout 后,该单元的输入可能被随机清除,有时这两个单元会被删除,有时会删除其它单元,就是说 我用紫色圈起来的这个单元,它不能依靠任何一个特征,因为任何特征都有可能被随机清除,或者说该单元的任何一个输入也都可能被随机清除,所以该单元不愿意把所有赌注都放在某一个输入上,不愿意给任何一个输入加上太多权重,因为它可能会被删除,因此该单元将更倾向于把权重分散开,给这个单元的四个输入各分配一点权重,通过分散所有权重, dropout 将产生收缩权重的平方范数的效果,和我们之前讲过的 L2 正则化类似,实施 dropout 的结果是它会压缩权重,并完成一些预防过拟合的外层正则化。事实证明, dropout 可以被正式证明是 L2 正则化的一种自适应形式,只是施加在不同权重上的 L2 惩罚是不同的,它取决于与该权重相乘的激活值的大小。
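To make the "inputs can get randomly eliminated" idea concrete, here is a minimal NumPy sketch of inverted dropout applied to one layer's activations. It is not code from the course; the array `a`, the random seed, and `keep_prob = 0.8` are purely illustrative.

```python
import numpy as np

np.random.seed(1)

# Illustrative activations of a hidden layer: 4 units (think of them as the four
# inputs to the unit circled in purple) across 5 training examples.
a = np.random.randn(4, 5)

keep_prob = 0.8  # probability of keeping each unit

# Random mask: each entry is kept with probability keep_prob.
d = np.random.rand(*a.shape) < keep_prob

# Knock out the dropped units and scale up the survivors ("inverted dropout"),
# so the expected value of the activations stays the same.
a = np.multiply(a, d) / keep_prob

print(d)  # which inputs survived this iteration
print(a)  # a downstream unit cannot rely on any single one of them
```

Dividing by `keep_prob` is what keeps the expected value of the activations unchanged, which is why this variant is called inverted dropout.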

But to summarize, it is possible to show that dropout has a similar effect to L2 regularization. Only this L2 regularization applied to different weights can be a little bit different, and even more adaptive to the scale of different inputs. One more detail for when you're implementing dropout. Here's a network where you have three input features. There are seven hidden units here, then seven, three, two, one. So, one of the parameters we had to choose was the keep_prob, which is the chance of keeping a unit in each layer. So, it is also feasible to vary keep_prob by layer. So for the first layer, your matrix $W^{[1]}$ will be three by seven. Your second weight matrix will be seven by seven. $W^{[3]}$ will be seven by three, and so on. And so $W^{[2]}$ is actually the biggest weight matrix, because the largest set of parameters is in $W^{[2]}$, which is seven by seven. So to reduce over-fitting of that matrix, maybe for this layer, I guess this is layer two, you might have a keep_prob that's relatively low, say 0.5, whereas for other layers, where you might worry less about over-fitting, you could have a higher keep_prob, maybe 0.7, maybe this one is 0.7. And for layers where we don't worry about over-fitting at all, you can have a keep_prob of 1.0.

总结一下, dropout 的功能类似于 L2 正则化,与 L2 正则化不同的是,被应用的方式不同, dropout 也会有所不同,甚至更适用于不同的输入范围。实施 dropout 的另一个细节是,这是一个拥有三个输入特征的网络,这是它的 7 个隐藏单元,然后是 7 个、3 个、2 个和 1 个,其中一个要选择的参数是 keep_prob ,它代表每一层上保留单元的概率,所以不同层的 keep_prob 也可以变化。第一层,矩阵 $W^{[1]}$ 是 3x7,第二个权重矩阵是 7x7,$W^{[3]}$ 是 7x3,以此类推。$W^{[2]}$ 是最大的权重矩阵,因为 $W^{[2]}$ 拥有最大参数集,即 7x7。为了减少该矩阵的过拟合,对于这一层(我认为这是第二层),它的 keep_prob 值应该相对较低,假设是 0.5;对于其它层,过拟合的程度可能没那么严重,它们的 keep_prob 值可能高一些,可能是 0.7,这里也是 0.7;如果在某一层我们不必担心其过拟合的问题,那么 keep_prob 可以为 1。
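A quick way to see why layer two is the natural place for a smaller keep_prob is to print the weight-matrix shapes of the 3-7-7-3-2-1 network quoted above. This is only an illustrative sketch: the lecture mentions the values 0.5, 0.7, and 1.0, but the exact assignment of those values to layers below is an assumption for illustration.

```python
# Layer sizes from the example: 3 input features, then 7, 7, 3, 2, 1 units.
layer_dims = [3, 7, 7, 3, 2, 1]

# Illustrative per-layer keep_prob: lower where the weight matrix is largest
# (layer 2, which is 7x7), and 1.0 where over-fitting is not a concern.
keep_probs = {1: 0.7, 2: 0.5, 3: 0.7, 4: 1.0, 5: 1.0}

for l in range(1, len(layer_dims)):
    # Shapes as quoted in the lecture: W^[1] is 3x7, W^[2] is 7x7, W^[3] is 7x3, ...
    rows, cols = layer_dims[l - 1], layer_dims[l]
    print(f"W^[{l}]: {rows} x {cols} ({rows * cols} weights), keep_prob = {keep_probs[l]}")
```

Layer two's 7x7 matrix holds the most parameters (49 weights), which is why the lecture suggests applying the strongest dropout around it.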

So, for clarity, these are the numbers I'm drawing in the purple boxes. These could be different keep_probs for different layers. Notice that a keep_prob of 1.0 means that you're keeping every unit, and so you're really not using dropout for that layer. But for layers where you're more worried about over-fitting, really the layers with a lot of parameters, you can set keep_prob to be smaller to apply a more powerful form of dropout. It's kind of like cranking up the regularization parameter lambda for L2 regularization, where you try to regularize some layers more than others. And technically, you can also apply dropout to the input layer, where you have some chance of just eliminating one or more of the input features, although in practice you usually don't do that very often. And so, a keep_prob of 1.0 is quite common for the input layer. You could also use a very high value, maybe 0.9, but it's much less likely that you want to eliminate half of the input features. So usually, if you apply dropout at all to the input layer, the keep_prob would be a number close to one. So just to summarize, if you're more worried about some layers over-fitting than others, you can set a lower keep_prob for some layers than others. The downside is that this gives you even more hyperparameters to search over using cross-validation. One other alternative might be to have some layers where you apply dropout and some layers where you don't, and then just have one hyperparameter, which is the keep_prob for the layers where you do apply dropout.

为了表达清楚,这些是我标在紫色方框里的数字,每层 keep_prob 的值都可能不同。注意 keep_prob 的值是 1 意味着保留所有单元,并且不在这一层使用 dropout 。对于更担心出现过拟合、且含有诸多参数的层,我们可以把 keep_prob 设置成比较小的值,以便应用更强大的 dropout ,这有点像加大 L2 正则化的正则化参数 λ,我们尝试对某些层施行更多正则化。从技术上讲,我们也可以对输入层应用 dropout ,即有一定概率删除一个或多个输入特征,虽然现实中我们通常不这么做,输入层的 keep_prob 值取 1 是非常常见的,也可以用一个很高的值,或许是 0.9,但是消除一半的输入特征是不太可能的。所以通常来说,即使你对输入层应用 dropout, keep_prob 的值也会接近于 1。总结一下,如果你担心某些层比其它层更容易发生过拟合,可以把某些层的 keep_prob 值设置得比其它层更低,缺点是你需要用交叉验证搜索更多的超参数;另一种方案是在一些层上应用 dropout ,而有些层不用 dropout ,这样就只有一个超参数,即应用 dropout 的那些层共用的 keep_prob 。
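Putting the per-layer choice into code, here is a hedged sketch of a forward pass where each layer carries its own keep_prob, and keep_prob = 1.0 simply means "no dropout for this layer". The network shapes, ReLU activations, and specific keep_prob values are assumptions for illustration, not the course's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_with_dropout(X, weights, biases, keep_probs, training=True):
    """Forward pass where layer l uses its own keep_probs[l].

    A keep_prob of 1.0 means dropout is effectively switched off for that layer.
    """
    a = X
    L = len(weights)
    for l in range(L):
        z = weights[l] @ a + biases[l]
        a = np.maximum(0, z) if l < L - 1 else z      # ReLU hidden layers, linear output
        if training and keep_probs[l] < 1.0:
            mask = rng.random(a.shape) < keep_probs[l]
            a = a * mask / keep_probs[l]              # inverted dropout: rescale survivors
    return a

# Illustrative network: 3 input features -> 7 -> 7 -> 1 output unit.
dims = [3, 7, 7, 1]
weights = [rng.standard_normal((dims[l + 1], dims[l])) * 0.1 for l in range(len(dims) - 1)]
biases = [np.zeros((dims[l + 1], 1)) for l in range(len(dims) - 1)]

X = rng.standard_normal((3, 5))        # 3 features, 5 examples
keep_probs = [0.5, 0.7, 1.0]           # stronger dropout around the big 7x7 matrix, none at the output
print(forward_with_dropout(X, weights, biases, keep_probs).shape)   # (1, 5)
```

Only the layers whose keep_prob is below 1.0 add a hyperparameter to tune, which matches the alternative mentioned above of sharing a single keep_prob across just the layers that use dropout.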

And before we wrap up, just a couple of implementational tips. Many of the first successful implementations of dropout were in computer vision. In computer vision, the input size is so big, you're inputting all these pixels, that you almost never have enough data. And so dropout is very frequently used in computer vision, and there are some computer vision researchers that pretty much always use it, almost as a default. But really, the thing to remember is that dropout is a regularization technique; it helps prevent over-fitting. And so, unless my algorithm is over-fitting, I wouldn't actually bother to use dropout. So it's used somewhat less often in other application areas. It's just that with computer vision, you usually don't have enough data, so you're almost always over-fitting, which is why there tend to be some computer vision researchers who swear by dropout. But that intuition doesn't always generalize, I think, to other disciplines.

结束前分享两个实施过程中的技巧。dropout 最早的许多成功应用都是在计算机视觉领域,计算机视觉中的输入量非常大,输入了太多像素,以至于几乎没有足够的数据,所以 dropout 在计算机视觉中应用得比较频繁,有些计算机视觉研究人员非常喜欢用它,几乎成了默认的选择。但是要牢记一点, dropout 是一种正则化方法,它有助于预防过拟合,因此除非算法过拟合,不然我是不会使用 dropout 的,所以它在其它应用领域用得比较少。主要是在计算机视觉领域,因为我们通常没有足够的数据,所以一直存在过拟合,这就是有些计算机视觉研究人员如此钟情 dropout 的原因,但我认为这种直觉并不一定能推广到其它学科。

One big downside of dropout is that the cost function J is no longer well-defined. On every iteration, you are randomly killing off a bunch of nodes. And so, if you are double-checking the performance of gradient descent, it's actually harder to double-check that you have a well-defined cost function J that is going downhill on every iteration, because the cost function J that you're optimizing is actually less well-defined, or is certainly harder to calculate. So you lose this debugging tool of plotting a graph like this. So what I usually do is turn off dropout, set keep_prob equal to one, run my code and make sure that J is monotonically decreasing, and then turn on dropout and hope that I didn't introduce bugs into my code during dropout. Because you need other ways, I guess, other than plotting these figures, to make sure that your code is working, and that it's working even with dropout. So with that, there are still a few more regularization techniques that are worth your knowing. Let's talk about a few more such techniques in the next video.

Dropout 一大缺点就是代价函数 J 不再被明确定义,每次迭代都会随机移除一些节点,如果要反复检查梯度下降的性能,实际上很难确认定义明确的代价函数 J 在每次迭代后都在下降,因为我们所优化的代价函数 J 实际上并没有被明确定义,或者说在某种程度上很难计算,所以我们失去了绘制这种图的调试工具。我通常会先关闭 dropout ,将 keep_prob 的值设为 1,运行代码,确保代价函数 J 单调递减,然后再打开 dropout ,并希望在使用 dropout 的过程中代码并未引入 bug。因为你需要用其它方法(而不是绘制这些图)来确保代码正常工作,即使在使用 dropout 时也能正常工作。值得大家学习的正则化方法并不止这一个,我们下节课再讲。
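The debugging recipe in this paragraph can be captured in a few lines. The sketch below is a toy stand-in and assumes nothing from the course: `toy_step` is a hypothetical one-iteration update doing gradient descent on J(w) = w² with a dropout-like mask, just to show that with keep_prob = 1.0 the cost should fall on every iteration, while with dropout enabled that check is no longer reliable.

```python
import numpy as np

rng = np.random.default_rng(0)

def strictly_decreasing(costs):
    """True if the cost went downhill on every single iteration."""
    return all(later < earlier for earlier, later in zip(costs, costs[1:]))

# Toy stand-in for one training iteration: gradient descent on J(w) = w^2,
# with a dropout-like random mask applied to the update.
w = 5.0
def toy_step(keep_prob):
    global w
    keep = rng.random() < keep_prob            # keep_prob = 1.0 always keeps the update
    grad = 2.0 * w * keep / keep_prob
    w -= 0.01 * grad
    return w ** 2                              # cost J after this iteration

# Step 1: turn dropout off (keep_prob = 1.0) and verify J decreases monotonically.
costs = [toy_step(keep_prob=1.0) for _ in range(200)]
print(strictly_decreasing(costs))              # expect True: J is well defined here

# Step 2: turn dropout back on (e.g. keep_prob = 0.8) and train for real;
# the check is no longer useful, because some iterations leave J unchanged or noisy.
w = 5.0
costs = [toy_step(keep_prob=0.8) for _ in range(200)]
print(strictly_decreasing(costs))              # usually False
```

In a real network the same idea applies: run with keep_prob = 1.0 first, confirm the J-versus-iteration plot is monotonically decreasing, then re-enable dropout.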


Key takeaways:

Understanding Dropout

Another way to understand Dropout:

Consider a single neuron. Its job is to take inputs and produce some meaningful output. Once Dropout is added, each of its input features may be randomly eliminated, so the neuron can no longer rely too heavily on any particular input feature; in other words, it will not put too large a weight on any single input.

So, by spreading out its weights, dropout produces an effect of shrinking the squared norm of the weights, similar to L2 regularization.

Different layers can use different values of keep_prob. In general, layers with few neurons can use keep_prob = 1.0, while layers with many neurons should use a smaller keep_prob.

Drawback:

A major drawback of dropout is that the cost function is no longer well defined, because every iteration randomly eliminates some neurons, so we can no longer plot a curve of $J(W,b)$ decreasing with each iteration.

Using Dropout:

Turn off dropout by setting keep_prob = 1.0;
run the code and make sure $J(W,b)$ decreases monotonically;
then turn dropout back on.

References:

[1] 大树先生. 吴恩达Coursera深度学习课程 DeepLearning.ai 提炼笔记(2-1)– 深度学习的实践方面 (distilled notes on Andrew Ng's Coursera DeepLearning.ai course, 2-1: Practical Aspects of Deep Learning).


