This series only adds personal study notes and supplementary derivations on top of the original course material. If you find errors, corrections are welcome. Having studied Andrew Ng's course, I organized it into text to make review and lookup easier. Since I have been studying English, the series is primarily in English, and I suggest readers rely mainly on the English as well, to lay the groundwork for reading academic papers in related fields later on. - ZJ

Coursera course | deeplearning.ai | 网易云课堂


Please credit the author and source when reposting: ZJ, WeChat official account 「SelfImprovementLab」

Zhihu: https://zhuanlan.zhihu.com/c_147249273

CSDN: http://blog.csdn.net/junjun_zhao/article/details/79115114


3.1 Tuning process

(Subtitle source: 网易云课堂)

Hi, and welcome back. You've seen by now that training neural nets can involve setting a lot of different hyperparameters. Now, how do you go about finding a good setting for these hyperparameters? In this video, I want to share with you some guidelines, some tips for how to systematically organize your hyperparameter tuning process, which hopefully will make it more efficient for you to converge on a good setting of the hyperparameters. One of the painful things about training deep nets is the sheer number of hyperparameters you have to deal with, ranging from the learning rate alpha to the momentum term beta, if you're using momentum, or the hyperparameters for the Adam optimization algorithm, which are beta1, beta2, and epsilon. Maybe you have to pick the number of layers, maybe you have to pick the number of hidden units for the different layers, and maybe you want to use learning rate decay, so you don't just use a single learning rate alpha. And then of course, you might need to choose the mini-batch size.
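
To make this inventory concrete, here is a minimal sketch (not from the lecture) that gathers the hyperparameters just mentioned into a single configuration dict; every name and value below is an illustrative placeholder, not a recommendation:

```python
# Illustrative placeholders only; none of these values are prescribed
# by the lecture, except Adam's common defaults discussed below.
hyperparams = {
    "learning_rate": 0.001,        # alpha
    "momentum_beta": 0.9,          # beta, if using momentum
    "adam_beta1": 0.9,             # Adam optimization algorithm terms
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "num_layers": 4,               # number of layers
    "hidden_units": [64, 64, 32],  # hidden units per layer
    "learning_rate_decay": 0.95,   # if using learning rate decay
    "mini_batch_size": 64,
}
```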

So it turns out, some of these hyperparameters are more important than others. For most learning applications, I would say alpha, the learning rate, is the most important hyperparameter to tune. Other than alpha, a few other hyperparameters I would tend to tune next would be maybe the momentum term beta; say, 0.9 is a good default. I'd also tune the mini-batch size to make sure that the optimization algorithm is running efficiently. Often I also fiddle around with the number of hidden units. Of the ones I've circled in orange, these are really the three that I would consider second in importance to the learning rate alpha. And then third in importance, after fiddling around with the others: the number of layers can sometimes make a huge difference, and so can learning rate decay. And then when using the Adam algorithm, I actually pretty much never tune beta1, beta2, and epsilon. Pretty much I always use 0.9, 0.999, and 10^-8, although you can try tuning those as well if you wish. But hopefully this does give you some rough sense of which hyperparameters might be more important than others: alpha, most important, for sure, followed maybe by the ones I've circled in orange, followed maybe by the ones I've circled in purple. But this isn't a hard and fast rule, and I think other deep learning practitioners may well disagree with me or have different intuitions on these.
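
As a rough summary, this ranking could be encoded as tiers, as in the sketch below (the dict keys reuse the hypothetical names from the previous sketch; the Adam defaults 0.9, 0.999, and 10^-8 are the ones stated in the lecture):

```python
# Tuning priority as described in the lecture, from most to least important.
tuning_priority = [
    ["learning_rate"],                                     # tier 1: alpha
    ["momentum_beta", "mini_batch_size", "hidden_units"],  # tier 2 (orange)
    ["num_layers", "learning_rate_decay"],                 # tier 3 (purple)
]

# Rarely tuned: Adam's defaults usually work well as-is.
adam_defaults = {"beta1": 0.9, "beta2": 0.999, "epsilon": 1e-8}
```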

Now, if you’re trying to tune some set of hyperparameters, how do you select a set of values to explore? In earlier generations of machine learning algorithms, if you had two hyperparameters, which I’m calling hyperparameter one and hyperparameter two here, it was common practice to sample the points in a grid, like so, and systematically explore these values. Here I am placing down a five by five grid. In practice, it could be more or less than the five by five grid, but in this example you try out all 25 points and then pick whichever hyperparameter setting works best. And this practice works okay when the number of hyperparameters is relatively small. In deep learning, what we tend to do, and what I recommend you do instead, is choose the points at random. So go ahead and choose maybe the same number of points, right? 25 points, and then try out the hyperparameters on this randomly chosen set of points. And the reason you do that is that it’s difficult to know in advance which hyperparameters are going to be the most important for your problem. And as you saw in the previous slide, some hyperparameters are actually much more important than others.

So to take an example, let’s say hyperparameter one turns out to be alpha, the learning rate. And to take an extreme example, let’s say that hyperparameter two was that value epsilon that you have in the denominator of the Adam algorithm. So your choice of alpha matters a lot, and your choice of epsilon hardly matters. So if you sample in the grid, then you’ve really tried out only five values of alpha, and you might find that all of the different values of epsilon give you essentially the same answer. So you’ve now trained 25 models and only gotten to try five values for the learning rate alpha, which I think is really important. Whereas in contrast, if you were to sample at random, then you will have tried out 25 distinct values of the learning rate alpha, and therefore you’ll be more likely to find a value that works really well. I’ve explained this example using just two hyperparameters. In practice, you might be searching over many more hyperparameters than these. So if you have, say, three hyperparameters, I guess instead of searching over a square, you’re searching over a cube, where the third dimension is hyperparameter three, and then by sampling within this three-dimensional cube you get to try out a lot more values of each of your three hyperparameters. And in practice you might be searching over even more hyperparameters than three, and sometimes it’s just hard to know in advance which ones turn out to be the really important hyperparameters for your application, and sampling at random rather than in a grid means that you more richly explore the set of possible values for the most important hyperparameters, whatever they turn out to be.
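
A small numeric sketch (my own, not from the lecture) makes the contrast concrete: a 5-by-5 grid trains 25 models but tests only 5 distinct alphas, while 25 random draws test 25 distinct alphas. Sampling on a log scale here is an assumption that anticipates the next video's discussion of choosing the right scale:

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid search: 25 trials, but only 5 distinct values of alpha are tested.
alpha_grid = np.logspace(-4, 0, 5)    # 5 candidate learning rates
eps_grid = np.logspace(-10, -6, 5)    # 5 candidate epsilons
grid_trials = [(a, e) for a in alpha_grid for e in eps_grid]  # 25 points

# Random search: the same 25 trials, and every one uses a distinct alpha.
random_trials = [(10 ** rng.uniform(-4, 0), 10 ** rng.uniform(-10, -6))
                 for _ in range(25)]

print(len({a for a, _ in grid_trials}))    # -> 5 distinct alphas
print(len({a for a, _ in random_trials}))  # -> 25 distinct alphas
```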

When you sample hyperparameters, another common practice is to use a coarse to fine sampling scheme. So let’s say in this two-dimensional example that you sample these points, and maybe you found that this point worked the best, and maybe a few other points around it tended to work really well. Then in the coarse to fine scheme, what you might do is zoom in to a smaller region of the hyperparameters and then sample more densely within this space, maybe again at random, but then focus more resources on searching within this blue square, if you suspect that the best setting of the hyperparameters may be in this region. So after doing a coarse sample of the entire square, that tells you to then focus on a smaller square. You can then sample more densely within this smaller square. So this type of coarse to fine search is also frequently used. And by trying out these different values of the hyperparameters, you can then pick whatever value allows you to do best on your training set objective, or does best on your development set, or whatever you’re trying to optimize in your hyperparameter search process. So I hope this gives you a way to more systematically organize your hyperparameter search process. The two key takeaways are: use random sampling for adequate search, and optionally consider implementing a coarse to fine search process. But there’s even more to hyperparameter search than this. Let’s talk more in the next video about how to choose the right scale on which to sample your hyperparameters.
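
Below is a minimal sketch of one way to implement such a coarse to fine search, assuming two hyperparameters rescaled to the unit square and a hypothetical score_fn standing in for dev-set performance (all names here are my own, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)

def coarse_to_fine(score_fn, n_rounds=3, n_points=25):
    """Sample at random, then zoom in around the best point and resample."""
    center = np.array([0.5, 0.5])   # start with the whole unit square
    radius = 0.5
    best_point, best_score = None, -np.inf
    for _ in range(n_rounds):
        # Random points in a square of side 2*radius around the center.
        points = center + rng.uniform(-radius, radius, size=(n_points, 2))
        for p in points:
            s = score_fn(p)
            if s > best_score:
                best_point, best_score = p, s
        center, radius = best_point, radius / 4  # shrink the search square
    return best_point, best_score

# Hypothetical objective with its optimum at (0.3, 0.3).
best, score = coarse_to_fine(lambda p: -np.sum((p - 0.3) ** 2))
print(best, score)
```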


Key takeaways:

Hyperparameter tuning

  • In earlier generations of machine learning algorithms, with relatively few hyperparameters, it was common to tune them by sampling points on a grid.
  • In deep learning, where there are many more hyperparameters, sample the points at random instead of on a regular grid. The reason is that we cannot know in advance which hyperparameters matter most for a given problem, so testing randomly sampled settings is the more sensible approach: it explores the potential values of the important hyperparameters more richly (see the sketches above).

If a point with good performance is found in some region, narrow the focus to a small area around that point and keep searching there.

References:

[1] 大树先生. 吴恩达 Coursera 深度学习课程 DeepLearning.ai 提炼笔记 (2-3) – 超参数调试 和 Batch Norm


