This series only adds personal study notes and supplementary derivations on top of the original course material. If you spot any mistakes, corrections are welcome. Having taken Andrew Ng's course, I transcribed it into text to make review and lookup easier. Since I have been studying English, the series is primarily in English, and I suggest readers also rely mainly on the English with the Chinese as a supplement, as preparation for reading academic papers in related fields later on. - ZJ

Coursera course | deeplearning.ai | NetEase Cloud Classroom (网易云课堂)


Please credit the author and source when reposting: ZJ, WeChat official account "SelfImprovementLab"

知乎:https://zhuanlan.zhihu.com/c_147249273

CSDN:http://blog.csdn.net/junjun_zhao/article/details/79080964


1.12 Numerical approximation of gradients

(Subtitle source: NetEase Cloud Classroom)

When you implement back propagation, you'll find that there's a test called gradient checking that can really help you make sure that your implementation of backprop is correct, because sometimes you write all these equations and you're just not 100% sure that you've got all the details right when implementing back propagation. So, in order to build up to gradient checking, let's first talk about how to numerically approximate gradients, and in the next video we'll talk about how you can implement gradient checking to make sure the implementation of backprop is correct.


So let's take the function $f$ and replot it here, and remember this is $f(\theta) = \theta^3$, and let's again start off with some value of theta, let's say $\theta = 1$. Now, instead of just nudging theta to the right to get $\theta + \varepsilon$, we're going to nudge it to the right and nudge it to the left to get $\theta - \varepsilon$ as well as $\theta + \varepsilon$. So this is $1$, this is $1.01$, this is $0.99$, where, again, epsilon is the same as before: $\varepsilon = 0.01$. It turns out that rather than taking this little triangle and computing the height over the width, you can get a much better estimate of the gradient if you take the point $f(\theta - \varepsilon)$ and the point $f(\theta + \varepsilon)$, and instead compute the height over the width of this bigger triangle. For technical reasons which I won't go into, the height over the width of this bigger green triangle gives you a much better approximation to the derivative at $\theta$.

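To make the picture concrete, here is a tiny Python sketch (my own, not course code) that sets up the three points and the big triangle's height and width for $f(\theta)=\theta^3$, $\theta=1$, $\varepsilon=0.01$:

```python
# A minimal sketch (not course code): the points used by the big green triangle.
f = lambda theta: theta ** 3               # f(θ) = θ³ from the lecture

theta, eps = 1.0, 0.01
lo, hi = theta - eps, theta + eps          # 0.99 and 1.01

height = f(hi) - f(lo)                     # height of the big green triangle
width = 2 * eps                            # it spans from θ−ε to θ+ε
print(height, width)                       # ≈ 0.060002, 0.02
```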

And you saw it yourself: taking just this lower triangle in the upper right is as if you have two triangles, right? This one in the upper right and this one in the lower left, and you're kind of taking both of them into account by using this bigger green triangle. So rather than a one-sided difference, you're taking a two-sided difference.


So let's work out the math. This point here is $f(\theta + \varepsilon)$. This point here is $f(\theta - \varepsilon)$. So the height of this big green triangle is $f(\theta + \varepsilon) - f(\theta - \varepsilon)$. And then for the width, this is one $\varepsilon$ and this is another $\varepsilon$, so the width of this green triangle is $2\varepsilon$. So the height over the width is $\dfrac{f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon}$, and this should hopefully be close to $g(\theta)$. So plug in the values, remembering $f(\theta) = \theta^3$: theta plus epsilon is $1.01$, so I take the cube of that, minus the cube of $0.99$, divided by $2 \times 0.01$, that is, $\dfrac{(1.01)^3 - (0.99)^3}{2(0.01)}$. Feel free to pause the video and work this out on a calculator; you should get $3.0001$. Whereas from the previous slide we saw that $g(\theta) = 3\theta^2$, so when $\theta = 1$ this is $3$, and these two values are actually very close to each other. The approximation error is now $0.0001$. Whereas on the previous slide, when we took the one-sided difference, just theta and theta plus epsilon, we had gotten $3.0301$, so the approximation error was $0.03$ rather than $0.0001$.

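Plugging in the numbers from this paragraph, a minimal Python sketch (my own; the names are just for illustration) reproduces both estimates:

```python
# Verify the lecture's arithmetic: two-sided vs. one-sided difference at θ = 1.
f = lambda theta: theta ** 3
g = lambda theta: 3 * theta ** 2           # claimed derivative g(θ) = 3θ²

theta, eps = 1.0, 0.01

two_sided = (f(theta + eps) - f(theta - eps)) / (2 * eps)
one_sided = (f(theta + eps) - f(theta)) / eps

print(two_sided, abs(two_sided - g(theta)))   # ≈ 3.0001, error ≈ 0.0001
print(one_sided, abs(one_sided - g(theta)))   # ≈ 3.0301, error ≈ 0.0301
```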

So with this two-sided difference way of approximating the derivative, you find that this is extremely close to $3$, and this gives you much greater confidence that $g(\theta)$ is probably a correct implementation of the derivative of $f$. When you use this method for gradient checking in back propagation, it turns out to run twice as slow as if you were to use a one-sided difference. It turns out that, in practice, I think it's worth it to use this method, because it's just much more accurate. Here's a little bit of optional theory for those of you who are a bit more familiar with calculus; it's okay if you don't get what I'm about to say. It turns out that the formal definition of the derivative is, for very small values of epsilon, $f'(\theta) = \lim\limits_{\varepsilon \to 0}\dfrac{f(\theta+\varepsilon)-f(\theta-\varepsilon)}{2\varepsilon}$, i.e. the limit of exactly the formula on the right as epsilon goes to $0$. The definition of a limit is something you learned if you took a calculus class, so I won't go into that here. It turns out that for a nonzero value of epsilon, you can show that the error of this approximation is on the order of epsilon squared, $O(\varepsilon^2)$, and remember epsilon is a very small number. So if $\varepsilon = 0.01$, as it is here, then $\varepsilon^2 = 0.0001$. The big-O notation means the error is actually some constant times $\varepsilon^2$, but this is in fact exactly our approximation error, so the big-O constant happens to be $1$.
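For readers who want the "technical reasons" that the video skips, here is the standard Taylor-expansion argument (added here, not shown in the lecture) for where the $O(\varepsilon^2)$ comes from:

```latex
\begin{aligned}
f(\theta+\varepsilon) &= f(\theta) + \varepsilon f'(\theta)
  + \tfrac{\varepsilon^{2}}{2}f''(\theta) + \tfrac{\varepsilon^{3}}{6}f'''(\theta) + O(\varepsilon^{4}) \\
f(\theta-\varepsilon) &= f(\theta) - \varepsilon f'(\theta)
  + \tfrac{\varepsilon^{2}}{2}f''(\theta) - \tfrac{\varepsilon^{3}}{6}f'''(\theta) + O(\varepsilon^{4}) \\
\frac{f(\theta+\varepsilon)-f(\theta-\varepsilon)}{2\varepsilon}
  &= f'(\theta) + \tfrac{\varepsilon^{2}}{6}f'''(\theta) + O(\varepsilon^{4})
\end{aligned}
```

For $f(\theta)=\theta^3$ we have $f'''(\theta)=6$ and all higher derivatives vanish, so the error is exactly $\varepsilon^2 = 0.0001$, which is why the big-O constant "happens to be 1" here.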

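The scaling is easy to check numerically. In this sketch (again my own, same $f(\theta)=\theta^3$), shrinking $\varepsilon$ by 10x shrinks the two-sided error by roughly 100x, but the one-sided error by only 10x:

```python
# Error scaling: the two-sided error falls like ε², the one-sided error like ε.
f = lambda theta: theta ** 3
theta, exact = 1.0, 3.0                    # f'(1) = 3·1² = 3

for eps in (0.1, 0.01, 0.001):
    two_sided = (f(theta + eps) - f(theta - eps)) / (2 * eps)
    one_sided = (f(theta + eps) - f(theta)) / eps
    print(f"eps={eps:<6} two-sided err={abs(two_sided - exact):.1e} "
          f"one-sided err={abs(one_sided - exact):.1e}")
```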

Whereas in contrast, if we were to use the other formula, the one-sided difference, then the error is on the order of epsilon, $O(\varepsilon)$. And again, when epsilon is a number less than $1$, epsilon is actually much bigger than epsilon squared, which is why that formula is a much less accurate approximation than the formula on the left. That is why, when doing gradient checking, we'd rather use the two-sided difference, computing $f(\theta+\varepsilon) - f(\theta-\varepsilon)$ and dividing by $2\varepsilon$, rather than the one-sided difference, which is less accurate. If you didn't understand my last two comments, don't worry about it; that's really more for those of you who are a bit more familiar with calculus and with numerical approximations. The takeaway is that this two-sided difference formula is much more accurate, and so that's what we're going to use when we do gradient checking in the next video. So you've seen how, by taking a two-sided difference, you can numerically verify whether or not a function $g(\theta)$ that someone else gives you is a correct implementation of the derivative of a function $f$. Let's now see how we can use this to verify whether or not your back propagation implementation is correct, or whether there might be a bug in there that you need to go in and tease out.

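As a preview of the next video, a hypothetical one-parameter checker along these lines (a sketch under my own naming, not code from the course) could compare a candidate derivative $g$ against the two-sided estimate:

```python
# Hypothetical helper: does g look like the derivative of f at θ?
def check_derivative(f, g, theta, eps=1e-4, tol=1e-6):
    numeric = (f(theta + eps) - f(theta - eps)) / (2 * eps)
    return abs(numeric - g(theta)) < tol

f = lambda t: t ** 3
print(check_derivative(f, lambda t: 3 * t ** 2, theta=1.0))  # True: correct derivative
print(check_derivative(f, lambda t: 2 * t, theta=1.0))       # False: buggy derivative
```

The real gradient check in the next video makes the same comparison component-wise over all the parameters of the network.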


Key takeaways:

Numerical approximation of gradients

Use the two-sided difference method to approximate the derivative.

As shown above, the error of the two-sided approximation is 0.0001; compared with the one-sided approximation error of 0.03, it is far more accurate.

Formulas involved:

  • Two-sided derivative:

$f'(\theta) = \lim\limits_{\varepsilon \to 0}\dfrac{f(\theta+\varepsilon)-f(\theta-\varepsilon)}{2\varepsilon}$

Error: $O(\varepsilon^{2})$

  • One-sided derivative:

$f'(\theta) = \lim\limits_{\varepsilon \to 0}\dfrac{f(\theta+\varepsilon)-f(\theta)}{\varepsilon}$

Error: $O(\varepsilon)$

References:

[1] 大树先生. 吴恩达 Coursera 深度学习课程 DeepLearning.ai 提炼笔记 (2-1) – 深度学习的实践方面 (distilled notes for Andrew Ng's Coursera DeepLearning.ai course, 2-1: Practical aspects of deep learning).


PS: You are welcome to scan the QR code and follow the official account "SelfImprovementLab", which focuses on deep learning, machine learning, and artificial intelligence, and occasionally organizes group check-in activities for early rising, reading, exercise, English, and more.
