This series only adds personal study notes and supplementary derivations to selected points of the original course. If you spot errors, corrections are welcome. After studying Andrew Ng's course, I organized it into text to make review and lookup easier. Since I am continuously studying English, the series is primarily in English, and I suggest readers rely mainly on the English with the Chinese as an aid, as preparation for reading academic papers in related fields later on. - ZJ

Coursera course | deeplearning.ai | 网易云课堂 (NetEase Cloud Classroom)


Please credit the author and source when reposting: ZJ, WeChat official account 「SelfImprovementLab」

知乎:https://zhuanlan.zhihu.com/c_147249273

CSDN:http://blog.csdn.net/junjun_zhao/article/details/79001761


3.8 Derivatives of activation functions

(Subtitle source: 网易云课堂 NetEase Cloud Classroom)

When you implement back-propagation for your neural network, you need to compute the slope, or derivative, of the activation functions. So let's take a look at our choices of activation functions and how you can compute the slope of each one. Here is the familiar sigmoid activation function. For any given value of z, the function has some slope or derivative corresponding to it: if you draw a little line there, it is the height over the width of a small triangle. So if g(z) is the sigmoid function, then the slope of the function is d/dz g(z), and from calculus we know this is the slope of g(z) at the point z, if you are familiar with calculus and know how to take derivatives.

If you take the derivative of the sigmoid function, it is possible to show that it equals the following formula. Again, I'm not going to go through the calculus steps, but if you're familiar with calculus, feel free to pause the video and prove it yourself: the derivative is equal to g(z) times (1 - g(z)). Let's sanity check that this expression makes sense. First, if z is very large, say z = 10, then g(z) is close to 1, so the formula tells us that d/dz g(z) must be close to 1 × (1 - 1), which is very close to 0. This is indeed correct, because when z is very large the slope is close to 0. Conversely, if z = -10, way out there, then g(z) is close to 0, so the formula tells us d/dz g(z) is close to 0 × (1 - 0), which is also very close to 0, and that is correct as well. Finally, at z = 0 the sigmoid function gives g(z) = 1/2.
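For readers who want the calculus step the video skips, the sigmoid derivative follows in one line from the definition above:

$$g'(z) = \dfrac{d}{dz}\,\dfrac{1}{1+e^{-z}} = \dfrac{e^{-z}}{(1+e^{-z})^{2}} = \dfrac{1}{1+e^{-z}}\cdot\dfrac{e^{-z}}{1+e^{-z}} = g(z)\bigl(1-g(z)\bigr)$$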

So the derivative is equal to 1/2 × (1 - 1/2), which is 1/4, and that turns out to be the correct value of the derivative, or the slope of this function, at z = 0. Finally, just to introduce one more piece of notation: instead of writing d/dz g(z), the shorthand for the derivative is g'(z). In calculus the little dash on top is called "prime", so g'(z) is shorthand for the derivative of the function g with respect to the input variable z. In a neural network we have a = g(z), so this formula also simplifies to a(1 - a). Sometimes in an implementation you might see something like g'(z) = a(1 - a), and that just refers to the observation that this derivative equals a(1 - a). The advantage of this formula is that if you have already computed the value of a, then by using this expression you can very quickly compute the value of the slope g'(z). All right, so that was the sigmoid activation function.
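As a minimal NumPy sketch (not part of the course code; the names sigmoid and sigmoid_prime are just illustrative), reusing the cached activation a to obtain the slope might look like this:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: a = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(a):
    """Slope g'(z) = a * (1 - a), computed from the already-computed activation a."""
    return a * (1.0 - a)

z = np.array([-10.0, 0.0, 10.0])
a = sigmoid(z)
print(sigmoid_prime(a))  # roughly [0, 0.25, 0], matching the sanity checks above
```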

Let's now look at the tanh activation function. Similar to what we had previously, d/dz g(z) is the slope of g(z) at a particular point z. If you look at the formula for the hyperbolic tangent function and you know calculus, you can take the derivative and show that it simplifies to 1 - (tanh(z))², and using the shorthand from before we call this g'(z) again. If you want, you can sanity check that this formula makes sense. For example, if z = 10, tanh(z) is very close to 1 (the function goes from -1 to +1), so according to the formula g'(z) is about 1 - 1², which is close to 0; when z is very large the slope is close to 0. Conversely, if z is very small, say z = -10, then tanh(z) is close to -1, so g'(z) is close to 1 - (-1)² = 1 - 1, which is also close to 0. Finally, at z = 0 we have tanh(z) = 0 and the slope is actually equal to 1, which is the slope of tanh at z = 0. So to summarize: if a = g(z) = tanh(z), then the derivative g'(z) = 1 - a². Once again, if you have already computed the value of a, you can use this formula to very quickly compute the derivative as well.
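Likewise, a small sketch of the tanh derivative computed from the cached activation (again illustrative names, not course code):

```python
import numpy as np

def tanh_prime(a):
    """Slope of tanh expressed through the activation: g'(z) = 1 - a**2."""
    return 1.0 - a ** 2

z = np.array([-10.0, 0.0, 10.0])
a = np.tanh(z)            # NumPy's built-in hyperbolic tangent
print(tanh_prime(a))      # roughly [0, 1, 0], as argued above
```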

Finally, here is how you compute the derivatives for the ReLU and leaky ReLU activation functions. For ReLU, g(z) = max(0, z), so the derivative turns out to be 0 if z is less than 0 and 1 if z is greater than 0; it is technically undefined when z is exactly 0. But if you are implementing this in software, it might not be a hundred percent mathematically correct, yet it works just fine: when z is exactly 0, you can set the derivative to 1 or set it to 0, and it does not really matter. If you are an expert in optimization, technically g' then becomes what is called a sub-gradient of the activation function g(z), which is why gradient descent still works.

But you can think of it this way: the chance of z being exactly 0.000000000 is so small that it almost does not matter what you set the derivative to be when z equals 0. So in practice this is what people implement for the derivative. Finally, if you are training your network with the leaky ReLU activation function, then g(z) = max(0.01z, z), and so g'(z) = 0.01 if z is less than 0 and 1 if z is greater than 0. Once again, the gradient is technically not defined when z is exactly 0, but you can implement a piece of code that sets the derivative g'(z) at z = 0 to either 0.01 or 1; either way it does not really matter, and your code will work just fine. So armed with these formulas, you can compute the slopes or derivatives of your activation functions. Now that we have these building blocks, you are ready to see how to implement gradient descent for your neural network. Let's go on to the next video to see that.
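A possible vectorized sketch of the two piecewise derivatives described above, with the value at z = 0 fixed by convention as the video suggests (illustrative only; alpha = 0.01 matches the leaky ReLU used here):

```python
import numpy as np

def relu_prime(z):
    """ReLU derivative: 1 where z > 0, otherwise 0 (z == 0 set to 0 by convention)."""
    return (z > 0).astype(float)

def leaky_relu_prime(z, alpha=0.01):
    """Leaky ReLU derivative: 1 where z > 0, otherwise alpha (z == 0 again fixed by convention)."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, 0.0, 2.0])
print(relu_prime(z))        # [0. 0. 1.]
print(leaky_relu_prime(z))  # [0.01 0.01 1.  ]
```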


Key points:

sigmoid: $a = \dfrac{1}{1+e^{-z}}$

  • Derivative: $a' = a(1-a)$

tanh: $a = \dfrac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$

  • Derivative: $a' = 1-a^{2}$

ReLU (rectified linear unit): $a = \max(0,z)$

  • Derivative: $a' = 1$ for $z>0$, $0$ for $z<0$ (undefined at $z=0$; set to 0 or 1 in practice)

Leaky ReLU: $a = \max(0.01z,\,z)$

  • Derivative: $a' = 1$ for $z>0$, $0.01$ for $z<0$
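As a quick sanity check of the formulas above (a sketch, not course material), the analytic derivatives can be compared against two-sided finite differences:

```python
import numpy as np

def numeric_derivative(f, z, eps=1e-6):
    """Two-sided finite-difference approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
a_sig, a_tanh = sigmoid(z), np.tanh(z)

# Analytic formulas from the summary vs. numerical approximations.
assert np.allclose(a_sig * (1.0 - a_sig), numeric_derivative(sigmoid, z))
assert np.allclose(1.0 - a_tanh ** 2, numeric_derivative(np.tanh, z))
print("sigmoid and tanh derivative formulas check out")
```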

Choosing an activation function:

Comparing the sigmoid and tanh functions:

Hidden layers: tanh generally performs better than sigmoid, because its range is [-1, +1] and its outputs are distributed around 0 with mean 0, which gives the data flowing from the hidden layer to the next layer a normalizing (zero-mean) effect.

Output layer: for binary classification the output takes values in {0, 1}, so sigmoid is usually chosen.
However, when |z| is large, the gradients of both sigmoid and tanh become very small, so in gradient-based algorithms the updates slow down noticeably in later stages; in practice you want |z| to fall as close to 0 as possible.

ReLU remedies this drawback of the previous two: when z > 0 the gradient is always 1, which speeds up gradient-based training of the network. When z < 0 the gradient is always 0, but in practice this has little impact.

Leaky ReLU guarantees that the gradient is still non-zero when z < 0.

When choosing an activation function, pick ReLU if you are unsure what to use; there is no fixed answer, though, and you should validate the choice against your actual problem on a cross-validation set. A small numerical illustration of these gradient behaviors follows below.
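To make the vanishing-gradient comparison above concrete, here is a tiny illustrative printout (the z values are chosen arbitrarily, not taken from the course):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for z in (0.5, 5.0, 10.0):
    a = sigmoid(z)
    sig_grad = a * (1.0 - a)           # sigmoid slope
    tanh_grad = 1.0 - np.tanh(z) ** 2  # tanh slope
    relu_grad = 1.0 if z > 0 else 0.0  # ReLU slope
    print(f"z={z:5.1f}  sigmoid'={sig_grad:.2e}  tanh'={tanh_grad:.2e}  ReLU'={relu_grad:.1f}")
# At z = 10 the sigmoid and tanh slopes are ~5e-5 and ~8e-9, while the ReLU slope stays at 1.
```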

References:

[1] 大树先生. Notes on Andrew Ng's Coursera Deep Learning (DeepLearning.ai) course (1-3): Shallow Neural Networks.


PS: You are welcome to follow the WeChat official account 「SelfImprovementLab」, which focuses on deep learning, machine learning, and artificial intelligence, and occasionally organizes group check-in activities on early rising, reading, exercise, English, and more.
