【中英】【吴恩达课后测验】Course 2 - 改善深层神经网络 - 第二周测验

第2周测验-优化算法

当输入从第八个mini-batch的第七个的例子的时候，你会用哪种符号表示第三层的激活？
- 【★】 a^{[3]{8}(7)}
注：[i]{j}(k)上标表示 第i层，第j小块，第k个示例
关于mini-batch的说法哪个是正确的？
- 【】在不同的mini-batch下，不需要显式地进行循环，就可以实现mini-batch梯度下降，从而使算法同时处理所有的数据（矢量化）。
- 【】使用mini-batch梯度下降训练的时间（一次训练完整个训练集）比使用梯度下降训练的时间要快。
- 【★】mini-batch梯度下降（在单个mini-batch上计算）的一次迭代快于梯度下降的迭代。
注意：矢量化不适用于同时计算多个mini-batch。
为什么最好的mini-batch的大小通常不是1也不是m，而是介于两者之间？
- 【★】如果mini-batch大小为1，则会失去mini-batch示例中矢量化带来的好处。
- 【★】如果mini-batch的大小是m，那么你会得到批量梯度下降，这需要在进行训练之前对整个训练集进行处理。
如果你的模型的成本 J 随着迭代次数的增加，绘制出来的图如下，那么：
- 【★】如果你使用的是mini-batch梯度下降，这看起来是可以接受的。但是如果你使用的是批量梯度下降，那么你的模型就有问题。
注意：使用mini-batch梯度下降会有一些振荡，因为某些mini-batch中可能含有噪音较大的样本。而批量梯度下降在学习率合适时，每次迭代都应使 J 下降。
假设一月的前三天卡萨布兰卡的气温是一样的：
一月第一天: θ_1 = 10

一月第二天: θ_2 = 10

假设您使用 β = 0.5 的指数加权平均来跟踪温度：v_0 = 0，v_t = βv_{t-1} + (1 - β)θ_t。设 v_2 是在没有偏差修正的情况下计算的第2天后的值，v^{corrected}_2 是您使用偏差修正计算的值。这两个值分别是多少？
- 【★】 v_2 = 7.5, v^{corrected}_2 = 10
下面哪一个不是比较好的学习率衰减方法？
- 【★】α = e^t * α_0
请注意：这会使得学习率出现爆炸，而没有衰减。
您在伦敦温度数据集上使用指数加权平均值，并使用以下公式来追踪温度：v_t = βv_{t-1} + (1 - β)θ_t。下面的红线是使用 β = 0.9 计算的。当你改变 β 时，你的红色曲线会怎样变化？
- 【★】增加β会使红线稍微向右移动。
- 【★】减少β会在红线内产生更多的振荡。
看一下这个图:

这些图是由梯度下降产生的; 具有动量梯度下降（β= 0.5）和动量梯度下降（β= 0.9）。哪条曲线对应哪种算法？
- 【★】（1）是梯度下降。（2）是动量梯度下降（β值比较小）。（3）是动量梯度下降（β比较大）
假设在一个深度网络中，批量梯度下降花费了太多时间来寻找使成本函数 J(W[1],b[1],…,W[L],b[L]) 取得较小值的参数值。以下哪些方法可以帮助找到使 J 值较小的参数值？
- 【★】尝试使用 Adam 算法
- 【★】尝试对权重进行更好的随机初始化
- 【★】尝试调整学习率α
- 【★】尝试mini-batch梯度下降
- 【】尝试把权值初始化为0
关于Adam算法，下列哪一个陈述是错误的？
- 【★】Adam应该用于批梯度计算，而不是用于mini-batch。
注: Adam 既可用于批量梯度下降，也可用于mini-batch梯度下降。

Week 2 Quiz - Optimization algorithms

Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th minibatch?
- a^[3]{8}(7)
Note: [i]{j}(k) superscript means i-th layer, j-th minibatch, k-th example
Which of these statements about mini-batch gradient descent do you agree with?
- [ ] You should implement mini-batch gradient descent without an explicit for-loop over different mini-batches, so that the algorithm processes all mini-batches at the same time (vectorization).
- [ ] Training one epoch (one pass through the training set) using mini-batch gradient descent is faster than training one epoch using batch gradient descent.
- [x] One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.
Note: Vectorization does not extend to computing several mini-batches at the same time.
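To make the distinction between one iteration and one epoch concrete, here is a minimal sketch of splitting a training set into mini-batches. The helper name `make_mini_batches`, the training-set size of 1000, and the batch size of 64 are assumptions chosen for illustration, not values from the quiz.

```python
import random

def make_mini_batches(examples, batch_size, seed=0):
    """Shuffle the examples and split them into consecutive mini-batches."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

m = 1000  # hypothetical training-set size
batches = make_mini_batches(range(m), batch_size=64)

# One epoch still needs an explicit loop: each mini-batch can be vectorized
# internally, but the mini-batches are processed one after another.
updates_per_epoch = 0
for batch in batches:
    # a real implementation would run forward/backward propagation on `batch` here
    updates_per_epoch += 1

print(len(batches), len(batches[-1]), updates_per_epoch)
```

Each pass through the loop body touches only one mini-batch, which is why a single iteration is cheaper than a batch gradient descent iteration over all m examples.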
Why is the best mini-batch size usually not 1 and not m, but instead something in-between?
- If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch.
- If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress.
Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like this:
- If you’re using mini-batch gradient descent, this looks acceptable. But if you’re using batch gradient descent, something is wrong.
Note: There will be some oscillations when you're using mini-batch gradient descent, since some mini-batches may contain noisier examples than others. With batch gradient descent and a suitably small learning rate, however, J should decrease on every iteration.
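The difference between the two cost curves can be reproduced on a toy problem. This is a sketch under assumed settings: a one-parameter linear model, synthetic noisy data, and a learning rate of 0.05, none of which come from the quiz.

```python
import random

rng = random.Random(0)
xs = [i / 10 for i in range(1, 21)]
ys = [3.0 * x + rng.gauss(0, 1) for x in xs]  # synthetic noisy targets

def cost(w, xb, yb):
    return sum((w * x - y) ** 2 for x, y in zip(xb, yb)) / len(xb)

def grad(w, xb, yb):
    return sum(2 * (w * x - y) * x for x, y in zip(xb, yb)) / len(xb)

def train(batch_size, epochs=20, lr=0.05):
    """Return the cost recorded at every iteration (one per parameter update)."""
    w, costs = 0.0, []
    for _ in range(epochs):
        for start in range(0, len(xs), batch_size):
            xb, yb = xs[start:start + batch_size], ys[start:start + batch_size]
            costs.append(cost(w, xb, yb))  # cost on the data used for this update
            w -= lr * grad(w, xb, yb)
    return costs

def is_non_increasing(values):
    return all(b <= a for a, b in zip(values, values[1:]))

batch_costs = train(batch_size=len(xs))  # batch gradient descent
mini_costs = train(batch_size=5)         # mini-batch gradient descent
print(is_non_increasing(batch_costs), is_non_increasing(mini_costs))
```

The batch curve is measured on the same data at every step, so it only goes down. The mini-batch curve is measured on a different subset each time, so it jumps around even while the model improves.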
Suppose the temperatures in Casablanca over the first three days of January are the same:

Jan 1st: θ_1 = 10

Jan 2nd: θ_2 = 10

Say you use an exponentially weighted average with β = 0.5 to track the temperature: v_0 = 0, v_t = βv_{t−1} + (1 − β)θ_t. Let v_2 be the value computed after day 2 without bias correction, and v^corrected_2 the value computed with bias correction. What are these values?
- v_2 = 7.5, v^corrected_2 = 10
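The answer can be checked by applying the recurrence directly. The sketch below assumes the usual bias correction v_t / (1 − β^t); the function name `ewa` is chosen for illustration.

```python
def ewa(thetas, beta):
    """Return (v_t, bias-corrected v_t) for each step t = 1, 2, ..."""
    v = 0.0  # v_0 = 0
    history = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        corrected = v / (1 - beta ** t)  # bias correction
        history.append((v, corrected))
    return history

history = ewa([10, 10], beta=0.5)
v2, v2_corrected = history[-1]
print(v2, v2_corrected)  # 7.5 10.0
```

Step by step: v_1 = 0.5 · 0 + 0.5 · 10 = 5, then v_2 = 0.5 · 5 + 0.5 · 10 = 7.5, and the corrected value is 7.5 / (1 − 0.5²) = 10.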
Which of these is NOT a good learning rate decay scheme? Here, t is the epoch number.
- α = e^t * α_0
Note: This will explode the learning rate rather than decay it.
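A quick comparison shows why this scheme is the odd one out. The two decaying schedules below are common forms, used here for contrast. The initial rate `alpha0 = 0.1`, the `decay_rate`, and the `base` are assumed values, not ones given in the quiz.

```python
import math

alpha0 = 0.1  # assumed initial learning rate

def exploding(t):
    """The scheme flagged in the quiz: grows exponentially with the epoch t."""
    return math.exp(t) * alpha0

def inverse_decay(t, decay_rate=1.0):
    """alpha = alpha0 / (1 + decay_rate * t)"""
    return alpha0 / (1 + decay_rate * t)

def exponential_decay(t, base=0.95):
    """alpha = base**t * alpha0, with base < 1"""
    return (base ** t) * alpha0

for t in range(4):
    print(t, exploding(t), inverse_decay(t), exponential_decay(t))
```

Only the first schedule increases with t; the other two shrink the learning rate as training proceeds.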
You use an exponentially weighted average on the London temperature dataset. You use the following to track the temperature: v_t = βv_{t−1} + (1 − β)θ_t. The red line below was computed using β = 0.9. What would happen to your red curve as you vary β? (Check the two that apply)
- Increasing β will shift the red line slightly to the right.
- Decreasing β will create more oscillation within the red line.
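Both effects can be measured on synthetic data. The sketch below is an illustration under assumptions: the temperature series is a made-up noisy sine wave, not the London dataset, and `roughness` and `days_to_half` are ad hoc measures of oscillation and lag.

```python
import math
import random

def ewa_series(thetas, beta):
    v, out = 0.0, []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return out

def roughness(series):
    """Mean absolute day-to-day change: larger means more oscillation."""
    return sum(abs(b - a) for a, b in zip(series, series[1:])) / (len(series) - 1)

def days_to_half(beta):
    """Days the average needs to cover half of a sudden unit jump in the data."""
    series = ewa_series([1.0] * 200, beta)
    return next(day for day, v in enumerate(series, start=1) if v >= 0.5)

rng = random.Random(0)
temps = [10 + 10 * math.sin(2 * math.pi * d / 365) + rng.gauss(0, 3) for d in range(365)]

for beta in (0.5, 0.9, 0.98):
    print(beta, round(roughness(ewa_series(temps, beta)), 3), days_to_half(beta))
```

A larger β averages over more days, so the curve is smoother but reacts later, which shows up as a shift to the right. A smaller β follows the raw data more closely and so oscillates more.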
Consider this figure:

These plots were generated with gradient descent; with gradient descent with momentum (β = 0.5) and gradient descent with momentum (β = 0.9). Which curve corresponds to which algorithm?

- (1) is gradient descent. (2) is gradient descent with momentum (small β). (3) is gradient descent with momentum (large β)
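The damping effect of momentum can be sketched on an elongated quadratic bowl. The cost function, starting point, and learning rate below are assumptions for illustration. The update uses the form v = βv + (1 − β)·gradient, so β = 0 reduces to plain gradient descent.

```python
def run(beta, steps=30, lr=0.09):
    """Minimize J(x, y) = 0.5 * (x**2 + 20 * y**2); beta = 0 is plain gradient descent."""
    x, y = 10.0, 1.0
    vx, vy = 0.0, 0.0
    ys = [y]
    for _ in range(steps):
        gx, gy = x, 20.0 * y  # gradient of J at (x, y)
        vx = beta * vx + (1 - beta) * gx
        vy = beta * vy + (1 - beta) * gy
        x -= lr * vx
        y -= lr * vy
        ys.append(y)
    return ys

def sign_flips(values):
    """Count how often the steep coordinate jumps across the valley floor."""
    return sum(1 for a, b in zip(values, values[1:]) if a * b < 0)

for beta in (0.0, 0.5, 0.9):
    print(beta, sign_flips(run(beta)))
```

Plain gradient descent crosses the valley on every step, while larger β values average out the back-and-forth component and cross it less often. This matches the ordering of the three curves in the answer.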
Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function J(W[1],b[1],…,W[L],b[L]). Which of the following techniques could help find parameter values that attain a small value for J? (Check all that apply)
- [x] Try using Adam
- [x] Try better random initialization for the weights
- [x] Try tuning the learning rate α
- [x] Try mini-batch gradient descent
- [ ] Try initializing all the weights to zero
Which of the following statements about Adam is False?
- Adam should be used with batch gradient computations, not with mini-batches.
Note: Adam can be used with both batch and mini-batch gradient computations.
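As a rough illustration that the Adam update does not depend on how the gradient was computed, the sketch below fits a one-parameter model with the same update rule using either the full batch or mini-batches of 5. The toy data, the function name `adam_fit`, and the hyperparameter values (learning rate 0.05, β1 = 0.9, β2 = 0.999, ε = 1e-8) are assumptions for illustration.

```python
import math

def adam_fit(xs, ys, batch_size, epochs=500, lr=0.05,
             beta1=0.9, beta2=0.999, eps=1e-8):
    """Fit y ≈ w * x by minimizing mean squared error with Adam."""
    w, m, v, t = 0.0, 0.0, 0.0, 0
    for _ in range(epochs):
        for start in range(0, len(xs), batch_size):
            xb = xs[start:start + batch_size]
            yb = ys[start:start + batch_size]
            grad = sum(2 * (w * x - y) * x for x, y in zip(xb, yb)) / len(xb)
            t += 1
            m = beta1 * m + (1 - beta1) * grad         # momentum-like first moment
            v = beta2 * v + (1 - beta2) * grad * grad  # RMSprop-like second moment
            m_hat = m / (1 - beta1 ** t)               # bias correction
            v_hat = v / (1 - beta2 ** t)
            w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

xs = [i / 10 for i in range(1, 21)]
ys = [3.0 * x for x in xs]  # noise-free toy data with true slope 3

w_batch = adam_fit(xs, ys, batch_size=len(xs))  # batch gradients
w_mini = adam_fit(xs, ys, batch_size=5)         # mini-batch gradients
print(w_batch, w_mini)
```

The only thing that changes between the two calls is which examples feed each gradient; the Adam update itself is identical.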
