http://blog.csdn.net/pipisorry/article/details/44119187

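Machine Learning - Andrew Ng course study notes

Advice for Applying Machine Learning (Week 6)
{How to address poor prediction performance on the training set and test set when applying machine learning algorithms}

What to do when a machine learning algorithm performs poorly

However, adding more training data is not always effective! Before choosing what to do, learn how to evaluate the learning algorithm and run diagnostics first; this ends up saving time.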

皮皮blog

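Evaluating a hypothesis

How to tell whether a model is overfitting

Plotting the hypothesis directly can reveal overfitting, but this stops working once there are many features.

Splitting the data to evaluate a hypothesis

First split the dataset into a training set and a test set.

1. If the data are already randomly ordered, just take the first 70% as the training set and the last 30% as the test set.
2. If the data are not randomly ordered, it is better to randomly shuffle the examples before splitting.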
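The two rules above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the function name `train_test_split`, the `seed` argument and the toy data are assumptions made for the example, not part of the course material.

```python
import random

def train_test_split(examples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the examples, then split into train and test parts."""
    shuffled = list(examples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed only for reproducibility
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy dataset of (x, y) pairs, deliberately sorted by x.
data = [(x, 2 * x + 1) for x in range(10)]
train, test = train_test_split(data)
print(len(train), len(test))
```

Shuffling a copy keeps the original order available, which helps when the same split has to be reproduced later.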

Training and testing the learning algorithm

An alternative way to evaluate on the test set is the misclassification error, which is easier to interpret:
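As a rough sketch, the misclassification (0/1) error counts the fraction of test examples whose thresholded prediction disagrees with the label. The 0.5 threshold, the function name and the sample values below are assumptions for illustration.

```python
def misclassification_error(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction differs from the label."""
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    mistakes = 0
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= threshold else 0   # predict 1 when h(x) >= threshold
        if predicted != y:
            mistakes += 1
    return mistakes / len(labels)

# Hypothetical hypothesis outputs h(x) on four test examples and their labels.
h_test = [0.9, 0.2, 0.6, 0.4]
y_test = [1, 0, 0, 1]
print(misclassification_error(h_test, y_test))  # 0.5
```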

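Model selection and train/validation/test sets

Model selection: choosing the features or the regularization parameter.

Different models: different parameters, e.g. the polynomial degree.

Suppose there is one more parameter d (the polynomial degree) that also has to be determined from the data. Different values of d produce many different hypotheses.

Selecting the model using test set error

Choose the model with the lowest test set error (e.g. d = 5).

However, selecting the parameter on the test set causes a problem: choosing the model (the parameter d) on the test set and then evaluating the model on that same test set is unfair, because d was itself fit to the test set. In other words, the test set cannot be used both to choose the degree and to evaluate the hypothesis. Because the degree d of the polynomial was fit to the test set, the hypothesis is likely to do better on this test set than it would on new examples it has not seen before.

Selecting the model using a cross validation set

Split the data into a train set, a cross validation set (also called the validation set), and a test set.

Choose the hypothesis with the lowest cross validation error.

Then use the test set to measure, or estimate, the generalization error of the selected model.

In summary: fit the parameters on the training set to obtain several models, pick the best model with the cross validation set, and evaluate that model on the test set!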
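The whole procedure can be sketched with the standard library alone. Everything specific here is an assumption for illustration: the synthetic quadratic data, the 60/20/20 split, the candidate degrees 1 to 6, and the helper names `solve`, `fit_poly` and `squared_error`. The fit uses the normal equations, which is fine for this toy problem but not a recommendation for real data.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]           # assumes a non-singular system
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                       # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_poly(points, degree):
    """Least-squares polynomial fit via the normal equations (X^T X) theta = X^T y."""
    rows = [[x ** k for k in range(degree + 1)] for x, _ in points]
    ys = [y for _, y in points]
    n = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(n)]
    return solve(xtx, xty)

def squared_error(theta, points):
    """J = 1/(2m) * sum((h(x) - y)^2), without a regularization term."""
    total = 0.0
    for x, y in points:
        h = sum(t * x ** k for k, t in enumerate(theta))
        total += (h - y) ** 2
    return total / (2 * len(points))

# Hypothetical data: a noisy quadratic, generated in random order.
rng = random.Random(0)
points = []
for _ in range(60):
    x = rng.uniform(-1.0, 1.0)
    points.append((x, x ** 2 - 0.5 * x + rng.gauss(0.0, 0.1)))
train, cv, test = points[:36], points[36:48], points[48:]

degrees = range(1, 7)
models = {d: fit_poly(train, d) for d in degrees}              # fit on the train set
cv_errors = {d: squared_error(models[d], cv) for d in degrees}
best_d = min(cv_errors, key=cv_errors.get)                     # select on the cv set
print("chosen degree:", best_d)
print("estimated generalization error:", squared_error(models[best_d], test))
```

The test set is touched exactly once, after the degree has been chosen, so its error is not biased by the selection step.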


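Improving a machine learning algorithm: using the regularization parameter λ to reduce bias and variance

If a learned model keeps performing poorly, the cause is usually a high bias or a high variance problem, in other words underfitting or overfitting.

One way to improve a learning algorithm is to add a regularization term and adjust the regularization parameter to reduce bias and variance.

Distinguishing overfit (high variance) from underfit (high bias)

When model complexity (the polynomial degree d) is too low, the training error is large and the model underfits: bias is high. As complexity increases, the training error generally falls (and within some range the cross validation error falls too), but the model may eventually overfit, at which point the cross validation error rises again: variance is high.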
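The rule of thumb above can be turned into a small helper. This is a heuristic sketch: the `acceptable_error` target and the `gap_ratio` of 2 are hypothetical thresholds chosen for the example, and in practice they depend on the problem.

```python
def diagnose(j_train, j_cv, acceptable_error, gap_ratio=2.0):
    """Rough bias/variance label from training and cross validation error."""
    # High bias: the model cannot even fit the training data well.
    high_bias = j_train > acceptable_error
    # High variance: cross validation error is far above training error.
    high_variance = j_cv > acceptable_error and j_cv > gap_ratio * j_train
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias (underfit)"
    if high_variance:
        return "high variance (overfit)"
    return "no obvious problem"

print(diagnose(0.30, 0.32, acceptable_error=0.1))  # high bias (underfit)
print(diagnose(0.02, 0.30, acceptable_error=0.1))  # high variance (overfit)
```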

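Regularization and Bias_Variance

Effect of the regularization parameter on bias and variance

How to choose the regularization parameter λ automatically

[Machine learning model selection: hyperparameter tuning]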
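One common way to choose λ automatically is to try a grid of values, fit each on the training set, and keep the one with the lowest cross validation error. The sketch below assumes a deliberately tiny model, h(x) = θx with a single parameter, so the regularized fit has the closed form θ = Σxy / (Σx² + λ). The data points and the doubling grid from 0.01 to 10.24 are assumptions for illustration.

```python
def fit_theta(points, lam):
    """Minimize 1/(2m) * [sum((theta*x - y)^2) + lam * theta^2] in closed form."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + lam)

def cost(theta, points):
    """Unregularized squared error, used for the cv and test sets."""
    return sum((theta * x - y) ** 2 for x, y in points) / (2 * len(points))

# Candidate values: 0, then 0.01 doubled repeatedly up to 10.24.
lambdas = [0.0] + [0.01 * 2 ** k for k in range(11)]

# Hypothetical toy data.
train = [(1.0, 1.3), (2.0, 1.7), (3.0, 3.4)]
cv = [(1.5, 1.4), (2.5, 2.4)]
test = [(0.5, 0.6), (3.5, 3.3)]

thetas = {lam: fit_theta(train, lam) for lam in lambdas}     # fit with regularization
cv_errors = {lam: cost(thetas[lam], cv) for lam in lambdas}  # compare without it
best_lam = min(cv_errors, key=cv_errors.get)
print("chosen lambda:", best_lam)
print("test error:", cost(thetas[best_lam], test))
```

The regularization term is used only while fitting; the cross validation and test errors are computed without it so that models trained with different λ remain comparable.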


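Diagnosing and improving machine learning algorithms

Learning Curves

Learning curves: diagnose whether a learning algorithm suffers from high bias (underfit), high variance (overfit), or both.

Note that the error in the figures below is for a regression problem; for a classification problem, the training error may also decrease as the amount of data grows!

The high bias (underfit) case

With high bias, adding more data (starting from the data size at which the model already underfits) does not reduce the error; high bias shows up in both Jcv and Jtrain. The training error eventually approaches the cross validation error, because at least for large m there are too few parameters for so much data.

The high variance (overfit) case

The hallmark of high variance is a large gap between the training error and the cross validation (test) error; adding more data narrows this gap.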
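A learning curve can be computed by training on the first m examples for increasing m and recording both errors. The sketch below is an assumed high bias setup: a straight line fit to y = x² data, with hypothetical helper names `fit_line`, `cost` and `learning_curve`. It starts at m = 2 because a line with an intercept needs at least two points.

```python
import random

def fit_line(points):
    """Closed-form least-squares line y = a + b * x."""
    m = len(points)
    mean_x = sum(x for x, _ in points) / m
    mean_y = sum(y for _, y in points) / m
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0.0:                       # all x identical: fall back to a constant
        return mean_y, 0.0
    b = sum((x - mean_x) * (y - mean_y) for x, y in points) / var_x
    return mean_y - b * mean_x, b

def cost(params, points):
    """J = 1/(2m) * sum((a + b*x - y)^2)."""
    a, b = params
    return sum((a + b * x - y) ** 2 for x, y in points) / (2 * len(points))

def learning_curve(train, cv, start=2):
    """Return (m, J_train on the first m examples, J_cv on the full cv set)."""
    curve = []
    for m in range(start, len(train) + 1):
        subset = train[:m]
        params = fit_line(subset)
        curve.append((m, cost(params, subset), cost(params, cv)))
    return curve

# Hypothetical data: a parabola, which a straight line cannot fit (high bias).
points = [(i / 10, (i / 10) ** 2) for i in range(-20, 21)]
random.Random(0).shuffle(points)
train, cv = points[:30], points[30:]

curve = learning_curve(train, cv)
for m, j_train, j_cv in curve[::7]:
    print(m, round(j_train, 3), round(j_cv, 3))
```

Plotting the second and third columns against m gives the two curves discussed above; J_train is always measured on the same m examples that were used for fitting.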

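Improving a machine learning algorithm

Plotting learning curves shows what is actually wrong with the model, high bias (underfit) or high variance (overfit), so that the matching fix can be applied.

Fixes for high bias (underfit): add features, add polynomial features, decrease λ, use a more complex model.

Fixes for high variance (overfit): get more data, use fewer features, increase λ, use a simpler model. (If a metric is clearly better on the training set than on the test set, the cause may also be a large difference between the training and test data distributions, so check that as well.)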

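What to do about overfitting?

These are the fixes for high variance (overfit): get more data, use fewer features, increase λ.

Overfitting usually arises when there is little data and the model is complex, so the options are:

1 Get more data

or reduce model complexity (items 2-7)

2 Reduce the number of features (feature column sampling). Note: the disadvantage is that throwing away some of the features also throws away some of the information you have about the problem.

3 Add a regularization term (L1 is roughly equivalent to reducing the number of features, while L2 shrinks the parameters to reduce fluctuation; this shrinkage reduces overfitting). If a regularization term is already present, increase λ.

4 Introduce a prior distribution, which should be equivalent to adding a regularization term

5 Add a boosting term to guard against overfitting

6 For neural networks, use early stopping to avoid overtraining: stop as soon as the metric levels off (or judge from the learning curve). The effect is comparable to weight decay (training stops before the weights reach the minimum of the training error). Dropout can also be used to prevent overfitting, and regularization applies to neural networks as well.

7 Gradient noise: add a Gaussian-distributed noise term, which makes training more robust to poor initialization.

[Deep learning: regularization]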
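Early stopping from item 6 is often implemented with a patience counter on the validation loss. The sketch below is one possible version: the function name, the `patience` and `min_delta` parameters, and the loss values are assumptions for illustration and do not correspond to any particular library's API.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has not improved by more than min_delta
    for `patience` consecutive epochs after the best epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch      # new best: remember it
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch                 # no improvement for too long
    return best_epoch, len(val_losses) - 1           # never triggered: ran to the end

# Hypothetical validation losses: improving at first, then drifting upward.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
print(early_stopping_epoch(losses))  # (2, 5)
```

In a real training loop the weights from `best_epoch` would typically be saved and restored after stopping.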

Example: choosing the number of units per layer and the number of hidden layers in a neural network

Note: to fix high bias, e.g. keep increasing the number of features or the number of hidden units in the neural network until you have a low bias classifier.

Practical advice for choosing the architecture or the connectivity pattern of a neural network.

The other decision is the number of hidden layers: using a single hidden layer is a reasonable default, but if you want to choose the number of hidden layers, one thing you can try is to make a train, cross validation, and test split, train neural networks with one, two, or three hidden layers, and see which of them performs best on the cross validation set.


Review

{Poor performance on both the training and test sets suggests a high bias problem; you should increase the complexity of the hypothesis, thereby improving the fit to both the train and test data.}

{The learning algorithm finds parameters to minimize training set error, so the performance should be better on the training set than the test set.}

{A model with high variance will still have high test error, so it will generalize poorly.}

from:http://blog.csdn.net/pipisorry/article/details/44245347

ref:Advice for applying Machine Learning

Andrew Ng-Advice for applying Machine Learning.pdf
