Learning Curves for Machine Learning

Diagnose Bias and Variance to Reduce Error

When building machine learning models, we want to keep error as low as possible. Two major sources of error are bias and variance. If we managed to reduce these two, then we could build more accurate models.

But how do we diagnose bias and variance in the first place? And what actions should we take once we’ve detected something?

In this post, we’ll learn how to answer both these questions using learning curves. We’ll work with a real world data set and try to predict the electrical energy output of a power plant.

We’ll generate learning curves while trying to predict the electrical energy output of a power plant. Image source: Pexels.

Some familiarity with scikit-learn and machine learning theory is assumed. If you don’t frown when I say cross-validation or supervised learning, then you’re good to go. If you’re new to machine learning and have never tried scikit, a good place to start is this blog post.

We begin with a brief introduction to bias and variance.

The bias-variance trade-off

In supervised learning, we assume there’s a real relationship between feature(s) and target, and we estimate this unknown relationship with a model. Provided the assumption is true, there really is a model, which we’ll call $f$, that perfectly describes the relationship between features and target.

In practice, $f$ is almost always completely unknown, and we try to estimate it with a model $\hat{f}$ (notice the slight difference in notation between $f$ and $\hat{f}$). We use a certain training set and get a certain $\hat{f}$. If we use a different training set, we are very likely to get a different $\hat{f}$. As we keep changing training sets, we get different outputs for $\hat{f}$. The amount by which $\hat{f}$ varies as we change training sets is called variance.

To estimate the true $f$, we use different methods, like linear regression or random forests. Linear regression, for instance, assumes linearity between features and target. For most real-life scenarios, however, the true relationship between features and target is complicated and far from linear. Simplifying assumptions give bias to a model. The more erroneous the assumptions with respect to the true relationship, the higher the bias, and vice versa.

Generally, a model $\hat{f}$ will have some error when tested on some test data. It can be shown mathematically that both bias and variance can only add to a model’s error. We want a low error, so we need to keep both bias and variance at their minimum. However, that’s not quite possible. There’s a trade-off between bias and variance.

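For reference, the standard decomposition behind that statement can be written as follows. It assumes squared-error loss and a target of the form $y = f(x) + \varepsilon$, where $\varepsilon$ is zero-mean noise with variance $\sigma^2$; the symbols $\varepsilon$ and $\sigma^2$ are notation introduced here, not in the post, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Every term on the right is non-negative, which is one way to see why bias and variance can only add to the error.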

A low-bias method fits training data very well. If we change training sets, we’ll get significantly different models $\hat{f}$.

You can see that a low-bias method captures most of the differences (even the minor ones) between the different training sets. $\hat{f}$ varies a lot as we change training sets, and this indicates high variance.

The less biased a method, the greater its ability to fit data well. The greater this ability, the higher the variance. Hence, the lower the bias, the greater the variance.

The reverse also holds: the greater the bias, the lower the variance. A high-bias method builds simplistic models that generally don’t fit training data well. As we change training sets, the models $\hat{f}$ we get from a high-bias algorithm are, generally, not very different from one another.

If $\hat{f}$ doesn’t change too much as we change training sets, the variance is low, which supports our point: the greater the bias, the lower the variance.

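To make the trade-off concrete, here is a small simulation sketch. It is not from the post: the "true" function `true_f`, the noise level, and the choice of polynomial degrees 1 and 9 are all assumptions picked for illustration. Each simulated training set uses the same inputs with freshly drawn noise, and we compare how far the average fitted model is from the truth (squared bias) with how much the fitted models scatter (variance).

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical "true" relationship, chosen only for illustration.
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)        # inputs shared by every simulated training set
x_test = np.linspace(0.05, 0.95, 50)   # fixed points where the fitted models are compared

def predictions(degree, n_sets=200, noise=0.3):
    """Fit one polynomial per simulated training set and collect its predictions."""
    preds = np.empty((n_sets, x_test.size))
    for i in range(n_sets):
        y_train = true_f(x_train) + rng.normal(0, noise, x_train.size)
        model = Polynomial.fit(x_train, y_train, degree)
        preds[i] = model(x_test)
    return preds

results = {}
for degree in (1, 9):
    preds = predictions(degree)
    bias_sq = float(np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    results[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

Under these assumptions the straight line should show the larger squared bias and the degree-9 polynomial the larger variance, mirroring the argument above.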

Mathematically, it’s clear why we want low bias and low variance. As mentioned above, bias and variance can only add to a model’s error. From a more intuitive perspective though, we want low bias to avoid building a model that’s too simple. In most cases, a simple model performs poorly on training data, and it’s extremely likely to repeat the poor performance on test data.

Similarly, we want low variance to avoid building an overly complex model. Such a model fits all the data points in the training set almost perfectly. Training data, however, generally contains noise and is only a sample from a much larger population. An overly complex model captures that noise. And when tested on out-of-sample data, the performance is usually poor. That’s because the model learns the sample training data too well. It knows a lot about something and little about anything else.

In practice, however, we need to accept a trade-off. We can’t have both low bias and low variance, so we want to aim for something in the middle.

We’ll try to build some practical intuition for this trade-off as we generate and interpret learning curves below.

Learning curves – the basic idea

Let’s say we have some data and split it into a training set and validation set. We take one single instance (that’s right, one!) from the training set and use it to estimate a model. Then we measure the model’s error on the validation set and on that single training instance. The error on the training instance will be 0, since it’s quite easy to perfectly fit a single data point. The error on the validation set, however, will be very large. That’s because the model is built around a single instance, and it almost certainly won’t be able to generalize accurately on data it hasn’t seen before.

Now let’s say that instead of one training instance, we take ten and repeat the error measurements. Then we take fifty, one hundred, five hundred, until we use our entire training set. The error scores will vary more or less as we change the training set.

We thus have two error scores to monitor: one for the validation set, and one for the training sets. If we plot the evolution of the two error scores as training sets change, we end up with two curves. These are called learning curves.

In a nutshell, a learning curve shows how error changes as the training set size increases. The diagram below should help you visualize the process described so far. In the training set column you can see that we constantly increase the size of the training sets. This causes a slight change in our models $\hat{f}$.

In the first row, where n = 1 (n is the number of training instances), the model fits that single training data point perfectly. However, the very same model fits a validation set of 20 different data points really badly. So the model’s error is 0 on the training set, but much higher on the validation set.

As we increase the training set size, the model can no longer fit the training set perfectly. So the training error becomes larger. However, the model is trained on more data, so it manages to fit the validation set better. Thus, the validation error decreases. To remind you, the validation set stays the same across all three cases.

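The procedure described so far can be sketched by hand in a few lines. This is only an illustration under stated assumptions: the data is synthetic (a noisy linear target invented here), the model is an ordinary least-squares fit via NumPy, and the training sizes are arbitrary. The validation set is fixed once and reused for every training size.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in data (an assumption for illustration): a noisy linear target.
n_total = 1000
X = rng.normal(size=(n_total, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=n_total)

# Hold out a fixed validation set; it stays the same for every training size.
X_train, y_train = X[:800], y[:800]
X_val, y_val = X[800:], y[800:]

def mse(X_part, y_part, coefs):
    """Mean squared error of a linear model with intercept."""
    X1 = np.column_stack([np.ones(len(X_part)), X_part])
    return float(np.mean((X1 @ coefs - y_part) ** 2))

sizes = [1, 10, 50, 100, 500, 800]
train_errors, val_errors = [], []
for n in sizes:
    # Least-squares fit (with intercept) on the first n training instances.
    X1 = np.column_stack([np.ones(n), X_train[:n]])
    coefs, *_ = np.linalg.lstsq(X1, y_train[:n], rcond=None)
    train_errors.append(mse(X_train[:n], y_train[:n], coefs))
    val_errors.append(mse(X_val, y_val, coefs))

for n, tr, va in zip(sizes, train_errors, val_errors):
    print(f"n = {n:>3}: training MSE = {tr:8.3f}, validation MSE = {va:8.3f}")
```

The two lists, `train_errors` and `val_errors`, are exactly the two series a learning curve plots against `sizes`.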

If we plotted the error scores for each training size, we’d get two learning curves looking similar to these:

Learning curves give us an opportunity to diagnose bias and variance in supervised learning models. We’ll see how that’s possible in what follows.

Introducing the data

The learning curves plotted above are idealized for teaching purposes. In practice, however, they usually look significantly different. So let’s move the discussion into a practical setting by using some real-world data.

We’ll try to build regression models that predict the hourly electrical energy output of a power plant. The data we use come from Turkish researchers Pınar Tüfekci and Heysem Kaya, and can be downloaded from here. As the data is stored in a .xlsx file, we use pandas’ read_excel() function to read it in:

import pandas as pd

# Reading .xlsx files requires an Excel engine such as openpyxl to be installed
electricity = pd.read_excel('Folds5x2_pp.xlsx')
print(electricity.info())
electricity.head(3)

	AT	V	AP	RH	PE
0	14.96	41.76	1024.07	73.17	463.26
1	25.18	62.96	1020.04	59.08	444.37
2	5.11	39.40	1012.16	92.14	488.56

Let’s quickly decipher each column name:

Abbreviation	Full name
AT	Ambient Temperature
V	Exhaust Vacuum
AP	Ambient Pressure
RH	Relative Humidity
PE	Electrical Energy Output

The PE column is the target variable, and it describes the net hourly electrical energy output. All the other variables are potential features, and the values for each are actually hourly averages (not net values, like for PE).

The electricity is generated by gas turbines, steam turbines, and heat recovery steam generators. According to the documentation of the data set, the vacuum level has an effect on steam turbines, while the other three variables affect the gas turbines. Consequently, we’ll use all of the feature columns in our regression models.

At this step we’d normally put aside a test set, explore the training data thoroughly, remove any outliers, measure correlations, etc. For teaching purposes, however, we’ll assume that’s already done and jump straight to generating some learning curves. Before we start, it’s worth noting that there are no missing values. Also, the numbers are unscaled, but we’ll avoid using models that have problems with unscaled data.

Deciding upon the training set sizes

Let’s first decide what training set sizes we want to use for generating the learning curves.

The minimum value is 1. The maximum is given by the number of instances in the training set. Our data set has 9568 instances in total, so at first glance the maximum value is 9568.

However, we haven’t yet put aside a validation set. We’ll do that using an 80:20 ratio, ending up with a training set of 7654 instances (80%), and a validation set of 1914 instances (20%). Given that our training set will have 7654 instances, the maximum value we can use to generate our learning curves is 7654.

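A quick arithmetic check of the 80:20 figures. The sketch assumes the split comes from unshuffled 5-fold cross-validation, which is what the post uses later; under that scheme the data is cut into five consecutive chunks and the first `n_instances % n_splits` chunks receive one extra row. The variable names are introduced here for illustration.

```python
n_instances = 9568
n_splits = 5

# Consecutive chunk sizes for unshuffled k-fold cross-validation.
base, extra = divmod(n_instances, n_splits)
chunk_sizes = [base + 1 if i < extra else base for i in range(n_splits)]

validation_size = chunk_sizes[0]                # the first split validates on the first chunk
training_size = n_instances - validation_size   # everything else is available for training

print(chunk_sizes)       # [1914, 1914, 1914, 1913, 1913]
print(training_size)     # 7654
print(validation_size)   # 1914
```

So 7654 is the largest training set size that can be requested when the first split is the reference.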

In our case, we use these six sizes:

train_sizes = [1, 100, 500, 2000, 5000, 7654]

An important thing to be aware of is that for each specified size a new model is trained. If you’re using cross-validation, which we’ll do in this post, k models will be trained for each training size (where k is given by the number of folds used for cross-validation). To save code running time, it’s good practice to limit yourself to 5-10 training sizes.

The learning_curve() function from scikit-learn

We’ll use the learning_curve() function from the scikit-learn library to generate a learning curve for a regression model. There’s no need on our part to put aside a validation set because learning_curve() will take care of that.

In the code cell below, we:

Do the required imports from sklearn.
Declare the features and the target.
Use learning_curve() to generate the data needed to plot a learning curve. The function returns a tuple containing three elements: the training set sizes, and the error scores on both the validation sets and the training sets. Inside the function, we use the following parameters:
- estimator — indicates the learning algorithm we use to estimate the true model;
- X — the data containing the features;
- y — the data containing the target;
- train_sizes — specifies the training set sizes to be used;
- cv — determines the cross-validation splitting strategy (we’ll discuss this immediately);
- scoring — indicates the error metric to use; the intention is to use the mean squared error (MSE) metric, but that’s not a possible parameter for scoring; we’ll use the nearest proxy, negative MSE, and we’ll just have to flip signs later on.

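A minimal sketch of the call described by the steps above, using the parameters listed. To keep it runnable without the .xlsx file, it builds a synthetic stand-in DataFrame called `demo`; that DataFrame, `demo_sizes`, and the `_demo` suffixes are assumptions introduced here. With the real data you would pass `electricity[features]`, `electricity[target]`, and the `train_sizes` list defined earlier.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the `electricity` DataFrame (an assumption, so the sketch
# runs on its own); only the column names mirror the real data.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 4)), columns=['AT', 'V', 'AP', 'RH'])
demo['PE'] = 450 - 2 * demo['AT'] - 0.3 * demo['V'] + rng.normal(scale=4, size=1000)

features = ['AT', 'V', 'AP', 'RH']
target = 'PE'
demo_sizes = [1, 100, 500, 800]   # 800 is the largest size 5-fold CV allows for 1000 rows

sizes_used, train_scores_demo, validation_scores_demo = learning_curve(
    estimator=LinearRegression(),
    X=demo[features],
    y=demo[target],
    train_sizes=demo_sizes,
    cv=5,
    scoring='neg_mean_squared_error',
)

# One row per training set size, one column per cross-validation split.
print(train_scores_demo.shape)
print(validation_scores_demo.shape)
```

The scores come back as negative MSE values, which is why the signs have to be flipped before plotting.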

We already know what’s in train_sizes. Let’s inspect the other two variables to see what learning_curve() returned:

print('Training scores:\n\n', train_scores)
print('\n', '-' * 70) # separator to make the output easy to read
print('\nValidation scores:\n\n', validation_scores)

Since we specified six training set sizes, you might have expected six values for each kind of score. Instead, we got six rows for each, and every row has five error scores.

This happens because learning_curve() runs a k-fold cross-validation under the hood, where the value of k is given by what we specify for the cv parameter.

In our case, cv = 5, so there will be five splits. For each split, an estimator is trained for every training set size specified. Each column in the two arrays above designates a split, and each row corresponds to a training set size. Below is a table for the training error scores to help you understand the process better:

Training set size (index)	Split1	Split2	Split3	Split4	Split5
1	0	0	0	0	0
100	-19.71230701	-18.31492642	-18.31492642	-18.31492642	-18.31492642
500	-18.14420459	-19.63885072	-19.63885072	-19.63885072	-19.63885072
2000	-21.53603444	-20.18568787	-19.98317419	-19.98317419	-19.98317419
5000	-20.47708899	-19.93364211	-20.56091569	-20.4150839	-20.4150839
7654	-20.98565335	-20.63006094	-21.04384703	-20.63526811	-20.52955609

To plot the learning curves, we need only a single error score per training set size, not 5. For this reason, in the next code cell we take the mean value of each row and also flip the signs of the error scores (as discussed above).

train_scores_mean = -train_scores.mean(axis = 1)
validation_scores_mean = -validation_scores.mean(axis = 1)

print('Mean training scores\n\n', pd.Series(train_scores_mean, index = train_sizes))
print('\n', '-' * 20) # separator
print('\nMean validation scores\n\n', pd.Series(validation_scores_mean, index = train_sizes))

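To see the averaging step on concrete numbers, the sketch below copies the training scores from the table above into an array and applies the same operation. The zeros of the first row are written as `-0.0`, an assumption made here so that flipping the sign yields a clean `0.0`.

```python
import numpy as np

train_sizes = [1, 100, 500, 2000, 5000, 7654]

# Training scores (negative MSE) copied from the table: rows are training set
# sizes, columns are the five cross-validation splits.
train_scores = np.array([
    [-0.0, -0.0, -0.0, -0.0, -0.0],
    [-19.71230701, -18.31492642, -18.31492642, -18.31492642, -18.31492642],
    [-18.14420459, -19.63885072, -19.63885072, -19.63885072, -19.63885072],
    [-21.53603444, -20.18568787, -19.98317419, -19.98317419, -19.98317419],
    [-20.47708899, -19.93364211, -20.56091569, -20.4150839, -20.4150839],
    [-20.98565335, -20.63006094, -21.04384703, -20.63526811, -20.52955609],
])

# Average over the splits (axis=1) and flip the sign to get a positive MSE.
train_scores_mean = -train_scores.mean(axis=1)

for size, score in zip(train_sizes, train_scores_mean):
    print(f"{size:>5}: {score:.4f}")
```

Each training set size now has a single training MSE, which is what the plot needs.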

Now we have all the data we need to plot the learning curves.

Before doing the plotting, however, we need to stop and make an important observation. You might have noticed that some error scores on the training sets are the same. For the row corresponding to training set size of 1, this is expected, but what about other rows? With the exception of the last row, we have a lot of identical values. For instance, take the second row where we have identical values from the second split onward. Why is that so?

This is caused by not randomizing the training data for each split. Let’s walk through a single example with the aid of the diagram below. The data is divided into five consecutive chunks, and each split uses a different chunk for validation. When the training size is 500, the first 500 instances in the training set are selected. For the first split, the first chunk is held out for validation, so these 500 instances are taken from the second chunk. From the second split onward, the first chunk is part of the training set, so these 500 instances are taken from the first chunk. Because we don’t randomize the training set, the 500 instances used for training are the same from the second split onward. This explains the identical values from the second split onward for the 500 training instances case.

An identical reasoning applies to the 100 instances case, and a similar reasoning applies to the other cases.

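The reasoning can be checked with plain index arithmetic. This sketch assumes what the explanation above describes: five consecutive, unshuffled chunks, with the first `size` rows of each split's training portion used for fitting. The helper `first_training_rows` and the `pattern` dictionary are names made up for this illustration.

```python
import numpy as np

n_instances, n_splits = 9568, 5
indices = np.arange(n_instances)

# Unshuffled k-fold: consecutive chunks, each used once as the validation fold.
chunks = np.array_split(indices, n_splits)

def first_training_rows(split, size):
    """Rows used for training on a given split when nothing is shuffled."""
    train = np.concatenate([c for i, c in enumerate(chunks) if i != split])
    return train[:size]

pattern = {}
for size in (100, 500, 2000, 5000):
    subsets = [first_training_rows(s, size) for s in range(n_splits)]
    # For each split: does it train on exactly the same rows as the last split?
    pattern[size] = [bool(np.array_equal(sub, subsets[-1])) for sub in subsets]
    print(size, pattern[size])
```

The `True` entries line up with the repeated values in the table: from the second split onward for sizes 100 and 500, from the third for 2000, and from the fourth for 5000.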

To stop this behavior, we need to set the shuffle parameter to True in the learning_curve() function. This will randomize the indices for the training data for each split. We haven’t randomized above for two reasons:

The data comes pre-shuffled five times (as mentioned in the documentation) so there’s no need to randomize anymore.
I wanted to make you aware of this quirk in case you stumble upon it in practice.

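For completeness, here is a hedged sketch of what turning shuffling on looks like. The data is synthetic (invented here so the snippet runs on its own), and `random_state=0` is an arbitrary choice to make the shuffling reproducible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 4))   # synthetic features, for illustration only
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(size=1000)

_, shuffled_train_scores, _ = learning_curve(
    LinearRegression(), X_demo, y_demo,
    train_sizes=[100, 500], cv=5,
    scoring='neg_mean_squared_error',
    shuffle=True, random_state=0,   # pick a random subset of training rows for each split
)

print(shuffled_train_scores)
```

With shuffling on, each split trains on a different subset, so the repeated values within a row should disappear.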

Finally, let’s do the plotting.

Learning curves – high bias and low variance

We plot the learning curves using a regular matplotlib workflow:

import matplotlib.pyplot as plt
%matplotlib inline

# On matplotlib 3.6 and later this style is named 'seaborn-v0_8'
plt.style.use('seaborn')

plt.plot(train_sizes, train_scores_mean, label = 'Training error')
plt.plot(train_sizes, validation_scores_mean, label = 'Validation error')

plt.ylabel('MSE', fontsize = 14)
plt.xlabel('Training set size', fontsize = 14)
plt.title('Learning curves for a linear regression model', fontsize = 18, y = 1.03)
plt.legend()
plt.ylim(0,40)

There’s a lot of information we can extract from this plot. Let’s proceed granularly.

When the training set size is 1, we can see that the MSE for the training set is 0. This is normal behavior, since the model has no problem fitting a single data point perfectly. So when tested upon the same data point, the prediction is perfect.

But when tested on the validation set (which has 1914 instances), the MSE rockets up to roughly 423.4. This relatively high value is the reason we restrict the y-axis range between 0 and 40. This enables us to read most MSE values with precision. Such a high value is expected, since it’s extremely unlikely that a model trained on a single data point can generalize accurately to 1914 new instances it hasn’t seen in training.

When the training set size increases to 100, the training MSE increases sharply, while the validation MSE decreases just as sharply. The linear regression model doesn’t predict all 100 training points perfectly, so the training MSE is greater than 0. However, the model performs much better now on the validation set because it’s estimated with more data.

From 500 training data points onward, the validation MSE stays roughly the same. This tells us something extremely important: adding more training data points won’t lead to significantly better models. So instead of wasting time (and possibly money) on collecting more data, we need to try something else, like switching to an algorithm that can build more complex models.

To avoid a misconception here, it’s important to notice that what really won’t help is adding more instances (rows) to the training data. Adding more features, however, is a different thing and is very likely to help because it will increase the complexity of our current model.

Let’s now move to diagnosing bias and variance. The main indicator of a bias problem is a high validation error. In our case, the validation MSE stagnates at a value of approximately 20. But how good is that? We’d benefit from some domain knowledge (perhaps physics or engineering in this case) to answer this, but let’s give it a try.

Technically, that value of 20 has MW$^2$ (megawatts squared) as units (the units get squared as well when we compute the MSE). But the values in our target column are in MW (according to the documentation). Taking the square root of 20 MW$^2$ results in approximately 4.5 MW. Each target value represents net hourly electrical energy output. So for each hour our model is off by roughly 4.5 MW (this is the root mean squared error, a typical size of the error rather than an exact average). According to this Quora answer, 4.5 MW is equivalent to the heat power produced by 4500 handheld hair dryers. And this would add up if we tried to predict the total energy output for one day or a longer period.

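The unit conversion is a one-liner. The value 20.0 is the approximate plateau read off the plot, so the result is only as precise as that reading.

```python
import math

mse = 20.0               # approximate validation MSE at the plateau, in MW^2
rmse = math.sqrt(mse)    # back in the units of the target, MW

print(f"RMSE = {rmse:.2f} MW")   # RMSE = 4.47 MW
```

Rounded to one decimal, that is the 4.5 MW quoted in the text.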

We can conclude that an MSE of 20 MW$^2$ is quite large. So our model has a bias problem. But is it a low bias problem or a high bias problem?

To find the answer, we need to look at the training error. If the training error is very low, it means that the training data is fitted very well by the estimated model. If the model fits the training data very well, it means it has low bias with respect to that set of data.

If the training error is high, it means that the training data is not fitted well enough by the estimated model. If the model fails to fit the training data well, it means it has high bias with respect to that set of data.

In our particular case, the training MSE plateaus at a value of roughly 20 MW$^2$. As we’ve already established, this is a high error score. Because the validation MSE is high, and the training MSE is high as well, our model has a high bias problem.

Now let’s move on to diagnosing possible variance problems. Estimating variance can be done in at least two ways:

By examining the gap between the validation learning curve and training learning curve.
By examining the training error: its value and its evolution as the training set sizes increase.

A narrow gap indicates low variance. Generally, the narrower the gap, the lower the variance. The opposite is also true: the wider the gap, the greater the variance. Let’s now explain why this is the case.

As we discussed earlier, if the variance is high, then the model fits training data too well. When training data is fitted too well, the model will have trouble generalizing to data it hasn’t seen in training. When such a model is tested on its training set, and then on a validation set, the training error will be low and the validation error will generally be high. As we change training set sizes, this pattern continues, and the differences between training and validation errors will determine the gap between the two learning curves.

The relationship between the training error, the validation error, and the gap can be summarized this way:

$$ \text{gap} = \text{validation error} - \text{training error} $$

So the bigger the difference between the two errors, the bigger the gap. The bigger the gap, the bigger the variance.

In our case, the gap is very narrow, so we can safely conclude that the variance is low.

High training MSE scores are also a quick way to detect low variance. If the variance of a learning algorithm is low, then the algorithm will come up with simplistic and similar models as we change the training sets. Because the models are overly simplified, they cannot even fit the training data well (they underfit the data). So we should expect high training MSEs. Hence, high training MSEs can be used as indicators of low variance.

In our case, the training MSE plateaus at around 20, and we’ve already concluded that’s a high value. So besides the narrow gap, we now have another confirmation that the variance is low.

So far, we can conclude that:

Our learning algorithm suffers from high bias and low variance, underfitting the training data.
Adding more instances (rows) to the training data is highly unlikely to lead to better models under the current learning algorithm.

One solution at this point is to change to a more complex learning algorithm. This should decrease the bias and increase the variance. A mistake would be to try to increase the number of training instances.

Generally, these other two fixes also work when dealing with a high bias and low variance problem:

Training the current learning algorithm on more features (to avoid collecting new data, you can easily generate polynomial features). This should lower the bias by increasing the model’s complexity.
Decreasing the regularization of the current learning algorithm, if it uses any. In a nutshell, regularization prevents the algorithm from fitting the training data too well. If we decrease regularization, the model will fit training data better, and, as a consequence, the variance will increase and the bias will decrease.

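As an illustration of the first fix, the sketch below adds degree-2 polynomial features in front of a linear regression. It is an example on synthetic data, not a result for the power plant data: the target is deliberately built from squared and interaction terms (an assumption made here) so that a plain linear model underfits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(500, 2))
# Synthetic target with a squared term and an interaction term (illustration only).
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 0] * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

plain = LinearRegression().fit(X_demo, y_demo)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_demo, y_demo)

def training_mse(model):
    return float(np.mean((model.predict(X_demo) - y_demo) ** 2))

plain_mse, poly_mse = training_mse(plain), training_mse(poly)
print(f"plain features: {plain_mse:.3f}, degree-2 features: {poly_mse:.3f}")
```

The same learning algorithm fits the training data far better once it is given the extra features, which is the drop in bias the list item refers to. Whether this helps on real data has to be checked on a validation set.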

Learning curves – low bias and high variance

Let’s see how an unregularized Random Forest regressor fares here. We’ll generate the learning curves using the same workflow as above. This time we’ll bundle everything into a function so we can reuse it later. For comparison, we’ll also display the learning curves for the linear regression model above.

### Bundling our previous work into a function ###
def learning_curves(estimator, data, features, target, train_sizes, cv):
    train_sizes, train_scores, validation_scores = learning_curve(
        estimator, data[features], data[target], train_sizes = train_sizes,
        cv = cv, scoring = 'neg_mean_squared_error')
    train_scores_mean = -train_scores.mean(axis = 1)
    validation_scores_mean = -validation_scores.mean(axis = 1)

    plt.plot(train_sizes, train_scores_mean, label = 'Training error')
    plt.plot(train_sizes, validation_scores_mean, label = 'Validation error')

    plt.ylabel('MSE', fontsize = 14)
    plt.xlabel('Training set size', fontsize = 14)
    title = 'Learning curves for a ' + str(estimator).split('(')[0] + ' model'
    plt.title(title, fontsize = 18, y = 1.03)
    plt.legend()
    plt.ylim(0,40)

### Plotting the two learning curves ###
from sklearn.ensemble import RandomForestRegressor

plt.figure(figsize = (16,5))

for model, i in [(RandomForestRegressor(), 1), (LinearRegression(),2)]:
    plt.subplot(1,2,i)
    learning_curves(model, electricity, features, target, train_sizes, 5)

Now let’s try to apply what we’ve just learned. It’d be a good idea to pause reading at this point and try to interpret the new learning curves yourself.

Looking at the validation curve, we can see that we’ve managed to decrease bias. There is still some significant bias, but not as much as before. Looking at the training curve, we can deduce that this time the bias is low.

The new gap between the two learning curves suggests a substantial increase in variance. The low training MSEs corroborate this diagnosis of high variance.

The large gap and the low training error also indicate an overfitting problem. Overfitting happens when the model performs well on the training set but far worse on the test (or validation) set.

One more important observation we can make here is that adding new training instances is very likely to lead to better models. The validation curve doesn’t plateau at the maximum training set size used. It still has potential to decrease and converge toward the training curve, similar to the convergence we see in the linear regression case.

So far, we can conclude that:

Our learning algorithm (random forests) suffers from high variance and quite a low bias, overfitting the training data.
Adding more training instances is very likely to lead to better models under the current learning algorithm.

At this point, here are a couple of things we could do to improve our model:

Adding more training instances.
Increasing the regularization for our current learning algorithm. This should decrease the variance and increase the bias.
Reducing the number of features in the training data we currently use. The algorithm will still fit the training data very well, but with fewer features it will build less complex models. This should increase the bias and decrease the variance.

In our case, we don’t have any other readily available data. We could go into the power plant and take some measurements, but we’ll save this for another post (just kidding).

Let’s instead try to regularize our random forest algorithm. One way to do that is to limit the maximum number of leaf nodes in each decision tree, using the max_leaf_nodes parameter of RandomForestRegressor(). It’s not necessary for you to understand this regularization technique. For our purpose here, what you need to focus on is the effect of this regularization on the learning curves.
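A minimal sketch of this regularization step follows. It is not the code used for the figures in this post: it assumes a synthetic data set from make_regression as a stand-in for the power plant data, and the values n_estimators=30 and max_leaf_nodes=20 are arbitrary choices for illustration. It compares the mean training and validation MSE of an unrestricted forest with those of a forest whose trees have a capped number of leaves.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the power plant data (an assumption for this sketch).
X, y = make_regression(n_samples=400, n_features=4, noise=10.0, random_state=0)

# Absolute training set sizes; the largest must fit in one CV training fold (320 rows).
train_sizes = [50, 100, 200, 320]

def mean_mse_curves(estimator):
    """Return the sizes plus mean training and validation MSE for each size."""
    sizes, train_scores, validation_scores = learning_curve(
        estimator, X, y,
        train_sizes=train_sizes, cv=5,
        scoring='neg_mean_squared_error')
    # scikit-learn reports negated MSE, so flip the sign back.
    return sizes, -train_scores.mean(axis=1), -validation_scores.mean(axis=1)

models = {
    'unrestricted': RandomForestRegressor(n_estimators=30, random_state=0),
    # Capping the leaves per tree is the regularization discussed above.
    'max_leaf_nodes=20': RandomForestRegressor(
        n_estimators=30, max_leaf_nodes=20, random_state=0),
}

results = {}
for name, model in models.items():
    sizes, train_mse, validation_mse = mean_mse_curves(model)
    results[name] = (sizes, train_mse, validation_mse)
    print(name)
    for n, tr, va in zip(sizes, train_mse, validation_mse):
        print(f'  n={n:4d}  train MSE={tr:10.1f}  '
              f'validation MSE={va:10.1f}  gap={va - tr:10.1f}')
```

The printed gap column is the distance between the two curves at each training set size; the arrays in results could equally be passed to plt.plot to draw the curves.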

Not bad! The gap is now narrower, so there’s less variance. The bias seems to have increased just a bit, which is what we wanted.

But our work is far from over! The validation MSE still shows a lot of potential to decrease. Some steps you can take toward this goal include:

Adding more training instances.
Adding more features.
Feature selection.
Hyperparameter optimization.

The ideal learning curves and the irreducible error

Learning curves constitute a great tool to do a quick check on our models at every point in our machine learning workflow. But how do we know when to stop? How do we recognize the perfect learning curves?

For our earlier regression case, you might think that the perfect scenario is when both curves converge toward an MSE of 0. That would indeed be perfect, but unfortunately it’s not possible, neither in practice nor in theory. This is because of something called irreducible error.

When we build a model to map the relationship between the features $X$ and the target $Y$, we assume that there is such a relationship in the first place. Provided the assumption is true, there is a true model $f$ that describes perfectly the relationship between $X$ and $Y$, like so:

$$ Y = f(X) + \text{irreducible error} \tag{1} $$

But why is there an error?! Haven’t we just said that $f$ describes the relationship between X and Y perfectly?!

There’s an error because $Y$ is not only a function of our limited number of features $X$. There could be many other features that influence the value of $Y$, features we don’t have. It might also be the case that $X$ contains measurement errors. So, besides $X$, $Y$ is also a function of the irreducible error.

Now let’s explain why this error is irreducible. When we estimate $f(X)$ with a model $\hat{f}(X)$, we introduce another kind of error, called reducible error:

$$ f(X) = \hat{f}(X) + \text{reducible error} \tag{2} $$

Replacing $f(X)$ in $(1)$ we get:

$$ Y = \hat{f}(X) + \text{reducible error} + \text{irreducible error} \tag{3} $$

Error that is reducible can be reduced by building better models. Looking at equation $(2)$, we can see that if the reducible error is 0, our estimated model $\hat{f}(X)$ is equal to the true model $f(X)$. However, from $(3)$ we can see that the irreducible error remains in the equation even if the reducible error is 0. From here we deduce that no matter how good our model estimate is, there is generally still some error we cannot reduce. That’s why this error is considered irreducible.

This tells us that in practice the best possible learning curves we can see are those that converge to the value of some irreducible error, not toward some ideal error value (for MSE, the ideal error score is 0; we’ll see shortly that other error metrics have different ideal error values).
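The argument above can be checked numerically with a small simulation. This is only an illustrative sketch: the true function f, the imperfect estimate f_hat, and the noise level noise_std are all hypothetical choices made for the example, not something taken from the power plant data. Even when we predict with the true f (zero reducible error), the MSE stays near the variance of the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Hypothetical true relationship, chosen only for this illustration."""
    return 3.0 * x + 2.0

def f_hat(x):
    """Hypothetical imperfect estimate of f (wrong slope => reducible error)."""
    return 2.5 * x + 2.0

noise_std = 2.0  # assumed standard deviation of the irreducible error
x = rng.uniform(0.0, 10.0, size=100_000)
y = f(x) + rng.normal(0.0, noise_std, size=x.size)  # equation (1)

# A perfect model has zero reducible error: its predictions equal f(x).
mse_true_model = np.mean((y - f(x)) ** 2)

# An imperfect model pays the reducible error on top of the irreducible one.
mse_imperfect_model = np.mean((y - f_hat(x)) ** 2)

print(f'noise variance:             {noise_std ** 2:.3f}')
print(f'MSE of the true model:      {mse_true_model:.3f}')
print(f'MSE of the imperfect model: {mse_imperfect_model:.3f}')
```

In this setup the MSE of the true model lands close to the noise variance rather than 0, which is the floor that learning curves measured in MSE would converge to at best.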

In practice, the exact value of the irreducible error is almost always unknown. We also assume that the irreducible error is independent of $X$. This means that we cannot use $X$ to find the true irreducible error. Expressing the same thing in the more precise language of mathematics, there’s no function $g$ to map $X$ to the true value of the irreducible error:

$$ \text{irreducible error} \neq g(X) $$

So there’s no way to know the true value of the irreducible error based on the data we have. In practice, a good workaround is to try to lower the error score as much as possible, while keeping in mind that the limit is given by some irreducible error.

What about classification?

So far, we’ve learned about learning curves in a regression setting. For classification tasks, the workflow is almost identical. The main difference is that we’ll have to choose another error metric – one that is suitable for evaluating the performance of a classifier. Let’s see an example:

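A minimal sketch of the classification workflow is shown here. It assumes a synthetic data set from make_classification and a LogisticRegression classifier, both chosen only for illustration; the one change relative to the regression case is that scoring is set to 'accuracy', so no sign flip is needed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic binary classification data (an assumption for this sketch).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0)

train_sizes, train_scores, validation_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # fractions of the CV training fold
    cv=5,
    scoring='accuracy',                    # higher is better, unlike MSE
    shuffle=True, random_state=0)

# Accuracy is already a "goodness" score, so the means are used as they are.
train_accuracy = train_scores.mean(axis=1)
validation_accuracy = validation_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_accuracy, validation_accuracy):
    print(f'n={n:4d}  train accuracy={tr:.3f}  validation accuracy={va:.3f}')
```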

Unlike what we’ve seen so far, notice that the training curve is above the validation curve. This is because the score used, accuracy, describes how good the model is: the higher the accuracy, the better. The MSE, on the other hand, describes how bad a model is: the lower the MSE, the better.

This has implications for the irreducible error as well. For error metrics that describe how bad a model is, the irreducible error gives a lower bound: you cannot get lower than that. For error metrics that describe how good a model is, the irreducible error gives an upper bound: you cannot get higher than that.

As a side note, in more technical writing the term Bayes error rate is usually used to refer to the best possible error score of a classifier. The concept is analogous to the irreducible error.

Next steps

Learning curves constitute a great tool to diagnose bias and variance in any supervised learning algorithm. We’ve learned how to generate them using scikit-learn and matplotlib, and how to use them to diagnose bias and variance in our models.

Generate learning curves for a regression task using a different data set.
Generate learning curves for a classification task.
Generate learning curves for a supervised learning task by coding everything from scratch (don’t use learning_curve() from scikit-learn). Using cross-validation is optional.
Compare learning curves obtained without cross-validating with curves obtained using cross-validation. The two kinds of curves should be for the same learning algorithm.

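As a possible starting point for the from-scratch exercise, here is a minimal sketch that builds learning curves without learning_curve() and without cross-validation. It assumes synthetic linear data, a single fixed hold-out validation set, and a LinearRegression model, all chosen for illustration; scikit-learn is used only to fit the model and compute the MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic regression data (an assumption for this sketch).
X = rng.uniform(0.0, 10.0, size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0.0, 1.0, size=500)

# One fixed hold-out split: first 400 rows for training, last 100 for validation.
X_train_full, y_train_full = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

train_sizes = [5, 20, 50, 100, 200, 400]
train_errors, validation_errors = [], []

for n in train_sizes:
    # Train on the first n training rows only.
    X_sub, y_sub = X_train_full[:n], y_train_full[:n]
    model = LinearRegression().fit(X_sub, y_sub)

    # Training error is measured on the same n rows the model saw.
    train_errors.append(mean_squared_error(y_sub, model.predict(X_sub)))
    # Validation error is always measured on the same hold-out set.
    validation_errors.append(mean_squared_error(y_val, model.predict(X_val)))

for n, tr, va in zip(train_sizes, train_errors, validation_errors):
    print(f'n={n:4d}  train MSE={tr:.3f}  validation MSE={va:.3f}')
```

Repeating the loop over several different splits and averaging the errors would turn this into the cross-validated variant mentioned in the last exercise.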

Translated from: https://www.pybloggers.com/2018/01/learning-curves-for-machine-learning/
