Accurately Measuring Model Prediction Error

May 2012
When assessing the quality of a model, being able to accurately measure its prediction error is of key importance. Often, however, techniques of measuring error are used that give grossly misleading results. This can lead to the phenomenon of over-fitting where a model may fit the training data very well, but will do a poor job of predicting results for new data not used in model training. Here is an overview of methods to accurately measure model prediction error.

1 Measuring Error

When building prediction models, the primary goal should be to make a model that most accurately predicts the desired target value for new data. The measure of model error that is used should be one that achieves this goal. In practice, however, many modelers instead report a measure of model error that is based not on the error for new data but instead on the error on the very same data that was used to train the model. The use of this incorrect error measure can lead to the selection of an inferior and inaccurate model.

Naturally, any model is highly optimized for the data it was trained on. The expected error the model exhibits on new data will always be higher than the error it exhibits on the training data. As an example, we could go out and sample 100 people and create a regression model to predict an individual’s happiness based on their wealth. We can record the squared error for how well our model does on this training set of a hundred people. If we then sampled a different 100 people from the population and applied our model to this new group, the squared error will almost always be higher in this second case.

It is helpful to illustrate this fact with an equation. We can develop a relationship between how well a model predicts on new data (its true prediction error and the thing we really care about) and how well it predicts on the training data (which is what many modelers in fact measure).

True Prediction Error = Training Error + Training Optimism

Here, Training Optimism is basically a measure of how much worse our model does on new data compared to the training data. The more optimistic we are, the better our training error will be compared to what the true error is and the worse our training error will be as an approximation of the true error.

1.1 The Danger of Overfitting

In general, we would like to be able to make the claim that the optimism is constant for a given training set. If this were true, we could argue that the model that minimizes training error will also be the model that minimizes the true prediction error for new data. As a consequence, even though our reported training error might be a bit optimistic, using it to compare models would still cause us to select the best model amongst those we have available. So we could in effect ignore the distinction between the true error and training error for model selection purposes.

Unfortunately, this does not work. It turns out that the optimism is a function of model complexity: as complexity increases, so does optimism. Thus the relationship above for true prediction error becomes something like this:

True Prediction Error = Training Error + f(Model Complexity)

How is the optimism related to model complexity? As model complexity increases (for instance by adding parameter terms in a linear regression) the model will always do a better job fitting the training data. This is a fundamental property of statistical models 1. In our happiness prediction model, we could use people’s middle initials as predictor variables and the training error would go down. We could use stock prices on January 1st, 1990 for a now bankrupt company, and the error would go down. We could even just roll dice to get a data series and the error would still go down. No matter how unrelated the additional factors are to a model, adding them will cause training error to decrease.

But at the same time, as we increase model complexity we can see a change in the true prediction accuracy (what we really care about). If we build a model for happiness that incorporates clearly unrelated factors such as stock ticker prices a century ago, we can say with certainty that such a model must necessarily be worse than the model without the stock ticker prices. Although the stock prices will decrease our training error (if only very slightly), they must conversely increase our prediction error on new data, as they increase the variability of the model’s predictions and so make new predictions worse. Furthermore, even adding clearly relevant variables to a model can in fact increase the true prediction error if the signal-to-noise ratio of those variables is weak.

Let’s see what this looks like in practice. We can implement our wealth and happiness model as a linear regression. We can start with the simplest regression possible, where Happiness = a + b·Wealth + ϵ, and then we can add polynomial terms to model nonlinear effects. Each polynomial term we add increases model complexity. So we could get an intermediate level of complexity with a quadratic model like Happiness = a + b·Wealth + c·Wealth² + ϵ, or a high level of complexity with a higher-order polynomial like Happiness = a + b·Wealth + c·Wealth² + d·Wealth³ + e·Wealth⁴ + f·Wealth⁵ + g·Wealth⁶ + ϵ.
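This progression is easy to demonstrate numerically. The sketch below is a hypothetical illustration using synthetic data and numpy (the wealth survey in the text is imaginary, so the data-generating function, seed, and noise level here are all assumptions); it fits polynomials of increasing degree and shows that training error can only fall as complexity grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the wealth/happiness survey: a mildly
# nonlinear relationship plus noise.
wealth = rng.uniform(0, 1, 100)
happiness = np.sin(2 * wealth) + rng.normal(0, 0.2, 100)

def training_mse(degree):
    """Fit a polynomial of the given degree; return its training MSE."""
    coeffs = np.polyfit(wealth, happiness, degree)
    return np.mean((happiness - np.polyval(coeffs, wealth)) ** 2)

# Linear, quadratic, and 6th-order models, as in the text.
# Because the models are nested, training error can only decrease.
errors = [training_mse(d) for d in (1, 2, 6)]
print(errors)
```

Note that nothing here says the degree-6 model predicts *new* data better; the shrinking numbers measure training error only.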

The figure below illustrates the relationship between the training error, the true prediction error, and optimism for a model like this. The scatter plots on top illustrate sample data with regression lines corresponding to different levels of model complexity.

Training, optimism and true prediction error.

Increasing the model complexity will always decrease the model training error. At very high levels of complexity, we should be able to in effect perfectly predict every single point in the training data set and the training error should be near 0. Similarly, the true prediction error initially falls. The linear model without polynomial terms seems a little too simple for this data set. However, once we pass a certain point, the true prediction error starts to rise. At these high levels of complexity, the additional complexity we are adding helps us fit our training data, but it causes the model to do a worse job of predicting new data.

This is a case of overfitting the training data. In this region the model training algorithm is focusing on precisely matching random chance variability in the training set that is not present in the actual population. We can see this most markedly in the model that fits every point of the training data; clearly this is too tight a fit to the training data.

Preventing overfitting is a key to building robust and accurate prediction models. Overfitting is very easy to miss when only looking at the training error curve. To detect overfitting you need to look at the true prediction error curve. Of course, it is impossible to measure the exact true prediction curve (unless you have the complete data set for your entire population), but there are many different ways that have been developed to attempt to estimate it with great accuracy. The second section of this work will look at a variety of techniques to accurately estimate the model’s true prediction error.

1.2 An Example of the Cost of Poorly Measuring Error

Let’s look at a fairly common modeling workflow and use it to illustrate the pitfalls of using training error in place of the true prediction error 2. We’ll start by generating 100 simulated data points. Each data point has a target value we are trying to predict along with 50 different parameters. For instance, this target value could be the growth rate of a species of tree and the parameters are precipitation, moisture levels, pressure levels, latitude, longitude, etc. In this case however, we are going to generate every single data point completely randomly. Each number in the data set is completely independent of all the others, and there is no relationship between any of them.

For this data set, we create a linear regression model where we predict the target value using the fifty regression variables. Since we know everything is unrelated, we would hope to find an R² of 0. Unfortunately, that is not the case; instead we find an R² of 0.5. That’s quite impressive given that our data is pure noise! However, we want to confirm this result, so we do an F-test. This test measures the statistical significance of the overall regression to determine if it is better than what would be expected by chance. Using the F-test we find a p-value of 0.53. This indicates our regression is not significant.

If we stopped there, everything would be fine; we would throw out our model, which would be the right choice (it is pure noise after all!). However, a common next step would be to throw out only the parameters that were poor predictors, keep the ones that are relatively good predictors, and run the regression again. Let’s say we keep the parameters that were significant at the 25% level, of which there are 21 in this example case. Then we rerun our regression.

In this second regression we would find:
• An R² of 0.36
• A p-value of 5×10⁻⁴
• 6 parameters significant at the 5% level
Again, this data was pure noise; there was absolutely no relationship in it. But from our data we find a highly significant regression, a respectable R² (which can be very high compared to those found in some fields like the social sciences) and 6 significant parameters!

This is quite a troubling result, and this procedure is not an uncommon one, but it clearly leads to incredibly misleading results. It shows how easily statistical processes can be heavily biased if care is not taken to accurately measure error.

2 Methods of Measuring Error

2.1 Adjusted R²

The R² measure is by far the most widely used and reported measure of error and goodness of fit. R² is calculated quite simply. First the proposed regression model is trained and the differences between the predicted and observed values are calculated and squared. These squared errors are summed and the result is compared to the sum of the squared errors generated using the null model. The null model is a model that simply predicts the average target value regardless of what the input values for that point are. The null model can be thought of as the simplest model possible and serves as a benchmark against which to test other models. Mathematically:

R² = 1 − (Sum of Squared Errors of the Model) / (Sum of Squared Errors of the Null Model)

R² has very intuitive properties. When our model does no better than the null model, R² will be 0. When our model makes perfect predictions, R² will be 1. R² is an easy-to-understand error measure that is in principle generalizable across all regression models.

Commonly, R² is only applied as a measure of training error. This is unfortunate, as we saw in the above example how you can get a high R² even with data that is pure noise. In fact there is an analytical relationship to determine the expected R² value given a set of n observations and p parameters, each of which is pure noise:

E[R²] = p/n

So if you incorporate enough parameters in your model you can effectively force whatever level of R² you want, regardless of what the true relationship is. In our illustrative example above with 50 parameters and 100 observations, we would expect an R² of 50/100 or 0.5.

One attempt to adjust for this phenomenon and penalize additional complexity is Adjusted R². Adjusted R² reduces R² as more parameters are added to the model. There is a simple relationship between adjusted and regular R²:

Adjusted R² = 1 − (1 − R²) · (n − 1) / (n − p − 1)

Unlike regular R², the error predicted by adjusted R² will start to increase as model complexity becomes very high. Adjusted R² is much better than regular R² and, due to this fact, it should always be used in place of regular R². However, adjusted R² does not perfectly match up with the true prediction error. In fact, adjusted R² generally under-penalizes complexity. That is, it fails to lower its estimate of prediction accuracy as much as the added complexity requires.
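In code, the adjustment is a one-liner. A minimal sketch, applied to the n = 100, p = 50 pure-noise example from the text:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R²: shrinks R² as the parameter count p grows."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# The raw R² of 0.5 found for 50 noise parameters on 100 observations
# is adjusted down to roughly zero (slightly negative, in fact).
print(round(adjusted_r2(0.5, 100, 50), 3))  # -0.01
```

So in the pure-noise case the adjustment works well; the under-penalization shows up in less extreme settings.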

Given this, the usage of adjusted R² can still lead to overfitting. Furthermore, adjusted R² is based on certain parametric assumptions that may or may not be true in a specific application. This can further lead to incorrect conclusions based on the usage of adjusted R².

Pros
• Easy to apply
• Built into most existing analysis programs
• Fast to compute
• Easy to interpret 3
Cons
• Less generalizable
• May still overfit the data

2.2 Information Theoretic Approaches

There are a variety of approaches which attempt to measure model error as how much information is lost between a candidate model and the true model. Of course the true model (what was actually used to generate the data) is unknown, but given certain assumptions we can still obtain an estimate of the difference between it and our proposed models. For a given problem, the larger this difference is, the higher the error and the worse the tested model is.

Information theoretic approaches assume a parametric model. Given a parametric model, we can define the likelihood of a set of data and parameters as, colloquially, the probability of observing the data given the parameters 4. If we adjust the parameters in order to maximize this likelihood, we obtain the maximum likelihood estimate of the parameters for a given model and data set. We can then compare different models and differing model complexities using information theoretic approaches to attempt to determine the model that is closest to the true model, accounting for the optimism.

The most popular of these information theoretic techniques is Akaike’s Information Criterion (AIC). It can be defined as a function of the likelihood of a specific model and the number of parameters in that model:

AIC = −2 ln(Likelihood) + 2p

Like other error criteria, the goal is to minimize the AIC value. The AIC formulation is very elegant. The first part (−2 ln(Likelihood)) can be thought of as the training set error rate and the second part (2p) can be thought of as the penalty to adjust for the optimism.

However, in addition to AIC there are a number of other information theoretic criteria that can be used. The two following examples are different information theoretic criteria with alternative derivations. In these cases, the optimism adjustment takes a different form and depends on the sample size (n).

AICc = −2 ln(Likelihood) + 2p + 2p(p + 1) / (n − p − 1)
BIC = −2 ln(Likelihood) + p ln(n)

The choice of which information theoretic approach to use is a very complex one and depends on a lot of specific theory, practical considerations, and sometimes even philosophical ones. This often makes the application of these approaches a leap of faith that the specific equation used is theoretically suitable to a specific data and modeling problem.
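All three criteria are simple functions of the maximized log-likelihood, so they are easy to compute once a model has been fit. A minimal sketch (the log-likelihood values and parameter counts below are made up purely for illustration):

```python
import math

def aic(log_lik, p):
    """Akaike's Information Criterion: fit term plus 2p penalty."""
    return -2 * log_lik + 2 * p

def aicc(log_lik, p, n):
    """AIC with a small-sample correction that vanishes as n grows."""
    return aic(log_lik, p) + 2 * p * (p + 1) / (n - p - 1)

def bic(log_lik, p, n):
    """BIC: the penalty grows with ln(n), favoring smaller models."""
    return -2 * log_lik + p * math.log(n)

# A slightly better likelihood does not justify two extra parameters:
# here the smaller model wins under AIC (106 vs. 109).
print(aic(-50.0, 3), aic(-49.5, 5))
```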

Pros
• Easy to apply
• Built into most advanced analysis programs
Cons
• Metric not comparable between different applications
• Requires a model that can generate likelihoods 5
• Various forms are a topic of theoretical debate within the academic field

2.3 Holdout Set

Both the preceding techniques are based on parametric and theoretical assumptions. If these assumptions are incorrect for a given data set then the methods will likely give erroneous results. Fortunately, there exists a whole separate set of methods to measure error that do not make these assumptions and instead use the data itself to estimate the true prediction error.

The simplest of these techniques is the holdout set method. Here we initially split our data into two groups. One group will be used to train the model; the second group will be used to measure the resulting model’s error. For instance, if we had 1000 observations, we might use 700 to build the model and the remaining 300 samples to measure that model’s error.

Holdout data split.
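The split and evaluation above can be sketched in a few lines. This hypothetical example uses synthetic data (a simple linear relationship with made-up parameters) and the text's 1000-observation, 700/300 split:

```python
import random

random.seed(0)
# Synthetic data: y = 2x + noise. Any supervised data works here.
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(1000)]

# Shuffle, then take a 700/300 train/holdout split.
random.shuffle(data)
train, holdout = data[:700], data[700:]

# Fit simple linear regression on the training set only.
mx = sum(x for x, _ in train) / len(train)
my = sum(y for _, y in train) / len(train)
slope = (sum((x - mx) * (y - my) for x, y in train)
         / sum((x - mx) ** 2 for x, _ in train))
intercept = my - slope * mx

# Measure error only on the untouched holdout set.
holdout_mse = (sum((y - (slope * x + intercept)) ** 2 for x, y in holdout)
               / len(holdout))
print(round(slope, 3), round(holdout_mse, 2))
```

The shuffle before splitting matters: if the data is ordered (by time, by site, etc.), an unshuffled split would give a biased error estimate.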

This technique is really a gold standard for measuring the model’s true prediction error. As defined, the model’s true prediction error is how well the model will predict for new data. By holding out a test data set from the beginning we can directly measure this.

The cost of the holdout method comes in the amount of data that is removed from the model training process. For instance, in the illustrative example here, we removed 30% of our data. This means that our model is trained on a smaller data set and its error is likely to be higher than if we trained it on the full data set. The standard procedure in this case is to report your error using the holdout set, and then train a final model using all your data. The reported error is likely to be conservative in this case, with the true error of the full model actually being lower. Such conservative predictions are almost always more useful in practice than overly optimistic predictions.

One key aspect of this technique is that the holdout data must truly not be analyzed until you have a final model. A common mistake is to create a holdout set, train a model, test it on the holdout set, and then adjust the model in an iterative process. If you repeatedly use a holdout set to test a model during development, the holdout set becomes contaminated. Its data has been used as part of the model selection process and it no longer gives unbiased estimates of the true model prediction error.

Pros
• No parametric or theoretic assumptions
• Given enough data, highly accurate
• Very simple to implement
• Conceptually simple
Cons
• Potential conservative bias
• Tempting to use the holdout set prior to model completion resulting in contamination
• Must choose the size of the holdout set (70%-30% is a common split)

2.4 Cross-Validation and Resampling

In some cases, the cost of setting aside a significant portion of the data set, as the holdout method requires, is too high. As a solution, in these cases a resampling-based technique such as cross-validation may be used instead.

Cross-validation works by splitting the data up into a set of n folds. So, for example, in the case of 5-fold cross-validation with 100 data points, you would create 5 folds each containing 20 data points. Then the model building and error estimation process is repeated 5 times. Each time four of the groups are combined (resulting in 80 data points) and used to train your model. Then the 5th group of 20 points that was not used to construct the model is used to estimate the true prediction error. In the case of 5-fold cross-validation you would end up with 5 error estimates that could then be averaged to obtain a more robust estimate of the true prediction error.

5-Fold cross-validation data split.
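The 5-fold procedure can be sketched directly. This hypothetical example uses 100 synthetic points and a simple linear fit (the data-generating function and seed are illustrative assumptions); each fold of 20 serves as the test set exactly once:

```python
import random
import statistics

random.seed(0)
# 100 synthetic observations of y = 3x + noise.
data = [(x / 100, 3 * x / 100 + random.gauss(0, 0.1)) for x in range(100)]
random.shuffle(data)

def fit_and_mse(train, test):
    """Least-squares line fit on train; mean squared error on test."""
    mx = statistics.mean(x for x, _ in train)
    my = statistics.mean(y for _, y in train)
    b = (sum((x - mx) * (y - my) for x, y in train)
         / sum((x - mx) ** 2 for x, _ in train))
    a = my - b * mx
    return statistics.mean((y - (a + b * x)) ** 2 for x, y in test)

# Split into 5 folds of 20 points each.
folds = [data[i::5] for i in range(5)]
errors = []
for i in range(5):
    test = folds[i]
    train = [pt for j, fold in enumerate(folds) if j != i for pt in fold]
    errors.append(fit_and_mse(train, test))

# Average the 5 estimates for a more robust error estimate.
cv_estimate = statistics.mean(errors)
print(len(errors), round(cv_estimate, 4))
```

The spread of the five per-fold errors also gives a rough sense of the variability of the estimate, as discussed below.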

As can be seen, cross-validation is very similar to the holdout method. Where it differs is that each data point is used both to train models and to test a model, but never at the same time. Where data is limited, cross-validation is preferred to the holdout set, as less data must be set aside in each fold than is needed in the pure holdout method. Cross-validation can also give estimates of the variability of the true error estimate, which is a useful feature. However, if understanding this variability is a primary goal, other resampling methods such as Bootstrapping are generally superior.

One important question in cross-validation is what number of folds to use. Basically, the smaller the number of folds, the more biased the error estimates (they will be biased to be conservative, indicating higher error than there is in reality) but the less variable they will be. On the extreme end you can have one fold for each data point, which is known as Leave-One-Out Cross-Validation. In this case, your error estimate is essentially unbiased but it could potentially have high variance. Understanding the Bias-Variance Tradeoff is important when making these decisions. Another factor to consider is computational time, which increases with the number of folds. For each fold you will have to train a new model, so if this process is slow, it might be prudent to use a small number of folds. Ultimately, it appears that, in practice, 5-fold or 10-fold cross-validation are generally effective choices.

Pros
• No parametric or theoretic assumptions
• Given enough data, highly accurate
• Conceptually simple
Cons
• Computationally intensive
• Must choose the fold size
• Potential conservative bias

3 Making a Choice

In summary, here are some techniques you can use to more accurately measure model prediction error:
• Adjusted R²
• Information Theoretic Techniques
• Holdout Sample
• Cross-Validation and Resampling Methods

A fundamental choice a modeler must make is whether to rely on theoretic and parametric assumptions to adjust for the optimism, as the first two methods require, or to instead use the data itself to estimate the optimism.

Generally, the assumption-based methods are much faster to apply, but this convenience comes at a high cost. First, the assumptions that underlie these methods are generally wrong. How wrong they are and how much this skews results varies on a case by case basis. The error might be negligible in many cases, but fundamentally results derived from these techniques require a great deal of trust on the part of evaluators that this error is small.

Ultimately, in my own work I prefer cross-validation based approaches. Cross-validation provides good error estimates with minimal assumptions. The primary cost of cross-validation is computational intensity but with the rapid increase in computing power, this issue is becoming increasingly marginal. At its root, the cost with parametric assumptions is that even though they are acceptable in most cases, there is no clear way to show their suitability for a specific case. Thus their use provides lines of attack to critique a model and throw doubt on its results. Although cross-validation might take a little longer to apply initially, it provides more confidence and security in the resulting conclusions.


Scott Fortmann-Roe

  1. At least statistical models where the error surface is convex (i.e. no local minimums or maximums). If local minimums or maximums exist, it is possible that adding additional parameters will make it harder to find the best solution and training error could go up as complexity is increased. Most off-the-shelf algorithms are convex (e.g. linear and logistic regressions) as this is a very important feature of a general algorithm. ↩

  2. This example is taken from Freedman, L. S., & Pee, D. (1989). Return to a note on screening regression equations. The American Statistician, 43(4), 279-282. ↩

  3. Although adjusted R² does not have the same statistical definition as R² (the fraction of squared error explained by the model over the null), it is still on the same scale as regular R² and can be interpreted intuitively. However, in contrast to regular R², adjusted R² can become negative (indicating a worse fit than the null model). ↩

  4. This definition is colloquial because in any non-discrete model, the probability of any given data set is actually 0. If you randomly chose a number between 0 and 1, the chance that you draw the number 0.724027299329434… is 0. You will never draw the exact same number out to an infinite number of decimal places. The likelihood is calculated by evaluating the probability density function of the model at the given point specified by the data. To get a true probability, we would need to integrate the probability density function across a range. Since the likelihood is not a probability, you can obtain likelihoods greater than 1. Still, even given this, it may be helpful to conceptually think of likelihood as the “probability of the data given the parameters”; just be aware that this is technically incorrect! ↩

  5. This can constrain the types of models which may be explored and will exclude certain powerful analysis techniques such as Random Forests and Artificial Neural Networks. ↩
