Building and Evaluating Classification ML Models

Machine learning problems come in different types. Some are regression problems (continuous targets), while others are classification problems (discrete targets). Some have no target at all: you try to learn the structure of the data by grouping data points into clusters based on their inherent features.

This article, however, is not about the different areas of machine learning. It is about a small but important set of details that, if neglected, can wreak havoc on your deployed classification models and, eventually, on the business. So the next time someone at work tells you that their model reaches ~93.23% accuracy, do not take it at face value before asking the right questions.

So, how do we know what the right questions are?

That’s a good question. Let us try to answer it by studying how to build and evaluate a classification model the right way. Almost everyone who has studied machine learning knows the frequently used classification metrics, but only a few know which one is right for evaluating their own classification model.

To help you ask the right questions, we will go through the following concepts in detail (for a classification model):

Data Distribution (Training, Validation, Test)
Handling Class Imbalance
The Right Choice of Metric for Model Evaluation

Data Distribution

When splitting data into training, validation, and test sets, always keep in mind that all three must be representative of the same population. For example, suppose you train a digit classifier on the MNIST dataset, where all digit images are grayscale, and reach a validation accuracy of almost 90%, but your test data contains digit images in various colors. Now you have a problem: the validation and test sets come from different distributions. Some data bias will always remain and you cannot get rid of it completely, but you can keep the validation and test sets consistent with each other.

The Right Strategy to Split Data

Your test set MUST always represent the real-world data distribution. For example, in a binary classification problem where you detect patients who are positive for a rare disease (class 1), and 6% of the entire dataset consists of positive cases, your test data should contain almost the same proportion. This is not specific to classification models; it holds for every type of ML modeling problem.

The Right Sequence for Training, Validation & Test Split

The test set should be extracted first, without any leakage into the remaining data. The validation set comes next and must follow the same distribution as the test set. Whatever remains after these two splits goes into training. Hence, the right sequence for dividing the entire dataset is test, validation, and training, in that order.
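The split order described above can be sketched in plain Python. This is a minimal illustration, not the author's code: the function name `stratified_split`, the toy labels, and the 10%/20% fractions are all assumptions made for the example. Sampling is done per class so that each split keeps roughly the same class proportions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_frac, val_frac, seed=0):
    """Return (train_idx, val_idx, test_idx), drawn without replacement.

    The test set is carved out first, then validation, and whatever
    remains becomes the training set. Sampling is done per class so
    every split keeps roughly the same class proportions.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx, test_idx = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle once, then slice: no index is reused
        n_test = round(len(indices) * test_frac)
        n_val = round(len(indices) * val_frac)
        test_idx.extend(indices[:n_test])                # 1. test
        val_idx.extend(indices[n_test:n_test + n_val])   # 2. validation
        train_idx.extend(indices[n_test + n_val:])       # 3. training
    return train_idx, val_idx, test_idx


# Hypothetical toy data: 6% positives, mirroring the rare-disease example
labels = [1] * 60 + [0] * 940
train_idx, val_idx, test_idx = stratified_split(labels, test_frac=0.1, val_frac=0.2)

for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    positives = sum(labels[i] for i in idx)
    print(name, len(idx), round(positives / len(idx), 3))
```

In this toy case every split ends up with a 6% positive rate, and no index appears in more than one split.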

The Right Proportion

A 70–20–10 split is a common convention in the machine learning community, but it only makes sense when you have a moderate amount of data. If you are working on, say, an image classification problem with ~10 million images, a 70–20–10 split would be a bad idea: the dataset is so large that even 1 to 2% of it is enough to validate your model. I would rather go with a 96–2–2 split, because 2% of the data is enough for the validation and test sets to represent the distribution, and larger sets only add unnecessary overhead. Also, make sure you do not sample with replacement while making the splits.
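To make the proportions concrete, here is a small sketch that turns split percentages into example counts. The helper `split_sizes` is hypothetical, and the 10,000-example dataset is an assumed stand-in for "a moderate amount of data".

```python
def split_sizes(n_total, train_pct, val_pct, test_pct):
    """Number of examples in each split, given integer percentages."""
    assert train_pct + val_pct + test_pct == 100
    n_val = n_total * val_pct // 100
    n_test = n_total * test_pct // 100
    n_train = n_total - n_val - n_test  # the remainder goes to training
    return n_train, n_val, n_test


print(split_sizes(10_000, 70, 20, 10))       # (7000, 2000, 1000)
print(split_sizes(10_000_000, 70, 20, 10))   # (7000000, 2000000, 1000000)
print(split_sizes(10_000_000, 96, 2, 2))     # (9600000, 200000, 200000)
```

Even at 2%, the validation set of the 10-million-image dataset holds 200,000 examples, a hundred times more than the 2,000 that a 20% share gives on the smaller dataset.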

Handling Class Imbalance

In any classification problem, what affects the performance of a model most is how much loss each class contributes to the total cost. The more examples a class has, the higher its total loss contribution; all else being equal, the contribution of a class is roughly proportional to the number of examples belonging to it. As a result, the classifier concentrates on correctly classifying the instances that contribute most to the total cost, i.e. the instances from the majority class.
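As a rough illustration (an assumed toy case, not taken from the original article), suppose an untrained classifier outputs a probability of 0.5 for every example and the loss is binary cross-entropy. Each example then contributes $-\log 0.5 = \log 2 \approx 0.693$ regardless of its class, so with 25 positive and 75 negative examples the class totals differ only because of the class counts:

```latex
\begin{aligned}
\text{positive-class loss} &= 25 \times \log 2 \approx 17.33 \\
\text{negative-class loss} &= 75 \times \log 2 \approx 51.99
\end{aligned}
```

The 1:3 ratio of the two totals is exactly the ratio of the class sizes.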

We can tackle class imbalance in the following ways:

Weighted Loss
Resampling

Weighted Loss

For binary cross-entropy, the loss for a single example with label y and predicted probability p is:

$$L(y, p) = -\big[\, y \log(p) + (1 - y) \log(1 - p) \,\big]$$

The model outputs the probability that a given example belongs to the positive (y=1) class. Based on the binary cross-entropy loss function above, a loss value is computed per example, and the total cost is the average loss across all examples. Let us run a simple simulation in a small Python script to understand this better. We generate 100 ground truth labels, 25 of which belong to the positive (y=1) class and the rest to the negative (y=0) class, to create class imbalance in our tiny experiment. For every example we also generate a random value as its predicted probability of belonging to the positive class.

```python
import numpy as np
import random

# Generating ground truth labels and predicted probabilities
truth, probs = [], []
for i in range(100):
    # To maintain class imbalance
    if i < 25:
        truth.append(1)
    else:
        truth.append(0)
    probs.append(round(random.random(), 2))

print("Total Positive Example Count: ", sum(truth))
print("Total Negative Example Count: ", len(truth) - sum(truth))
print("Predicted Probability Values: ", probs)
```

Output:

```
Total Positive Example Count:  25
Total Negative Example Count:  75
Predicted Probability Values:  [0.84, 0.65, 0.11, 0.21, 0.31, 0.05, 0.44, 0.83, 0.19, 0.61, 0.28, 0.36, 0.46, 0.79, 0.74, 0.58, 0.65, 0.8, 0.05, 0.39, 0.08, 0.45, 0.4, 0.03, 0.41, 0.75, 0.46, 0.49, 0.94, 0.57, 0.38, 0.7, 0.07, 0.91, 0.85, 0.91, 0.72, 0.28, 0.0, 0.55, 0.61, 0.55, 0.81, 0.98, 0.9, 0.36, 0.65, 0.91, 0.26, 0.1, 0.99, 0.48, 0.34, 0.96, 0.68, 0.21, 0.28, 0.37, 0.8, 0.27, 0.87, 0.93, 0.03, 0.95, 0.25, 0.63, 0.2, 0.45, 0.05, 0.7, 0.91, 0.85, 0.56, 0.61, 0.4, 0.35, 0.6, 0.27, 0.08, 0.85, 0.14, 0.82, 0.22, 0.41, 0.85, 0.72, 0.91, 0.5, 0.55, 0.89, 0.39, 0.92, 0.24, 0.07, 0.52, 0.88, 0.01, 0.01, 0.01, 0.31]
```

Now that we have ground truth labels and predicted probabilities, we can use the loss function above to compute the total loss contributed by each class. A very small number (1e-7) is added inside the logarithm to avoid errors from an undefined value, since log(0) is undefined.

```python
# Calculating plain binary cross-entropy loss
# (uses np, truth and probs from the previous snippet)
pos_loss, neg_loss = 0, 0
for i in range(len(truth)):
    # Applying the binary cross-entropy loss function
    if truth[i] == 1:
        pos_loss += -1 * np.log(probs[i] + 1e-7)
    else:
        neg_loss += -1 * np.log(1 - probs[i] + 1e-7)

print("Positive Class Loss: ", round(pos_loss, 2))
print("Negative Class Loss: ", round(neg_loss, 2))
```

Output:

```
Positive Class Loss:  29.08
Negative Class Loss:  83.96
```

As we can see, the total loss differs hugely between the two classes, and the negative class dominates the loss contribution. While minimizing the loss, the algorithm will therefore focus on the negative class, because that is where the loss can be reduced the most. To counter this, we rebalance what the model sees by weighting each class's loss in the total, using the following weighted loss function:

$$L_w(y, p) = -\big[\, W_p \, y \log(p) + W_n \, (1 - y) \log(1 - p) \,\big]$$

Here, Wp and Wn are the weights assigned to the positive and negative class losses respectively, and can be calculated as follows:

Wp = total number of negative (y=0) examples / total examples

Wn = total number of positive (y=1) examples / total examples

Now, let us calculate the weighted loss by adding these weights to the calculation:

```python
# Calculating weighted binary cross-entropy loss
# (uses np, truth and probs from the previous snippets)
pos_loss, neg_loss = 0, 0

# Wp (weight for positive class)
wp = (len(truth) - sum(truth)) / len(truth)

# Wn (weight for negative class)
wn = sum(truth) / len(truth)

for i in range(len(truth)):
    # Applying the same function with class weights
    if truth[i] == 1:
        pos_loss += -wp * np.log(probs[i] + 1e-7)
    else:
        neg_loss += -wn * np.log(1 - probs[i] + 1e-7)

print("Positive Class Loss: ", round(pos_loss, 2))
print("Negative Class Loss: ", round(neg_loss, 2))
```

Output:

```
Positive Class Loss:  21.81
Negative Class Loss:  20.99
```

Amazing, isn’t it? By assigning the right weights, we significantly reduced the difference in loss contribution between the two classes.
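The weighted figures can be checked by hand from the numbers already reported in the experiment (25 positive and 75 negative examples, unweighted class losses of 29.08 and 83.96):

```latex
\begin{aligned}
W_p &= \frac{75}{100} = 0.75, &\qquad W_n &= \frac{25}{100} = 0.25 \\
W_p \times 29.08 &= 21.81, &\qquad W_n \times 83.96 &= 20.99
\end{aligned}
```

Each class weight is the share of the opposite class, which is why the larger class is scaled down more strongly.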

Resampling

This is another technique for countering class imbalance, but it should not be the first one you reach for. Resampling can be done in three ways:

By oversampling the minority class
By undersampling the majority class
By doing both, each by the right amount

Oversampling can be achieved either by randomly sampling the minority class with replacement or by synthetically generating more examples with techniques such as SMOTE. Oversampling only helps up to a point because, beyond a certain amount, you are merely duplicating information already contained in the data. It might give you ideal loss contributions from both classes during training, yet the model can still fail at validation and test time. If, on the other hand, you have a massive amount of data along with the imbalance, you should undersample the majority class without replacement.

Sometimes people use both techniques at once, when there is a moderate amount of data and the class imbalance is not huge. They oversample the minority class and undersample the majority class by calculated amounts to achieve balance.
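The two random resampling strategies can be sketched with the standard library alone. The helper names, the example ids, and the target of 50 examples per class are assumptions for illustration; SMOTE is not shown. Since the test set should keep the real-world distribution, resampling like this would normally be applied to the training split only.

```python
import random


def random_oversample(minority, target_size, seed=0):
    """Grow the minority class by drawing extra copies WITH replacement."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra


def random_undersample(majority, target_size, seed=0):
    """Shrink the majority class by sampling WITHOUT replacement."""
    rng = random.Random(seed)
    return rng.sample(majority, target_size)


# Hypothetical example ids: 25 positive, 75 negative
positives = [f"pos_{i}" for i in range(25)]
negatives = [f"neg_{i}" for i in range(75)]

# Use both techniques and meet in the middle at 50 examples per class
balanced_pos = random_oversample(positives, target_size=50)
balanced_neg = random_undersample(negatives, target_size=50)

print(len(balanced_pos), len(set(balanced_pos)))  # 50 examples, only 25 distinct
print(len(balanced_neg), len(set(balanced_neg)))  # 50 examples, all distinct
```

The oversampled class has 50 entries but still only 25 distinct examples, which is the duplication of information the text warns about.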

Now you understand that when somebody comes to you and says “I have got ~93.23% accuracy”, you should ask about the class proportions in the data and the type of loss function used. Then you should ask whether measuring accuracy alone is the right way to go, or whether there is something more.

The Right Choice of Metric

There is always room for more when you are working on a machine learning model, but you can only know how much more you want when you have something to compare against: a benchmark. Once you have a benchmark, you know how much improvement you need.

But to improve performance, you need to know which metric is the right indicator of performance for the business problem you are trying to solve. Take a tumor detection problem where the objective is to detect whether a tumor is malignant (y=1) or benign (y=0). In the real world, benign cases outnumber malignant ones, so the data you get will have a good amount of class imbalance (unless, of course, you are really lucky). That rules out accuracy as the metric. The next question is what matters more: catching every malignant tumor, or never raising a false alarm on a benign one? This is a business decision, and you should always consult domain experts (in this case, expert doctors) to understand the business problem by asking such questions. If we care most about detecting malignant tumors, and can tolerate a few false positives (Ground Truth: Benign, Model Prediction: Malignant) as long as false negatives (Ground Truth: Malignant, Model Prediction: Benign) are as few as possible, then Recall should be our metric of choice. If it is the other way around (which would never be the case in this particular problem), Precision should be our choice.
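A small sketch shows why accuracy misleads on imbalanced data while recall and precision do not. The two sets of predictions below are invented for illustration (6 malignant and 94 benign cases), and the helper functions are written from the standard definitions rather than taken from the article.

```python
def confusion_counts(truth, preds):
    """Return (tp, fp, fn, tn) for binary labels where 1 = malignant."""
    tp = sum(1 for t, p in zip(truth, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, preds) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(truth, preds) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    # Defined as 0.0 here when the model predicts no positives at all
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical test set: 6 malignant (1) followed by 94 benign (0)
truth = [1] * 6 + [0] * 94
always_benign = [0] * 100                            # never flags a tumor
cautious = [1] * 5 + [0] * 1 + [1] * 10 + [0] * 84   # 5 of 6 caught, 10 false alarms

for name, preds in [("always_benign", always_benign), ("cautious", cautious)]:
    tp, fp, fn, tn = confusion_counts(truth, preds)
    print(name,
          round(accuracy(tp, fp, fn, tn), 2),
          round(precision(tp, fp), 2),
          round(recall(tp, fn), 2))
# always_benign 0.94 0.0 0.0
# cautious 0.89 0.33 0.83
```

The model that never detects a single tumor scores the higher accuracy (0.94 versus 0.89), while recall exposes it immediately.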

Sometimes, in a business setting, you need effective classification on both classes, and hence you would want to optimize the F1 score. To handle the trade-off between precision and recall, we should work on maximizing the area under the precision-recall curve as much as possible.
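The trade-off can be seen by sweeping the decision threshold, which is also how the points of a precision-recall curve are obtained. The labels and scores below are made up for illustration, and `precision_recall_f1` is a hypothetical helper; F1 is computed as the harmonic mean of precision and recall.

```python
def precision_recall_f1(truth, probs, threshold):
    """Precision, recall and F1 when predicting 1 for probs >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(truth, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels and predicted probabilities
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]

for threshold in (0.25, 0.5, 0.75):
    p, r, f1 = precision_recall_f1(truth, probs, threshold)
    print(threshold, round(p, 2), round(r, 2), round(f1, 2))
# 0.25 0.57 1.0 0.73
# 0.5 0.75 0.75 0.75
# 0.75 1.0 0.5 0.67
```

Lowering the threshold buys recall at the cost of precision and vice versa; F1 peaks where the two are balanced.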

Also, results should be reported as confidence intervals, with upper and lower bounds for the metric, computed from experiments over various samples. This gives a fair idea of how the model behaves on the population.
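One common way to obtain such bounds is a percentile bootstrap over the test predictions. The article does not say which method it has in mind, so treat this as an assumed approach; the data, the function `bootstrap_ci`, and the choice of 1,000 rounds are all illustrative. Note that the bootstrap deliberately resamples with replacement, which is a different situation from creating the train, validation, and test splits.

```python
import random


def bootstrap_ci(truth, preds, metric, n_rounds=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric(truth, preds)."""
    rng = random.Random(seed)
    n = len(truth)
    scores = []
    for _ in range(n_rounds):
        # Resample example indices with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([truth[i] for i in idx], [preds[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_rounds)]
    upper = scores[int((1 - alpha / 2) * n_rounds) - 1]
    return lower, upper


def recall(truth, preds):
    tp = sum(1 for t, p in zip(truth, preds) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(truth, preds) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical predictions: 30 positives (24 caught), 170 negatives (10 false alarms)
truth = [1] * 30 + [0] * 170
preds = [1] * 24 + [0] * 6 + [1] * 10 + [0] * 160

point = recall(truth, preds)
lower, upper = bootstrap_ci(truth, preds, recall)
print(f"recall = {point:.2f}, 95% CI ~ [{lower:.2f}, {upper:.2f}]")
```

With only 30 positive examples the interval around the point estimate of 0.80 is wide, which is the kind of information a single number hides.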

To summarize, the following are the major takeaways from this article:

Data distribution is crucial when building a classification model. Always start by getting your test distribution right, then validation, then training, in that order.
Class imbalance should be handled properly to avoid really bad results on live data.
Only when you select the right metric for evaluating your model can you assess its performance correctly. Many factors, ranging from business expertise to the technicalities of the model itself, help decide the right metric.

Thank you for reading the article.


Translated from: https://towardsdatascience.com/building-and-evaluating-classification-ml-models-9c3f45038ef4
