Dealing with Imbalanced Data in Machine Learning

As an ML engineer or data scientist, you will sooner or later find yourself with hundreds of records for one class label and thousands of records for another.

After training your model, you obtain an accuracy above 90%. You then realize that the model is predicting the majority class for every record. Typical examples are fraud detection and churn prediction problems, where most of the records are in the negative class. What do you do in such a scenario? That is the focus of this post.

Collect More Data

The most straightforward thing to do is to collect more data, especially data points for the minority class. This will usually improve the performance of the model. However, it is not always possible. Apart from the cost involved, sometimes it is simply not feasible: in churn prediction and fraud detection, for example, you can't just wait for more incidents to occur so that you can collect more data.

Consider Metrics Other than Accuracy

Accuracy is not a good way to measure the performance of a model when the class labels are imbalanced. In this case, it's prudent to consider other metrics such as precision, recall, and Area Under the Curve (AUC), to mention a few.

Precision is the ratio of true positives to all samples predicted as positive, that is, true positives plus false positives. For example, out of the people our model predicted would churn, how many actually churned?

Recall is the ratio of true positives to the sum of true positives and false negatives. For example, out of the people who actually churned, what percentage did our model predict would churn?

The AUC is obtained from the Receiver Operating Characteristic (ROC) curve. The curve is obtained by plotting the true positive rate against the false positive rate at different classification thresholds. The true positive rate is the same quantity as recall. The false positive rate is obtained by dividing the false positives by the sum of the false positives and the true negatives.

An AUC closer to 1 is better, since it indicates that the model tends to score true positives above negatives; an AUC of 0.5 is no better than random guessing.
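
To make these metrics concrete, here is a minimal sketch using scikit-learn's metric functions. The labels and scores are hypothetical toy values chosen for illustration, not data from this post. Note how a baseline that always predicts the majority class matches the model's accuracy while finding none of the positives.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Hypothetical toy labels: 8 negatives and 2 positives (e.g. churners).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# Hypothetical predicted probabilities of the positive class.
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.1, 0.4, 0.7, 0.45]

# Baseline that always predicts the majority (negative) class.
y_majority = [0] * len(y_true)
# Model predictions at a 0.5 threshold: 1 TP, 1 FP, 1 FN, 7 TN.
y_pred = [int(score >= 0.5) for score in y_score]

baseline_accuracy = accuracy_score(y_true, y_majority)  # 0.8
baseline_recall = recall_score(y_true, y_majority)      # 0.0

model_accuracy = accuracy_score(y_true, y_pred)         # 0.8
model_precision = precision_score(y_true, y_pred)       # 0.5
model_recall = recall_score(y_true, y_pred)             # 0.5
model_auc = roc_auc_score(y_true, y_score)              # 0.9375
```

Both predictors reach 80% accuracy here, so only precision, recall, and AUC reveal the difference between them.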

Emphasize the Minority Class

Another way to deal with imbalanced data is to have your model focus on the minority class. This can be done by computing class weights. Errors on the class with the higher weight are penalized more heavily during training, so the model pays more attention to that class and the minority class is no longer drowned out by the majority. The weights can be computed with the help of scikit-learn.

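    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    # y is the label column; recent scikit-learn versions require keyword arguments here
    weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
    # array([ 0.51722354, 15.01501502])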
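
For intuition about where such numbers come from: the scikit-learn documentation gives the 'balanced' heuristic as n_samples / (n_classes * np.bincount(y)). The sketch below checks that formula against compute_class_weight on a hypothetical toy label array (9 negatives, 1 positive), which is an assumption for illustration and not the data set used in this post.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy labels: 9 samples of class 0 and 1 sample of class 1.
y = np.array([0] * 9 + [1] * 1)
classes = np.unique(y)

# 'balanced' heuristic: n_samples / (n_classes * count of each class).
manual_weights = len(y) / (len(classes) * np.bincount(y))

# The same weights as computed by scikit-learn.
sklearn_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
```

The rarer a class is, the larger its weight: here the single positive sample weighs as much as all nine negatives combined.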

You can then pass these weights when training the model. For example, in the case of logistic regression:

    from sklearn.linear_model import LogisticRegression

    class_weights = {0: 0.51722354, 1: 15.01501502}
    lr = LogisticRegression(C=3.0, fit_intercept=True, warm_start=True, class_weight=class_weights)

Alternatively, you can set class_weight to 'balanced' and the weights will be computed automatically from the class frequencies.

    lr = LogisticRegression(C=3.0, fit_intercept=True, warm_start=True, class_weight='balanced')
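
As a rough illustration of the effect, the sketch below trains logistic regression with and without class_weight='balanced' and compares recall on the minority class. The data set is synthetic (generated with make_classification at an assumed 95/5 class split), so the numbers will not match the results reported in this post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, assumed data: roughly 95% negatives and 5% positives.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# Same model, trained once without and once with balanced class weights.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

# Recall on the minority (positive) class for each model.
recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
```

The weighted model typically trades some precision for higher minority-class recall, so it is worth checking both metrics before settling on it.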

[Figure: ROC curve before the weights are adjusted]

[Figure: ROC curve after the weights have been adjusted]

Adjusting the weights moved the AUC from 0.69 to 0.87.

Try Different Algorithms

As you focus on the right metrics for imbalanced data, you can also try out different algorithms. Tree-based algorithms often perform better on imbalanced data. Furthermore, some algorithms such as LightGBM have hyperparameters that can be set to indicate that the data is not balanced.
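
For LightGBM, the relevant hyperparameters for binary classification are is_unbalance and scale_pos_weight, of which only one should be used at a time. The sketch below computes scale_pos_weight with a common heuristic (number of negatives divided by number of positives); the label array is a hypothetical stand-in, and the heuristic is an assumption rather than a recommendation from this post.

```python
import numpy as np

# Hypothetical training labels: 950 negatives and 50 positives.
y_train = np.array([0] * 950 + [1] * 50)

# Common heuristic: weight positives by the negative-to-positive ratio.
n_negative = int(np.sum(y_train == 0))
n_positive = int(np.sum(y_train == 1))
scale_pos_weight = n_negative / n_positive

# Parameters that could be passed to LightGBM for binary classification.
params = {"objective": "binary", "scale_pos_weight": scale_pos_weight}

# Alternative that lets LightGBM derive the weighting itself:
#   params = {"objective": "binary", "is_unbalance": True}
# With LightGBM installed, these could be used as:
#   import lightgbm as lgb
#   model = lgb.LGBMClassifier(**params).fit(X_train, y_train)
```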

Generate Synthetic Data

You can also generate synthetic data to increase the number of records in the minority class, which is usually known as oversampling. This is usually done on the training set only, after the train/test split, so that the test set still reflects the real class distribution. In Python, this can be done using the imbalanced-learn (imblearn) package. One of the strategies implemented in the package is the Synthetic Minority Over-sampling Technique (SMOTE). The technique is based on k-nearest neighbors.
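
To show what "based on k-nearest neighbors" means, here is a simplified sketch of the core SMOTE idea: pick a minority sample, pick one of its k nearest minority neighbors, and create a new point somewhere on the line segment between them. The function name smote_sketch and the toy data are hypothetical, and this is not the imbalanced-learn implementation, which handles more cases.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_minority, n_synthetic, k_neighbors=5, seed=0):
    """Create synthetic minority samples by interpolating between neighbors."""
    rng = np.random.default_rng(seed)
    # Ask for k_neighbors + 1 because each point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k_neighbors + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    synthetic = np.empty((n_synthetic, X_minority.shape[1]))
    for i in range(n_synthetic):
        base = rng.integers(len(X_minority))           # random minority sample
        neighbor = rng.choice(neighbor_idx[base, 1:])  # one of its k neighbors
        gap = rng.random()                             # position along the segment
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbor] - X_minority[base])
    return synthetic


# Hypothetical minority-class samples with two features.
X_minority = np.random.default_rng(42).normal(size=(20, 2))
X_new = smote_sketch(X_minority, n_synthetic=30, k_neighbors=5)
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region already occupied by the minority class.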

When using SMOTE:

The sampling_strategy parameter, when given as a float, sets the desired ratio of the number of samples in the minority class to the number of samples in the majority class once resampling has been done.

The number of neighbors used to generate the synthetic samples can be specified via the k_neighbors parameter.

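    import pandas as pd
    from imblearn.over_sampling import SMOTE

    # sampling_strategy must be passed by keyword in recent imbalanced-learn releases
    smote = SMOTE(sampling_strategy=0.8)
    X_resampled, y_resampled = smote.fit_resample(X.values, y.values)
    pd.Series(y_resampled).value_counts()
    # 0    9667
    # 1    7733
    # dtype: int64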

You can then fit your model on the resampled data.

    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()
    model.fit(X_resampled, y_resampled)
    predictions = model.predict(X_test)

Undersample the Majority Class

You can also experiment with reducing the number of samples in the majority class. One such strategy is the NearMiss method, which selects majority-class samples based on their distance to minority-class samples. You can specify the ratio just like in SMOTE, as well as the number of neighbors via n_neighbors.
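
As a rough picture of how such a selection can work, the sketch below follows the idea behind the first NearMiss variant: keep the majority samples whose average distance to their nearest minority samples is smallest. The function name nearmiss1_sketch and the toy data are hypothetical, the sketch assumes a binary problem, and it is not the imbalanced-learn implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def nearmiss1_sketch(X, y, n_keep, n_neighbors=3, majority_label=0):
    """Keep the n_keep majority samples closest, on average, to the minority class."""
    X_major = X[y == majority_label]
    X_minor = X[y != majority_label]

    # Distance from each majority sample to its n_neighbors nearest minority samples.
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X_minor)
    distances, _ = nn.kneighbors(X_major)
    mean_distance = distances.mean(axis=1)

    # Majority samples with the smallest average distance are kept.
    keep = np.argsort(mean_distance)[:n_keep]
    X_out = np.vstack([X_major[keep], X_minor])
    y_out = np.concatenate([np.full(n_keep, majority_label), y[y != majority_label]])
    return X_out, y_out


# Hypothetical data: 100 majority samples and 10 minority samples.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(2, 1, size=(10, 2))])
y = np.array([0] * 100 + [1] * 10)

X_res, y_res = nearmiss1_sketch(X, y, n_keep=20)
counts = np.bincount(y_res)  # 20 majority samples kept, all 10 minority samples kept
```

Undersampling discards real data, so it tends to suit cases where the majority class is large enough that losing samples is affordable.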

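    from imblearn.under_sampling import NearMiss

    # recent imbalanced-learn releases take sampling_strategy by keyword and no longer accept random_state
    under_sample = NearMiss(sampling_strategy=0.3)
    # the resampling call is not shown in the original snippet; assumed to mirror the SMOTE example
    X_resampled, y_resampled = under_sample.fit_resample(X.values, y.values)
    pd.Series(y_resampled).value_counts()
    # 0    1110
    # 1     333
    # dtype: int64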

Final Thoughts

Other techniques that can be used include building an ensemble of weak learners to create a strong classifier. Metrics such as the precision-recall curve and the area under it (PR AUC) are also worth trying when the positive class is the most important.
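
One common way to summarize the precision-recall curve in a single number is average precision, available in scikit-learn as average_precision_score. The sketch below computes it next to the ROC AUC on hypothetical toy labels and scores; the values are for illustration only.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical toy labels and predicted scores for the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Average precision summarizes the precision-recall curve.
pr_auc = average_precision_score(y_true, y_score)  # about 0.83
# ROC AUC on the same predictions, for comparison.
roc_auc = roc_auc_score(y_true, y_score)            # 0.75
```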

As always, you should experiment with different techniques and settle on the ones that give you the best results for your specific problems. Hopefully, this piece has given some insights on how to get started.

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to exploring the emerging intersection of mobile app development and machine learning. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Fritz AI, the machine learning platform that helps developers teach devices to see, hear, sense, and think. We pay our contributors, and we don’t sell ads.

Translated from: https://heartbeat.fritz.ai/dealing-with-imbalanced-data-in-machine-learning-18e45fea7bb5
