Undersampling and Oversampling

Introduction

The imbalanced classification problem is what we face when there is a severe skew in the class distribution of our training data. The skew does not have to be extreme (it can vary), but we treat imbalanced classification as a problem because it can degrade the performance of our machine learning algorithms.

One way the imbalance may affect our machine learning algorithm is by causing it to ignore the minority class completely. This is an issue because the minority class is often the class we are most interested in. For instance, when building a classifier to separate fraudulent from non-fraudulent transactions, the data is likely to contain far more non-fraudulent transactions than fraudulent ones. Think about it: it would be very worrying if we had as many fraudulent transactions as legitimate ones.
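To see why a model that ignores the minority class can still look good on paper, here is a minimal, dependency-free sketch. The "classifier" is hypothetical (it simply predicts the majority class every time), and the class counts are borrowed from the example dataset generated later in this article.

```python
# Class counts borrowed from the example dataset used later in this article.
n_majority, n_minority = 9844, 156
y_true = [0] * n_majority + [1] * n_minority

# Hypothetical "classifier" that ignores the minority class entirely.
y_pred = [0] * len(y_true)

# Accuracy: fraction of all predictions that are correct.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the minority class: fraction of true minority rows that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_minority

print(f"accuracy: {accuracy:.4f}")
print(f"minority recall: {recall:.4f}")
```

Accuracy comes out above 98% even though not a single fraudulent transaction is caught, which is why accuracy alone can hide the problem.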

Figure 1: Example of class distribution for a fraud detection problem

One approach to combat this challenge is random resampling. There are two main ways to perform random resampling, both of which have their pros and cons:

Oversampling: duplicating samples from the minority class.

Undersampling: deleting samples from the majority class.

In other words, both oversampling and undersampling involve introducing a bias to select more samples from one class than from another, to compensate for an imbalance that is either already present in the data or likely to develop if a purely random sample were taken (Source: Wikipedia).

We describe random resampling as a naive technique because it assumes nothing about the data. It involves creating a new, transformed version of our data with a different class distribution, which reduces the influence of the skew on our machine learning algorithm.

Note: We refer to random resampling as naive because it makes no assumptions about the data.

In this article we will be leveraging the imbalanced-learn framework, which was started in 2014 with a main focus on implementing SMOTE (another technique for imbalanced data). Over the years, additional oversampling and undersampling methods have been implemented, and the framework has been made compatible with the popular machine learning framework scikit-learn. Visit Imbalanced-Learn for installation guides and the full documentation.

    from collections import Counter

    from sklearn.datasets import make_classification
    from imblearn.over_sampling import RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler

    # defining the dataset
    X, y = make_classification(n_samples=10000, weights=[.99])

    # class distribution
    print(Counter(y))
    # prints, for example: Counter({0: 9844, 1: 156})

For the full code, you may visit my GitHub.

Random Oversampling

Random oversampling selects random examples from the minority class with replacement and adds them to the training data as extra copies. Because the selection is with replacement, a single instance may be selected multiple times.

“the random oversampling may increase the likelihood of overfitting occurring, since it makes exact copies of the minority class examples. In this way, a symbolic classifier, for instance, might construct rules that are apparently accurate, but actually cover one replicated example.” (Page 83, Learning from Imbalanced Data Sets, 2018)

For machine learning algorithms that are affected by a skewed distribution, such as artificial neural networks and SVMs, this can be a highly effective technique. However, tuning the target class distribution is advised in many scenarios, because seeking a fully balanced distribution for a severely imbalanced dataset can lead the algorithm to overfit the minority class, which in turn increases our generalization error.

Another thing we ought to be aware of is the increased computational cost. Increasing the number of examples in the minority class (especially for a severely skewed dataset) increases the cost of training our model, and since the model is only seeing the same examples multiple times, that extra cost buys no new information.

Nonetheless, oversampling is a pretty decent solution and should be tested. Here is how we can implement it in Python…

    # instantiating the random over sampler
    ros = RandomOverSampler()

    # resampling X, y
    X_ros, y_ros = ros.fit_resample(X, y)

    # new class distribution
    print(Counter(y_ros))
    # prints: Counter({0: 9844, 1: 9844})
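To make concrete what random oversampling does to the data, here is a dependency-free sketch of the idea. It is not imbalanced-learn's implementation: the function name random_oversample, the seed argument, and the toy data are all made up for illustration.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        indices = [i for i, lab in enumerate(y) if lab == label]
        # rng.choices samples WITH replacement, so a row can be picked repeatedly.
        for i in rng.choices(indices, k=target - count):
            X_out.append(X[i])
            y_out.append(y[i])
    return X_out, y_out


# Toy data (hypothetical): 8 majority rows and 2 minority rows.
X_toy = [[i] for i in range(10)]
y_toy = [0] * 8 + [1] * 2

X_ros_toy, y_ros_toy = random_oversample(X_toy, y_toy)
print(Counter(y_ros_toy))  # Counter({0: 8, 1: 8})
```

Every appended row is an exact copy of one of the two original minority rows, which is the source of the overfitting risk described in the quote above.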

Random Undersampling

Random undersampling is the opposite of random oversampling. This method randomly selects and removes samples from the majority class, reducing the number of majority class examples in the transformed data.

“In random under-sampling (potentially), vast quantities of data are discarded. […] This can be highly problematic, as the loss of such data can make the decision boundary between the minority and majority instances harder to learn, resulting in a loss in classification performance.” — Page 45, Imbalanced Learning: Foundations, Algorithms and Applications, 2013

The result of undersampling is a transformed dataset with fewer examples in the majority class. This process may be repeated until the number of examples in each class is equal.

This approach is effective when the minority class has a sufficient number of examples despite the severe imbalance. On the other hand, we should always consider that valuable information may be deleted, since we remove examples at random and have no way to detect or preserve the information-rich examples in the majority class.

To better understand this method, here is a Python implementation…

    # instantiating the random undersampler
    rus = RandomUnderSampler()

    # resampling X, y
    X_rus, y_rus = rus.fit_resample(X, y)

    # new class distribution
    print(Counter(y_rus))
    # prints: Counter({0: 156, 1: 156})
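As a rough illustration of what random undersampling does to the data, here is a dependency-free sketch. It is not imbalanced-learn's implementation: the function name random_undersample, the seed argument, and the toy data are assumptions made for this example.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop rows until every class matches the smallest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        indices = [i for i, lab in enumerate(y) if lab == label]
        # rng.sample draws WITHOUT replacement, so no row is kept twice.
        keep.extend(rng.sample(indices, target))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (hypothetical): 8 majority rows and 2 minority rows.
X_small = [[i] for i in range(10)]
y_small = [0] * 8 + [1] * 2

X_rus_small, y_rus_small = random_undersample(X_small, y_small)
print(Counter(y_rus_small))  # Counter({0: 2, 1: 2})
```

Six of the eight majority rows are thrown away without looking at their contents, which is exactly how information-rich examples can be lost.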

Combining Both Random Sampling Techniques

Combining both random sampling methods can occasionally result in better overall performance than applying either method in isolation.

The idea is that we can apply a modest amount of oversampling to the minority class, which increases the bias toward minority class examples, while also applying a modest amount of undersampling to the majority class, which reduces the bias toward majority class examples.

To implement this in Python with the imbalanced-learn framework, we may use the sampling_strategy parameter of our oversampling and undersampling techniques.

    # instantiating over and under sampler
    over = RandomOverSampler(sampling_strategy=0.5)
    under = RandomUnderSampler(sampling_strategy=0.8)

    # first performing oversampling to minority class
    X_over, y_over = over.fit_resample(X, y)
    print(f"Oversampled: {Counter(y_over)}")
    # prints: Oversampled: Counter({0: 9844, 1: 4922})

    # now to combine under sampling
    X_combined_sampling, y_combined_sampling = under.fit_resample(X_over, y_over)
    print(f"Combined Random Sampling: {Counter(y_combined_sampling)}")
    # prints: Combined Random Sampling: Counter({0: 6152, 1: 4922})
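The counts in the output above follow from reading a float sampling_strategy as the desired minority-to-majority ratio after resampling. The arithmetic can be sketched as below; the helper combined_counts is hypothetical, and the truncation to an integer is an assumption chosen to match the printed output rather than a statement about the library's internals.

```python
def combined_counts(n_majority, n_minority, over_ratio, under_ratio):
    """Estimate class sizes after oversampling, then undersampling.

    Assumes both ratios are valid, i.e. oversampling only adds minority rows
    and undersampling only removes majority rows.
    """
    # Oversampling: grow the minority class to over_ratio * majority size.
    n_minority_over = int(over_ratio * n_majority)
    # Undersampling: shrink the majority so minority / majority is about under_ratio.
    n_majority_under = int(n_minority_over / under_ratio)
    return n_majority_under, n_minority_over


# Class counts from the dataset used in this article.
result = combined_counts(9844, 156, over_ratio=0.5, under_ratio=0.8)
print(result)  # (6152, 4922)
```

With a ratio of 0.5 the minority class grows from 156 to 4922 rows (half of 9844), and with a ratio of 0.8 the majority class shrinks to 6152 rows, so that 4922 / 6152 is roughly 0.8.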

Wrap Up

In this guide we discussed oversampling and undersampling for imbalanced classification. There are various occasions where we may be confronted with an imbalanced dataset, and applying random resampling may help us overcome this problem in training while still ending up with a model that generalizes well to new examples.

Let’s continue the conversation on LinkedIn…

Translated from: https://towardsdatascience.com/oversampling-and-undersampling-5e2bbaf56dcf
