Sampling Imbalanced Data

Imbalanced data is a case where the classes in a classification dataset have a skewed proportion. As an example, I will use the churn dataset from Kaggle throughout this article.

Image created by the Author

We can see that the Yes class is skewed compared to the No class. If we calculate the proportion, the Yes class makes up around 20.4% of the whole dataset. But how do you classify how imbalanced the data is? The table below might help you.

Image created by the author

There are three degrees of imbalance: Mild, Moderate, and Extreme, depending on the minority class's proportion of the whole dataset. In our example above, we only have a Mild case of imbalanced data.

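To double-check where our dataset falls, a quick proportion check is enough. A minimal sketch, assuming the churn data is already loaded into a DataFrame called df with the 'Exited' target column used later in this article:

import pandas as pd

#Share of each class; a minority share around 0.204 puts us in the Mild range
print(df['Exited'].value_counts(normalize = True))
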
Now, why do we need to care about imbalanced data when creating a machine learning model? An imbalanced class creates a bias: the machine learning model tends to predict the majority class. You don't want the prediction model to ignore the minority class, right?

That is why there are techniques to overcome the imbalance problem: Undersampling and Oversampling. What is the difference between these two techniques?

Undersampling decreases the number of majority-class samples until it is similar to that of the minority class, while Oversampling resamples the minority class until its proportion matches that of the majority class.

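For the plain versions of both ideas, imblearn ships ready-made random resamplers. A minimal sketch (X and y here stand for hypothetical feature and target arrays):

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

#Undersampling: randomly drop majority rows until the classes are balanced
X_under, y_under = RandomUnderSampler(random_state = 101).fit_resample(X, y)

#Oversampling: randomly duplicate minority rows until the classes are balanced
X_over, y_over = RandomOverSampler(random_state = 101).fit_resample(X, y)
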
In this article, I will only cover a specific Oversampling technique called SMOTE, along with several of its variations.

Just a little note: I am a Data Scientist who believes in leaving the proportion as it is, because it represents the data. It is better to try feature engineering before you jump into these techniques.

SMOTE

So, what is SMOTE? SMOTE, or Synthetic Minority Oversampling Technique, is an oversampling technique, but it works differently than typical oversampling.

In a classic oversampling technique, the minority data is duplicated from the minority data population. While it increases the amount of data, it does not give the machine learning model any new information or variation.

For the reason above, Nitesh Chawla, et al. (2002) introduced a new technique for creating synthetic data for oversampling purposes in their SMOTE paper.

SMOTE works by utilizing a k-nearest neighbors algorithm to create synthetic data. SMOTE starts by choosing a random sample from the minority class, then finds its k nearest minority neighbors. A synthetic data point is then created between the random sample and one of its randomly selected k nearest neighbors. Let me show you the example below.

Image created by the author

The procedure is repeated enough times until the minority class has the same proportion as the majority class.

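To make the mechanics concrete, here is a minimal sketch of the interpolation step, written for illustration only (it is not the imblearn implementation):

import numpy as np

rng = np.random.default_rng(101)

def smote_interpolate(x, neighbor):
    #The synthetic point lies somewhere on the segment between x and the neighbor
    gap = rng.random()
    return x + gap * (neighbor - x)

x = np.array([600.0, 40.0])         #a minority sample, e.g. (CreditScore, Age)
neighbor = np.array([650.0, 35.0])  #one of its k nearest minority neighbors
print(smote_interpolate(x, neighbor))
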
I omit a more in-depth explanation because the passage above already summarizes how SMOTE works. In this article, I want to focus on SMOTE and its variations, as well as when to use them, without touching much on the theory. If you want to know more, I attach the link to the paper for each variation I mention here.

As preparation, I will use the imblearn package, which includes SMOTE and its variations.

#Installing imblearn
pip install -U imbalanced-learn

1. SMOTE

We will start by using SMOTE in its default form, with the same churn dataset as above. Let's first prepare the data to try out SMOTE.

As you might realize from my explanation above, SMOTE synthesizes data for classification problems where the features are continuous. For that reason, in this section we will only use two continuous features along with the classification target.

import pandas as pd
import seaborn as sns

#I read the csv churn data into a variable called df.
#Here I only use two continuous features, CreditScore and Age, with the target Exited
df_example = df[['CreditScore', 'Age', 'Exited']]

sns.scatterplot(data = df, x = 'CreditScore', y = 'Age', hue = 'Exited')
Image created by the Author

As we can see in the scatter plot of the 'CreditScore' and 'Age' features above, the 0 and 1 classes are mixed together.

Let's try to oversample the data using the SMOTE technique.

#Importing SMOTE
from imblearn.over_sampling import SMOTE

#Oversampling the data
smote = SMOTE(random_state = 101)
X, y = smote.fit_resample(df[['CreditScore', 'Age']], df['Exited'])

#Creating a new oversampled data frame
df_oversampler = pd.DataFrame(X, columns = ['CreditScore', 'Age'])
df_oversampler['Exited'] = y

sns.countplot(df_oversampler['Exited'])
Image created by the Author

As we can see in the graph above, classes 0 and 1 now have a similar proportion. Let's see how it looks if we create a scatter plot like before.

sns.scatterplot(data = df_oversampler, x ='CreditScore', y = 'Age', hue = 'Exited')
Image created by the Author

The oversampled data now fills the previously empty areas with synthetic points.

The purpose of oversampling is, just as I stated before, to have a better prediction model. As a reminder, this technique was not created for analysis purposes, since every data point it creates is synthetic.

For the reason above, we need to evaluate whether the oversampled data leads to a better model. Let's start by splitting the data to create the prediction model.

# Importing the splitter, classification model, and the metric
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

#Splitting the data with stratification
X_train, X_test, y_train, y_test = train_test_split(df_example[['CreditScore', 'Age']], df['Exited'],
                                                    test_size = 0.2, stratify = df['Exited'], random_state = 101)

Additionally, you should only oversample your training data, not the whole dataset (unless you intend to use the entire dataset as your training data). If you want to split the data, split it first, and only then oversample the training data.

#Create an oversampled training data
smote = SMOTE(random_state = 101)
X_oversample, y_oversample = smote.fit_resample(X_train, y_train)

Now that we have both the imbalanced data and the oversampled data, let's try to create a classification model with each of them. First, let's see the performance of the Logistic Regression model trained with the imbalanced data.

#Training with imbalance data
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
print(classification_report(y_test, classifier.predict(X_test)))
Image created by the Author

As we can see from the metrics, our Logistic Regression model trained with the imbalanced data tends to predict class 0 rather than class 1; the bias is built into our model.

Let's see the result of the model trained with the oversampled data.

#Training with oversampled data
classifier_o = LogisticRegression()
classifier_o.fit(X_oversample, y_oversample)
print(classification_report(y_test, classifier_o.predict(X_test)))
Image created by the Author

This time, the model does better at predicting class 1; we could say that the oversampled data helps our Logistic Regression model predict class 1 better.

I could say that the oversampled data improves the Logistic Regression model for prediction purposes, although what counts as an 'improvement' is once again up to the user.

2. SMOTE-NC

I have mentioned that SMOTE only works for continuous features. So, what should you do if you have mixed (categorical and continuous) features? In that case, we have another variation of SMOTE called SMOTE-NC (Nominal and Continuous).

You might think we could simply transform the categorical data into numbers so that SMOTE has numerical features to use. The problem is that doing so would give us synthetic data that does not make any sense.

For example, the churn data above has the categorical feature 'IsActiveMember', whose values are either 0 or 1. If we oversampled this feature with SMOTE, we could end up with values such as 0.67 or 0.5, which do not make sense at all.

This is why we need to use SMOTE-NC when we have mixed data. The premise is simple: we declare which features are categorical, and for those features SMOTE-NC resamples the existing category values instead of creating synthetic ones.

Let's try applying SMOTE-NC. This time, I will select a different pair of features as an example (one categorical, one continuous).

df_example = df[['CreditScore', 'IsActiveMember', 'Exited']]

In this case, ‘CreditScore’ is the continuous feature, and ‘IsActiveMember’ is the categorical feature. Then, let’s split the data just like before.

X_train, X_test, y_train, y_test = train_test_split(df_example[['CreditScore', 'IsActiveMember']], df['Exited'],
                                                    test_size = 0.2, stratify = df['Exited'], random_state = 101)

Then, let's create two different classification models once more: one trained with the imbalanced data and one with the oversampled data. First, let's use SMOTE-NC to oversample the data.

#Import the SMOTE-NC
from imblearn.over_sampling import SMOTENC

#Create the oversampler. For SMOTE-NC we need to pinpoint the column position(s) of
#the categorical features. Here 'IsActiveMember' is in the second column, so we pass
#[1] as the parameter. If you have more than one categorical column, pass all of
#their positions.
smotenc = SMOTENC([1], random_state = 101)
X_oversample, y_oversample = smotenc.fit_resample(X_train, y_train)

With the data ready, let’s try to create the classifiers.

#Classifier with imbalance data
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
print(classification_report(y_test, classifier.predict(X_test)))
Image created by the Author

With the imbalanced data, we can see the classifier favors class 0 and ignores class 1 completely. So, how about training it with the SMOTE-NC oversampled data?

#Classifier with SMOTE-NC
classifier_o = LogisticRegression()
classifier_o.fit(X_oversample, y_oversample)
print(classification_report(y_test, classifier_o.predict(X_test)))
Image created by the Author

Just like with SMOTE, the classifier trained on SMOTE-NC oversampled data gives the machine learning model a new perspective for predicting the imbalanced data. The result isn't necessarily the best, but it is better than with the imbalanced data.

3. Borderline-SMOTE

Borderline-SMOTE is a variation of SMOTE. As the name implies, it has something to do with the border.

Unlike plain SMOTE, where the synthetic data is created at random between two data points, Borderline-SMOTE only creates synthetic data along the decision boundary between the two classes.

Also, there are two kinds of Borderline-SMOTE: Borderline-SMOTE1 and Borderline-SMOTE2. The difference is simple: Borderline-SMOTE1 also oversamples the majority class where majority data points cause misclassification near the decision boundary, while Borderline-SMOTE2 oversamples only the minority class.

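Both kinds are exposed in imblearn through the kind parameter of BorderlineSMOTE; a quick illustration:

from imblearn.over_sampling import BorderlineSMOTE

bsmote1 = BorderlineSMOTE(kind = 'borderline-1', random_state = 101)
bsmote2 = BorderlineSMOTE(kind = 'borderline-2', random_state = 101)
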
Let's try Borderline-SMOTE with our previous data. I will once more use only the numerical features.

df_example = df[['CreditScore', 'Age', 'Exited']]
Image created by the Author

The picture above shows the difference between oversampling with SMOTE and with Borderline-SMOTE1. The results might look similar at a glance, but we can see differences in where the synthetic data is created.

How about the performance of the machine learning model? Let's try it. First, as usual, we split the data.

X_train, X_test, y_train, y_test = train_test_split(df_example[['CreditScore', 'Age']], df['Exited'],
                                                    test_size = 0.2, stratify = df['Exited'], random_state = 101)

Then, we create the oversampled data by using Borderline-SMOTE.

#By default, BorderlineSMOTE uses Borderline-SMOTE1
from imblearn.over_sampling import BorderlineSMOTE

bsmote = BorderlineSMOTE(random_state = 101, kind = 'borderline-1')
X_oversample_borderline, y_oversample_borderline = bsmote.fit_resample(X_train, y_train)

Lastly, let’s check the machine learning performance with the Borderline-SMOTE oversampled data.

classifier_border = LogisticRegression()
classifier_border.fit(X_oversample_borderline, y_oversample_borderline)
print(classification_report(y_test, classifier_border.predict(X_test)))
Image created by the Author

The performance doesn't differ much from the model trained with the SMOTE oversampled data. This suggests we should focus on the features instead of how we oversample the data.

Borderline-SMOTE works best when we know that misclassification often happens near the decision boundary. Otherwise, we can stay with the usual SMOTE. If you want to read more about Borderline-SMOTE, you can check the paper here.

4. Borderline-SMOTE SVM

Another variation of Borderline-SMOTE is Borderline-SMOTE SVM, or we could just call it SVM-SMOTE.

The main difference between SVM-SMOTE and the other SMOTE variants is that instead of using k-nearest neighbors to identify misclassification, as Borderline-SMOTE does, the technique incorporates the SVM algorithm.

In SVM-SMOTE, the borderline area is approximated by the support vectors obtained after training an SVM classifier on the original training set. Synthetic data is then randomly created along the lines joining each minority-class support vector with a number of its nearest neighbors.

What is special about Borderline-SMOTE SVM compared to Borderline-SMOTE is that more data is synthesized away from the region of class overlap; it focuses more on where the classes are separated.

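As far as I can tell from the imblearn API, the SVM used to approximate the borderline area can also be swapped out via the svm_estimator parameter. A hedged sketch:

from sklearn.svm import SVC
from imblearn.over_sampling import SVMSMOTE

#The linear kernel here is my own choice for illustration, not a recommendation
svmsmote_custom = SVMSMOTE(svm_estimator = SVC(kernel = 'linear'), random_state = 101)
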
Just like before, let's try to use the technique in model creation. I will still use the same training data as in the Borderline-SMOTE example.

from imblearn.over_sampling import SVMSMOTE

svmsmote = SVMSMOTE(random_state = 101)
X_oversample_svm, y_oversample_svm = svmsmote.fit_resample(X_train, y_train)

classifier_svm = LogisticRegression()
classifier_svm.fit(X_oversample_svm, y_oversample_svm)
print(classification_report(y_test, classifier_svm.predict(X_test)))
Image created by the Author

Once more, the performance does not differ much, although I could say that this time the model slightly favors class 0 a bit more than with the other techniques.

Once again, it depends on you, on what your prediction model targets, and on the business affected by it. If you want to read more about Borderline-SMOTE SVM, you can check the paper here.

5. Adaptive Synthetic Sampling (ADASYN)

ADASYN is another variation of SMOTE, but it takes a different approach than Borderline-SMOTE. While Borderline-SMOTE tries to synthesize data near the decision boundary, ADASYN creates synthetic data according to the density of the data.

The amount of synthetic data generated is inversely proportional to the density of the minority class. This means more synthetic data is created in regions of the feature space where the density of minority examples is low, and fewer or none where the density is high.

In simpler terms: in areas where the minority class is less dense, more synthetic data is created; where it is dense, less is made.

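Based on my reading of the ADASYN paper, the weighting step looks roughly like the sketch below; it is illustrative only, not the imblearn implementation, and assumes X and y are NumPy arrays:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_weights(X_minority, X, y, k = 5, minority_label = 1):
    #Find the k nearest neighbors of each minority sample in the full data
    #(k + 1 because each point is its own nearest neighbor)
    nn = NearestNeighbors(n_neighbors = k + 1).fit(X)
    _, idx = nn.kneighbors(X_minority)
    idx = idx[:, 1:]  #drop the point itself
    #r_i: fraction of majority samples among the k neighbors of minority sample i
    r = np.array([(y[i] != minority_label).mean() for i in idx])
    #Normalized weights decide how many synthetic points each minority sample
    #gets: sparse, majority-dominated regions get more
    return r / r.sum()
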
Let's see the performance when using ADASYN. I will still use the same training data as in the Borderline-SMOTE example.

from imblearn.over_sampling import ADASYN

adasyn = ADASYN(random_state = 101)
X_oversample_ada, y_oversample_ada = adasyn.fit_resample(X_train, y_train)

classifier_ada = LogisticRegression()
classifier_ada.fit(X_oversample_ada, y_oversample_ada)
print(classification_report(y_test, classifier_ada.predict(X_test)))
Image created by the Author

As we can see from the model performance above, the result is slightly worse than when we use the other SMOTE methods.

The problem might lie in the outliers. As I stated before, ADASYN focuses on regions where the minority density is low, and often the low-density data points are outliers. ADASYN would then put too much attention on these areas of the feature space, which may result in worse model performance. It might be better to remove the outliers before using ADASYN.

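As a hedged example of what that could look like, here is a simple IQR-based filter applied to the training data before ADASYN (a common heuristic, not something from the original article):

import numpy as np

def iqr_mask(X, factor = 1.5):
    #Keep rows where every feature lies within [Q1 - factor*IQR, Q3 + factor*IQR]
    q1, q3 = np.percentile(X, [25, 75], axis = 0)
    iqr = q3 - q1
    return np.all((X >= q1 - factor * iqr) & (X <= q3 + factor * iqr), axis = 1)

mask = iqr_mask(X_train.to_numpy())
X_clean, y_clean = X_train[mask], y_train[mask]
X_oversample_ada, y_oversample_ada = ADASYN(random_state = 101).fit_resample(X_clean, y_clean)
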
If you want to read more about ADASYN, you could check the paper here.

Conclusion

Imbalanced data is a problem when creating a predictive machine learning model. One way to alleviate this problem is by oversampling the minority data.

Instead of oversampling by replicating the data, we can oversample by creating synthetic data using the SMOTE technique. There are several variations of SMOTE, including:

  1. SMOTE
  2. SMOTE-NC
  3. Borderline-SMOTE
  4. SVM-SMOTE
  5. ADASYN

I hope it helps!

Translated from: https://towardsdatascience.com/5-smote-techniques-for-oversampling-your-imbalance-data-b8155bdbe2b5
