Data Leakage with Hyper-Parameter Tuning

Introduction

Data Leakage is when the model somehow knows the patterns in the test data during its training phase. In other words, the data that you are using to train your ML algorithm happens to have the information you are trying to predict.

Data leakage prevents the model from generalizing well, and it is often very difficult for a data scientist to identify. Some common causes of data leakage are:

Treating outliers and missing values with central values before splitting
Scaling the data before splitting it into training and testing sets
Training your model with both train and test data
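
The first two causes share one remedy: split first, then learn any preprocessing statistics from the training rows only. The following is a minimal sketch of that ordering. The toy data and the variable names are invented for illustration and are not from the original article.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration only
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

# Split first ...
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# ... then learn the scaling statistics from the training rows only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training mean and std

# Leaky alternative, shown only for contrast: statistics come from all rows,
# so the test rows influence how the training rows are scaled
leaky_scaler = StandardScaler().fit(X)
```

The same ordering applies to imputation: an imputer's fill value would be learned from the training rows and then reused on the test rows.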

Hyper-parameter tuning is the process of finding the set of hyper-parameters of an ML algorithm that delivers the best performance.

For more on hyper-parameters and tuning techniques, refer to my previous article.

Most hyper-parameter tuning techniques use cross-validation to select the best set of hyper-parameters. Cross-validation splits the data into train and test sets, builds different models with different sets of hyper-parameters on the train set, and validates their performance on the test set. Eventually, it selects the combination of parameters that gives the highest performance.
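
To make the splitting concrete, here is a small sketch of how 3-fold cross-validation rotates the held-out group. It uses plain KFold on six invented samples; note that for classifiers GridSearchCV uses stratified folds by default when cv is an integer, so the exact row assignment there will differ.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(12).reshape(6, 2)  # six toy samples, invented for illustration

kfold = KFold(n_splits=3)
folds = list(kfold.split(X_demo))

# Each round trains on two groups and holds out the third
for round_no, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"round {round_no}: train rows {train_idx}, held-out rows {val_idx}")
```

Every row is held out exactly once across the three rounds.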

But when we perform preprocessing steps such as scaling or imputing on the whole training set and then use tuning techniques that rely on cross-validation, such as Grid Search or Random Search, we cause data leakage.

Let's understand this in more detail with code.

# import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

X = load_breast_cancer().data
y = load_breast_cancer().target

# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

# initialize StandardScaler and transform the data
std = StandardScaler()
X_train_scale = std.fit_transform(X_train)
X_test_scale = std.transform(X_test)

# defining parameters grid
grid_params = {'C': [0.01, 0.1, 1.0, 10], 'gamma': [0.1, 0.01, 10]}

grid = GridSearchCV(SVC(), grid_params, cv=3)
grid.fit(X_train_scale, y_train)

# best set of parameters
grid.best_params_

# best score with best set of parameters
grid.best_score_

In the above code, I first scaled the train data with StandardScaler and then fit GridSearchCV with a Support Vector Classifier as the estimator and cv = 3, i.e. 3-fold cross-validation. With 3-fold cross-validation, the train data is split into 3 groups; in each round, 2 groups act as the train set and the remaining group acts as the test set. Each model with a unique set of parameters is trained on the train set and evaluated on the test set; the score is recorded and the model is discarded. This process repeats until each of the 3 groups has served as the test set once.

Here the model has been influenced by the data in the test set while training (data leakage). The data has leaked through the StandardScaler operation: StandardScaler computes a z-score for each value using the mean and standard deviation.
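
As a quick check of what the z-score involves, the sketch below computes it by hand on four invented values and compares the result with StandardScaler. The point is that the mean and standard deviation are learned from whatever rows are passed to the fit step.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([[1.0], [2.0], [3.0], [4.0]])  # invented toy column

# z = (x - mean) / std, with the population standard deviation (ddof=0)
mean = values.mean(axis=0)
std = values.std(axis=0)
manual_z = (values - mean) / std

# StandardScaler learns the same mean and std in fit, then applies them
sklearn_z = StandardScaler().fit_transform(values)
matches = np.allclose(manual_z, sklearn_z)
```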

The train data was scaled before Grid Search (the mean and S.D. used by StandardScaler were calculated on the whole train data), and only then did cross-validation split the train data into train and test groups. The train groups used to fit the algorithm therefore already carry information about the test group, i.e. the mean and S.D. of the train data, which is a combination of the train and test groups. This is how data leakage happens with hyper-parameter tuning.
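
One way to see the leak is to compare the mean that was computed up front with the mean a single cross-validation round would have computed from its own train groups. This is only a sketch: it assumes stratified 3-fold splitting, which is what GridSearchCV uses by default for a classifier with an integer cv, and the variable names are invented for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

# Statistics used when scaling happens once, before the grid search
mean_all_train = StandardScaler().fit(X_train).mean_

# Statistics the first CV round would learn from its own train groups only
train_idx, val_idx = next(StratifiedKFold(n_splits=3).split(X_train, y_train))
mean_fold_train = StandardScaler().fit(X_train[train_idx]).mean_

# Any gap comes from the held-out rows that were included in the up-front mean
gap = np.abs(mean_all_train - mean_fold_train)
print("largest per-feature gap in the mean:", gap.max())
```

The gap may be small on a dataset like this one, but it shows that the held-out rows took part in the preprocessing of the rows the model was trained on.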

Solution

The solution for data leakage in such cases is a Pipeline.

A pipeline helps automate machine learning workflows such as scaling, dimensionality reduction, model fitting, and validation. It takes multiple steps in a machine learning process and combines them into a single object, which makes it easier to develop and use, as well as to save and reuse later.
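
A short sketch of what that single object looks like in scikit-learn: make_pipeline names each step after its lowercased class name, and a step's parameters are then addressed as step name, two underscores, parameter name. The specific parameter values here are arbitrary and chosen only for illustration.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = make_pipeline(StandardScaler(), SVC())

# Step names are generated from the class names
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['standardscaler', 'svc']

# Parameters of a step are addressed as <step name>__<parameter>
pipe.set_params(svc__C=1.0, svc__gamma=0.1)
```

This naming matters for tuning: a parameter grid passed to a search over a pipeline would use keys such as svc__C rather than C.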

Let's understand how pipeline solves data leakage with code.

# import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

X = load_breast_cancer().data
y = load_breast_cancer().target

# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

# creating pipeline variable
make_pipe = make_pipeline(StandardScaler(), SVC())

# defining parameters grid ('svc' is the step name make_pipeline gives to SVC)
grid_params = {'svc__C': [0.01, 0.1, 1.0, 10], 'svc__gamma': [0.1, 0.01, 10]}
grid_ = GridSearchCV(make_pipe, grid_params, cv=3)
grid_.fit(X_train, y_train)

# best set of parameters
grid_.best_params_

# best score with best set of parameters
grid_.best_score_

In the above code, I used StandardScaler and SVC with pipeline and then passed pipeline object to Grid Search with cv = 3.

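In this situation, the data is first split into 3 groups (cv = 3); in each round, 2 groups are used for training and 1 group for testing. The scaler is fit on the train groups only and then applied to the test group, so the test group never influences the scaling statistics. So, there is no data leakage here.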
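
Written out by hand, one parameter combination evaluated this way looks roughly like the loop below. It is a sketch of the idea rather than of GridSearchCV's internals: it assumes stratified 3-fold splitting and picks C=1.0 and gamma=0.01 arbitrarily for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

fold_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=3).split(X_train, y_train):
    # Fit the scaler on this round's train groups only
    scaler = StandardScaler().fit(X_train[train_idx])

    # Train on the scaled train groups
    model = SVC(C=1.0, gamma=0.01)
    model.fit(scaler.transform(X_train[train_idx]), y_train[train_idx])

    # The held-out group is only transformed, never used to fit the scaler
    X_val_scaled = scaler.transform(X_train[val_idx])
    fold_scores.append(model.score(X_val_scaled, y_train[val_idx]))

mean_score = float(np.mean(fold_scores))
print("mean cross-validated accuracy:", mean_score)
```

Passing the pipeline to the grid search gives the same per-round behaviour for every parameter combination, without having to write the loop.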

In that way we can avoid data leakage successfully.

Conclusion

When performing hyper-parameter tuning with techniques that use cross-validation, one should be aware of data leakage. In such situations, using pipelines helps to avoid data leaks.

Thank you for reading!

Any feedback and comments are greatly appreciated!
