
Introduction

Data leakage occurs when the model gains access to patterns in the test data during its training phase. In other words, the data you are using to train your ML algorithm contains information about what you are trying to predict.

Data leakage prevents the model from generalizing well, because scores measured during development look better than the performance on genuinely unseen data. It is also very difficult for a data scientist to identify. Some common causes of data leakage are:

  • Treating outliers and missing values with central values (such as the mean or median) before splitting
  • Scaling the data before splitting into training and testing sets
  • Training your model on both the train and test data
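
As a minimal sketch of the leak-free ordering for the first two causes, the statistics can be learned from the training split only and then reused on the test split. The toy array below is made up purely for illustration, and `SimpleImputer` with a median strategy is an assumed choice of "central value" treatment:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one missing value (illustration only).
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# 1. Split first, so the test rows stay unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Learn the imputation and scaling statistics from the training rows only.
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
X_train_prep = scaler.fit_transform(imputer.fit_transform(X_train))

# 3. Reuse the already fitted statistics on the test rows (transform, not fit).
X_test_prep = scaler.transform(imputer.transform(X_test))
```

The key point is that `fit` (or `fit_transform`) is only ever called on the training rows; the test rows only go through `transform`.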

Hyper-parameter tuning is the process of finding the set of hyper-parameters with which the ML algorithm delivers its best performance.


For more on hyper-parameters and tuning techniques, refer to my previous article.


Most hyper-parameter tuning techniques use cross-validation to select the best set of hyper-parameters. Cross-validation splits the data into train and test sets, builds a model for each candidate set of hyper-parameters on the train set, and validates its performance on the held-out test set. Eventually, it selects the combination of parameters that gives the highest performance.
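
To make the splitting concrete, here is a small sketch that prints which rows land in the train and held-out groups for 3-fold cross-validation. The six-row array is hypothetical, and plain `KFold` is used only to keep the illustration simple (for a classifier with an integer `cv`, `GridSearchCV` uses stratified folds by default):

```python
import numpy as np
from sklearn.model_selection import KFold

# Six hypothetical samples with two features each (illustration only).
X = np.arange(12).reshape(6, 2)

kf = KFold(n_splits=3)
folds = []
for train_idx, val_idx in kf.split(X):
    # Each round trains on two groups and holds out the third.
    folds.append((train_idx.tolist(), val_idx.tolist()))
    print("train rows:", train_idx, "held-out rows:", val_idx)
```

Every row is held out exactly once, and each round trains on the remaining two thirds of the data.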

But performing preprocessing steps such as scaling or imputing on the whole train data before handing it to tuning techniques that use cross-validation, such as Grid Search or Random Search, causes data leakage.


Let's understand this in more detail with code.


# import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

X = load_breast_cancer().data
y = load_breast_cancer().target

# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

# initialize StandardScaler and transform the data
std = StandardScaler()
X_train_scale = std.fit_transform(X_train)
X_test_scale = std.transform(X_test)

# define the parameter grid
grid_params = {'C': [0.01, 0.1, 1.0, 10], 'gamma': [0.1, 0.01, 10]}
grid = GridSearchCV(SVC(), grid_params, cv=3)
grid.fit(X_train_scale, y_train)

# best set of parameters
print(grid.best_params_)
# best score with the best set of parameters
print(grid.best_score_)

In the above code, I first scaled the train data with StandardScaler and then fitted GridSearchCV with a Support Vector Classifier as the estimator and cv = 3, i.e. 3-fold cross-validation. With 3-fold cross-validation, the train data is split into 3 groups. In each of 3 rounds, 2 groups serve as the train set and the remaining group serves as the test set. Each model, with its unique set of parameters, is trained on the train set and evaluated on the test set; its score is recorded and the model is discarded. This process repeats until each of the 3 groups has served as the test set once.

Here the model has been influenced by the data in the test group while training (data leakage). The data leaked through the StandardScaler operation: StandardScaler computes a z-score for each value from the mean and standard deviation of the data it was fitted on.
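
For reference, the standardization applied to each value x of a feature can be written as follows, where μ and σ denote the mean and standard deviation of that feature in the data passed to the scaler's `fit` step (symbols chosen here for illustration):

```latex
z = \frac{x - \mu}{\sigma}
```

Whichever rows contribute to μ and σ therefore influence every scaled value, including rows that are later used for training.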

The train data was scaled before Grid Search, so the mean and standard deviation used by StandardScaler were calculated on the entire train data. Cross-validation then split this already scaled train data into train and test groups. The train groups used to fit the algorithm therefore already carry information about the test group, namely the mean and standard deviation of the full train data, which is the combination of the train and test groups. This is how data leakage happens with hyper-parameter tuning.
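
One way to see this effect, sketched under the assumption that the folds are produced by `StratifiedKFold(n_splits=3)`, is to compare the means learned by a scaler fitted on the whole train data with those learned from the train groups of a single round:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

# Scaler fitted once on the whole train data (the leaky setup).
leaky_scaler = StandardScaler().fit(X_train)

# Row indices for the first of the three cross-validation rounds.
train_idx, val_idx = next(StratifiedKFold(n_splits=3).split(X_train, y_train))

# Scaler fitted only on the rows this round is allowed to train on.
fold_scaler = StandardScaler().fit(X_train[train_idx])

# The means differ: the leaky scaler also reflects the held-out rows.
mean_gap = np.abs(leaky_scaler.mean_ - fold_scaler.mean_)
print("largest difference in feature means:", mean_gap.max())
```

Any nonzero gap means the held-out rows helped shape the scaled values that the model was trained on.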

Solution

The solution to data leakage in such cases is a Pipeline.


A pipeline helps automate machine learning workflows such as scaling, dimensionality reduction, model fitting, and validation. It takes multiple steps in a machine learning process and combines them into a single object, which makes the workflow easier to develop and use as well as to save and reuse later.
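
As a minimal sketch of a pipeline used on its own (variable names such as `pipe` are chosen here for illustration), the combined object exposes the same `fit` and `score` methods as a single estimator:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

pipe = make_pipeline(StandardScaler(), SVC())

# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['standardscaler', 'svc']

# fit() fits the scaler on X_train, transforms X_train, then fits the SVC.
pipe.fit(X_train, y_train)

# score() reuses the fitted scaler on X_test before predicting.
test_accuracy = pipe.score(X_test, y_test)
print(test_accuracy)
```

The step names matter for tuning: parameters of a step inside a pipeline are addressed as `<step name>__<parameter>`, for example `svc__C`.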


Let's understand how a pipeline solves data leakage with code.


# import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

X = load_breast_cancer().data
y = load_breast_cancer().target

# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

# create the pipeline
make_pipe = make_pipeline(StandardScaler(), SVC())

# define the parameter grid; keys need the step name prefix ('svc__') inside a pipeline
grid_params = {'svc__C': [0.01, 0.1, 1.0, 10], 'svc__gamma': [0.1, 0.01, 10]}
grid_ = GridSearchCV(make_pipe, grid_params, cv=3)
grid_.fit(X_train, y_train)

# best set of parameters
print(grid_.best_params_)
# best score with the best set of parameters
print(grid_.best_score_)

In the above code, I combined StandardScaler and SVC into a pipeline and then passed the pipeline object to Grid Search with cv = 3.

In this situation, the data is first split into 3 groups (cv = 3), i.e. 2 train groups and 1 test group in each round. In each round, the scaler is fitted on the 2 train groups only and then applied to the test group. So, there is no data leakage here.
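
The per-round behaviour can be sketched by hand for a single candidate model. This is an illustrative approximation rather than scikit-learn's internal code; the folds are assumed to come from `StratifiedKFold(n_splits=3)`, and the `C` and `gamma` values are picked arbitrarily from the grid:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

fold_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=3).split(X_train, y_train):
    # Fit the scaler on the two train groups only.
    scaler = StandardScaler().fit(X_train[train_idx])

    # Train one candidate model (C and gamma picked arbitrarily from the grid).
    model = SVC(C=1.0, gamma=0.01)
    model.fit(scaler.transform(X_train[train_idx]), y_train[train_idx])

    # The held-out group is only transformed, never used to fit anything.
    score = model.score(scaler.transform(X_train[val_idx]), y_train[val_idx])
    fold_scores.append(score)

print("mean validation accuracy:", np.mean(fold_scores))
```

Passing a pipeline to Grid Search saves writing this loop for every parameter combination.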

In this way, we can avoid data leakage.
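
The same pattern should carry over to other tuning techniques that use cross-validation. As a hedged sketch for Random Search, the pipeline can be passed to `RandomizedSearchCV`; the values `n_iter=5` and `random_state=0` are arbitrary choices for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

pipe = make_pipeline(StandardScaler(), SVC())

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_distributions = {
    "svc__C": [0.01, 0.1, 1.0, 10],
    "svc__gamma": [0.1, 0.01, 10],
}

search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=5, cv=3, random_state=0
)
search.fit(X_train, y_train)
print(search.best_params_)

# Final check on the test split, which played no part in tuning.
final_accuracy = search.score(X_test, y_test)
print(final_accuracy)
```

Because the scaler lives inside the pipeline, it is refitted on the train groups of every round, and the test split is only touched once at the end.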


Conclusion

When performing hyper-parameter tuning with techniques that use cross-validation, one should be aware of data leakage. In such situations, using pipelines helps to avoid it.

Thank you for reading!

