How to Prevent Data Leakage When Evaluating Machine Learning Models

This article discusses the problem of data leakage when evaluating model performance, and how to avoid it.

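During model evaluation, data leakage occurs when information crosses between the training data and the validation/test data, for example when preprocessing statistics are computed on rows that are later used for validation. This biases the estimate of the model's performance on the validation/test set. Let's understand it with an example that uses the "Boston house prices" dataset from Scikit-Learn. The dataset has no missing values, so 100 missing values are introduced at random to better demonstrate data leakage.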

import numpy as np
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older release
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.metrics import mean_squared_error
# Importing the dataset
boston = load_boston()
data = pd.DataFrame(boston['data'], columns=boston['feature_names'])
data['target'] = boston['target']
# Splitting the input and target features
X = data.iloc[:, :-1].copy()
y = data.iloc[:, -1].copy()
# Adding 100 random missing values
np.random.seed(11)
rand_cols = np.random.randint(0, X.shape[1], 100)
rand_rows = np.random.randint(0, X.shape[0], 100)
for i, j in zip(rand_rows, rand_cols):
    X.iloc[i, j] = np.nan
# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)
# Initializing KNN Regressor
knn = KNeighborsRegressor()
# Initializing mode imputer
imp = SimpleImputer(strategy='most_frequent')
# Initializing StandardScaler
standard_scaler = StandardScaler()
# Imputing and scaling X_train (this happens before cross-validation, which causes the leak)
X_train_impute = imp.fit_transform(X_train).copy()
X_train_scaled = standard_scaler.fit_transform(X_train_impute).copy()
# Running 5-fold cross-validation
cv = cross_validate(estimator=knn, X=X_train_scaled, y=y_train, cv=5, scoring="neg_root_mean_squared_error", return_train_score=True)
# Calculating mean of the training scores of cross-validation
print(f'Training RMSE (with data leakage): {-1 * np.mean(cv["train_score"])}')
# Calculating mean of the validation scores of cross-validation
print(f'Validation RMSE (with data leakage): {-1 * np.mean(cv["test_score"])}')
# Fitting the model to the training data (the original called an undefined `lr` here)
knn.fit(X_train_scaled, y_train)
# Preprocessing the test data
X_test_impute = imp.transform(X_test).copy()
X_test_scaled = standard_scaler.transform(X_test_impute).copy()
# Predictions and model evaluation on unseen data
pred = knn.predict(X_test_scaled)
print(f'RMSE on unseen data: {np.sqrt(mean_squared_error(y_test, pred))}')

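In the code above, 'X_train' is the training set (used for k-fold cross-validation) and 'X_test' is used to evaluate the model on unseen data. The code is an example of model evaluation with data leakage: the mode used to impute missing values (strategy='most_frequent') is computed on the whole of 'X_train'. Likewise, the mean and standard deviation used to scale the data are computed on 'X_train'. The missing values of 'X_train' are imputed, and 'X_train' is scaled, before k-fold cross-validation runs.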
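To make the effect of computing statistics on the whole of 'X_train' concrete, here is a minimal sketch. It uses a small synthetic array (an assumption for illustration, not the Boston data; the names X_train_demo, leaky_scaler and fold_scaler are hypothetical) and compares the mean a StandardScaler learns from all rows with the mean it learns from only the training portion of the first cross-validation split.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train (assumption: not the article's data)
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Leaky: statistics are computed on every row,
# including rows that will later serve as validation data
leaky_scaler = StandardScaler().fit(X_train_demo)

# Take the first split of a 5-fold cross-validation
train_idx, val_idx = next(KFold(n_splits=5).split(X_train_demo))

# Leak-free: statistics are computed on the training portion only
fold_scaler = StandardScaler().fit(X_train_demo[train_idx])

mean_gap = np.abs(leaky_scaler.mean_ - fold_scaler.mean_)
print("mean from all rows:        ", leaky_scaler.mean_)
print("mean from training portion:", fold_scaler.mean_)
print("absolute difference:       ", mean_gap)
```

The two sets of means differ because the first scaler has seen the validation rows. The gap is small here, but it is exactly the information that should not reach the training portion.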

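In k-fold cross-validation, 'X_train' is split into 'k' folds. In each iteration, one fold is used for validation (call it the validation portion) and the remaining folds are used for training (the training portion). In every iteration, both portions already have their missing values imputed with the mode computed on all of 'X_train'. Likewise, both have already been scaled with the mean and standard deviation computed on 'X_train'. This imputation and scaling leaks information from the whole of 'X_train' into the training and validation portions of every iteration. The leakage can bias the estimate of the model's performance on the validation portion. The code below shows one way to avoid it by using a pipeline.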

# Preprocessing and regressor pipeline
pipeline = Pipeline(steps=[('imputer', imp), ('scaler', standard_scaler), ('regressor', knn)])
# Running 5-fold cross-validation using the pipeline as estimator
cv = cross_validate(estimator=pipeline, X=X_train, y=y_train, cv=5, scoring="neg_root_mean_squared_error", return_train_score=True)
# Calculating mean of the training scores of cross-validation
print(f'Training RMSE (without data leakage): {-1 * np.mean(cv["train_score"])}')
# Calculating mean of the validation scores of cross-validation
print(f'Validation RMSE (without data leakage): {-1 * np.mean(cv["test_score"])}')
# Fitting the pipeline to the training data
pipeline.fit(X_train, y_train)
# Predictions and model evaluation on unseen data
pred = pipeline.predict(X_test)
print(f'RMSE on unseen data: {np.sqrt(mean_squared_error(y_test, pred))}')

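In the code above, the imputer, the scaler and the regressor are all included in the pipeline. In this example, 'X_train' is split into 5 folds, and in each iteration the pipeline uses the training portion to compute the mode that fills the missing values in both the training and validation portions. Likewise, the mean and standard deviation used to scale both portions are computed on the training portion. This removes the data leakage, because in every k-fold iteration the imputation mode and the scaling mean and standard deviation are computed on the training portion only. Those values are then used to impute and scale both the training and validation portions of that iteration.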
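The per-fold behaviour described above can be written out by hand. The following is a sketch under stated assumptions rather than a description of scikit-learn's internals: it uses synthetic data with injected missing values (X_demo and y_demo are hypothetical names, not the Boston data), fits the imputer and scaler on the training portion of each fold only, and checks that the resulting validation RMSEs match what cross_validate reports for an equivalent pipeline.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption, for illustration only)
rng = np.random.default_rng(11)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)
# Inject up to 30 missing values at random positions
X_demo[rng.integers(0, 200, 30), rng.integers(0, 4, 30)] = np.nan

kfold = KFold(n_splits=5)

# Manual loop: every statistic is learned from the training portion only
manual_rmse = []
for train_idx, val_idx in kfold.split(X_demo):
    fold_imp = SimpleImputer(strategy="most_frequent")
    fold_scaler = StandardScaler()
    fold_knn = KNeighborsRegressor()
    # Fit imputer and scaler on the training portion, then fit the model
    X_tr = fold_scaler.fit_transform(fold_imp.fit_transform(X_demo[train_idx]))
    fold_knn.fit(X_tr, y_demo[train_idx])
    # Reuse the training-portion statistics on the validation portion
    X_val = fold_scaler.transform(fold_imp.transform(X_demo[val_idx]))
    fold_pred = fold_knn.predict(X_val)
    manual_rmse.append(np.sqrt(mean_squared_error(y_demo[val_idx], fold_pred)))

# The same steps wrapped in a pipeline and evaluated on the same folds
demo_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("scaler", StandardScaler()),
    ("regressor", KNeighborsRegressor()),
])
demo_cv = cross_validate(demo_pipeline, X_demo, y_demo, cv=kfold,
                         scoring="neg_root_mean_squared_error")
pipeline_rmse = -demo_cv["test_score"]

print("manual per-fold RMSE:  ", np.round(manual_rmse, 4))
print("pipeline per-fold RMSE:", np.round(pipeline_rmse, 4))
```

The pipeline is the more convenient form because the fit-on-training, transform-on-validation bookkeeping is handled for each fold and cannot be forgotten.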

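We can see the difference between the training and validation RMSEs computed with and without data leakage. Because the dataset is small, the difference between them is slight. With a large dataset, the difference could be considerable. The fact that the validation RMSE (with data leakage) is close to the RMSE on unseen data is only a coincidence.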

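Therefore, using a pipeline for k-fold cross-validation prevents data leakage and gives a better estimate of the model's performance on unseen data.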

Author: KSV Muralidhar

Translated by the deephub translation team
