Imputing Missing Values with a Random Forest

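The idea behind random forest imputation:

X and y are related, which is why X can be used to predict y; conversely, y can also predict X to some extent.

When a feature x in X has missing values, we treat that feature as the target and y as an additional feature (that is, the remaining columns of X together with y form the feature matrix, and x is the target). Samples where x is present form the training set, samples where x is missing form the test set, and a random forest (a regressor or a classifier, depending on the type of x) predicts the missing values.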
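The single-feature case described above can be sketched on a small synthetic dataset. This is a minimal illustration, not the Boston example used later: the column names `a`, `b`, `x`, the label `y`, and all sizes are assumptions made up for the sketch.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
n = 300

# Synthetic data (hypothetical): x depends on a and b, and y depends on all three
a = rng.normal(size=n)
b = rng.normal(size=n)
x = 2.0 * a - b + rng.normal(scale=0.1, size=n)
y = a + 2.0 * b + x + rng.normal(scale=0.1, size=n)

df = pd.DataFrame({"a": a, "b": b, "x": x})
true_x = df["x"].copy()

# Knock out 60 values of x
missing_rows = rng.choice(n, size=60, replace=False)
df.loc[missing_rows, "x"] = np.nan

# Feature matrix for the imputation model: X without x, plus y as a new feature
features = df.drop(columns="x").assign(y=y)
is_missing = df["x"].isnull()

# Train on rows where x is present, predict the rows where x is missing
rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(features[~is_missing], df.loc[~is_missing, "x"])
df.loc[is_missing, "x"] = rf.predict(features[is_missing])

# Compare against simply filling with the mean of the observed values
rf_mae = (df.loc[missing_rows, "x"] - true_x.loc[missing_rows]).abs().mean()
mean_mae = (true_x[~is_missing].mean() - true_x.loc[missing_rows]).abs().mean()
print("RF imputation MAE:", rf_mae)
print("Mean imputation MAE:", mean_mae)
```

Because the true values are known here, the two error figures show directly how much the model-based fill improves on a constant fill for this particular synthetic setup.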

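When several features have missing values, start with the feature that has the fewest. While that feature is being processed, the missing values in the other features are temporarily filled with 0. Once the random forest has predicted the feature's missing values, write them back into the data, then handle the next feature in the same way.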

import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_score
from sklearn.impute import SimpleImputer

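# Load the original data
# Note: load_boston was removed in scikit-learn 1.2, so this code needs an older release
ori_x = load_boston().data
target = load_boston().target
pd.DataFrame(ori_x).isnull().sum()
# None of the 13 features in the original data has missing values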

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
dtype: int64

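missing_rate = 0.2  # missing rate
missing_cnt = int(np.ceil(missing_rate*ori_x.shape[0]*ori_x.shape[1]))  # total number of values to knock out
rs = np.random.RandomState(805)  # random seed
missing_row_idx = rs.randint(0,ori_x.shape[0],missing_cnt)  # row indices of the missing values
missing_col_idx = rs.randint(0,ori_x.shape[1],missing_cnt)  # column indices of the missing values
# Rows and columns are drawn with replacement, so some (row, column) pairs repeat and
# the actual number of missing cells (1187 in the output below) is smaller than missing_cnt (1316)
missing_x = ori_x.copy()
missing_x[missing_row_idx,missing_col_idx] = np.nan
missing_x = pd.DataFrame(missing_x,columns=load_boston().feature_names)
missing_x.isnull().sum().sort_values()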

DIS         78
AGE         81
PTRATIO     84
TAX         86
RM          88
RAD         88
INDUS       91
CHAS        92
ZN          96
CRIM        98
NOX        100
B          101
LSTAT      104
dtype: int64
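The counts above add up to fewer cells than `missing_cnt` because `randint` can pick the same cell twice. If an exact missing rate matters, one option is to draw flat cell indices without replacement. The sketch below assumes only the 506 x 13 shape of the Boston feature matrix and uses a placeholder array of ones instead of the real data; the names `flat_idx`, `rows_exact`, and `cols_exact` are illustrative.

```python
import numpy as np

n_rows, n_cols = 506, 13  # shape of the Boston feature matrix
missing_cnt = int(np.ceil(0.2 * n_rows * n_cols))

rs = np.random.RandomState(805)

# Independent draws with replacement: the same cell can be hit more than once
rows = rs.randint(0, n_rows, missing_cnt)
cols = rs.randint(0, n_cols, missing_cnt)
unique_cells = len(set(zip(rows.tolist(), cols.tolist())))

# Alternative: sample flat cell indices without replacement, then map them
# back to (row, column) pairs
flat_idx = rs.choice(n_rows * n_cols, size=missing_cnt, replace=False)
rows_exact, cols_exact = np.unravel_index(flat_idx, (n_rows, n_cols))

x = np.ones((n_rows, n_cols))  # placeholder data, not the real dataset
x[rows_exact, cols_exact] = np.nan

print("requested:", missing_cnt)
print("distinct cells with replacement:", unique_cells)
print("missing cells without replacement:", int(np.isnan(x).sum()))
```

Sampling without replacement guarantees that exactly `missing_cnt` cells become NaN, at the cost of no longer matching the per-column counts shown in this post.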

# Sort the features by number of missing values, ascending (processing starts with the feature that has the fewest)
missing_col_name = missing_x.isnull().sum().sort_values().index
missing_col_name

Index(['DIS', 'AGE', 'PTRATIO', 'TAX', 'RM', 'RAD', 'INDUS', 'CHAS', 'ZN','CRIM', 'NOX', 'B', 'LSTAT'],dtype='object')

missing_reg = missing_x.copy()
for col in missing_col_name:
    missing_x_y = missing_reg.copy()
    missing_x_y['target'] = target  # add y to the feature matrix as a new feature
    rf = RandomForestRegressor(random_state=805)
    # Samples where this feature is missing: drop the feature, then fill the other features' missing values with 0
    test_x = missing_x_y[missing_x_y[col].isnull()].drop(col,axis=1).fillna(0)
    # Row indices of the samples where this feature is missing
    missing_idx = list(missing_x_y[missing_x_y[col].isnull()][col].index)
    # Samples where this feature is present: drop the feature, then fill the other features' missing values with 0; this is the training X
    train_x = missing_x_y[missing_x_y[col].notnull()].drop(col,axis=1).fillna(0)
    # Samples where this feature is present, keeping only this feature; this is the training y
    train_y = missing_x_y[missing_x_y[col].notnull()][col].values
    rf.fit(train_x,train_y)
    pre_y = rf.predict(test_x)
    missing_reg.loc[missing_idx,col] = pre_y  # fill this feature's missing values with the predictions
missing_reg.isnull().sum()  # after processing, no feature has missing values

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
dtype: int64
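For comparison, scikit-learn ships a related tool, `IterativeImputer`, which can take a random forest as its estimator. It is not the same procedure as the loop above: it starts from a mean fill rather than 0, it does not see the label y unless you add it as a column yourself, and it can cycle through the features for several rounds. The following is a rough sketch on synthetic data (the array shape, missing rate, and parameter values are assumptions), with `imputation_order="ascending"` chosen to mirror the fewest-missing-first order.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)

# Synthetic data (hypothetical): the last column depends on the first two
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Knock out roughly 10% of the cells
X_missing = X.copy()
mask = rng.rand(*X.shape) < 0.1
X_missing[mask] = np.nan

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    imputation_order="ascending",  # features with the fewest missing values first
    max_iter=3,
    random_state=0,
)
X_filled = imputer.fit_transform(X_missing)

print("missing before:", int(np.isnan(X_missing).sum()))
print("missing after:", int(np.isnan(X_filled).sum()))
```

The imputer may emit a convergence warning with so few iterations; `max_iter` is kept small here only to keep the sketch fast.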

# Compare the final prediction results of the different imputation methods
mean_imp = SimpleImputer(missing_values=np.nan,strategy='mean')  # fill each feature's missing values with that feature's mean
missing_mean = mean_imp.fit_transform(missing_x)
missing_zero = missing_x.fillna(0)  # fill all missing values with 0
rf = RandomForestRegressor(random_state=805)
# 10-fold cross-validated MSE for each version of the data (the scorer returns negative MSE, hence the sign flip)
mse = [
    cross_val_score(rf,x,target,cv=10,scoring='neg_mean_squared_error').mean()*(-1)
    for x in (ori_x, missing_reg, missing_mean, missing_zero)
]
print('MSE with no missing values: {}\nMSE with regression imputation: {}\nMSE with mean imputation: {}\nMSE with zero imputation: {}'.format(*mse))

# Visualize the results:
x_labels = ['Full data', 'Reg Imputation', 'Mean Imputation', 'Zero Imputation']
colors = ['r', 'g', 'b', 'orange']
plt.figure(figsize=(12, 6))
ax = plt.subplot(111)
for i in np.arange(len(mse)):
    ax.barh(i, mse[i], color=colors[i], alpha=0.6, align='center')
    ax.text(mse[i], i, round(mse[i], 2))  # label each bar with its MSE
ax.set_title('Imputation Techniques with Boston Data')
ax.set_xlim(left=np.min(mse) * 0.9, right=np.max(mse) * 1.1)
ax.set_yticks(np.arange(len(mse)))
ax.set_xlabel('MSE')
ax.set_yticklabels(x_labels)
plt.show()
