Task 3 特征工程 Datawhale零基础入门数据挖掘- 二手车交易价格预测

Tips:此部分为零基础入门数据挖掘的Task3特征工程部分，主要包含各种特征工程以及分析方法
赛题：零基础入没人能数据挖掘-二手车交易预测价格
地址：https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX

文章内容

特征工程目标
内容介绍
代码示例

1 特征工程目标

1.1 对于特征进行进一步分析，并对于数据进行处理
1.2 完成对于特征工程的分析，并对于数据进行一些图表或者文字总结并打卡

2 内容介绍

常见的特征工程包括：
1.异常处理：

通过箱线图（或3-sigma）分析删除异常值
BOX-COX转换（处理有偏分布）；
长尾截断；

2.特征归一化/标准化

标准化（转换为标准正太分布）；
归一化（转换到[0,1]区间）；
针对幂率分布，可以采用公式： $log⁡((1+x)/(1+median))\log((1+x)/(1+median))$

3.数据分桶：

等频分桶
等距分桶
Best-KS分桶（类似利用基尼指数进行二分类）
卡方分桶

4.缺失值处理：

不处理（针对类似XGBoost等树模型）
删除（缺失数据太多）
插值补全，包括均值/中位数/众数/建模预测/多重插补/压缩感知补全/矩阵不全等；
分箱，缺失值一个箱

5.特征构造：

构造统计量特征，报告计数、求和、比例、标准差等
时间特征，包括相对时间和绝对时间，节假日，双休日等；
地理信息，包括分箱，分布编码等方法
非线性变换，包括log/平方/根号等
特征组合，特征交叉

6.特征筛选

过滤式（fliter）:先对数据进行特征选择，然后再训练学习器，常见的方法有Relif/方差选择法/相关系数法/卡方检验法/互信息法；
包裹式（wrapper）：直接把最终将要使用的学习器的性能作为特征子集的评价准则，常见方法有 LVM（Las Vegas Wrapper）
嵌入式（embedding）：结合过滤式和包裹式，学习器训练过程中自动进行了特征选择，常见的有 lasso 回归；

7.降维

PCA/LDA/ICA;
特征选择也是一种降维

代码示例

1.加载数据库

import pandas as pd
import numpy as np
import matplotlib.pyplot as pltimport seaborn as sns
from operator import itemgetter
%matplotlib inline

2.读取数据

path='./datalab/231784'
Train_data=pd.read_csv(path+'used_car_train_20200313.csv',sep=' ')
Test_data = pd.read_csv(path+'used_car_testA_20200313.csv', sep=' ')
print(Train_data.shape)
print(Test_data.shape)

Train_data.colunmns
Test_data.columns

3.删除异常值
这里定义了一个处理异常值的代码，如下

def outliers_proc(data, col_name, scale=3):"""用于清洗异常值，默认用 box_plot（scale=3）进行清洗:param data: 接收 pandas 数据格式:param col_name: pandas 列名:param scale: 尺度:return:"""def box_plot_outliers(data_ser, box_scale):"""利用箱线图去除异常值:param data_ser: 接收 pandas.Series 数据格式:param box_scale: 箱线图尺度，:return:"""iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))val_low = data_ser.quantile(0.25) - iqrval_up = data_ser.quantile(0.75) + iqrrule_low = (data_ser < val_low)rule_up = (data_ser > val_up)return (rule_low, rule_up), (val_low, val_up)
data_n = data.copy()data_series = data_n[col_name]rule, value = box_plot_outliers(data_series, box_scale=scale)index = np.arange(data_series.shape[0])[rule[0] | rule[1]]print("Delete number is: {}".format(len(index)))data_n = data_n.drop(index)data_n.reset_index(drop=True, inplace=True)print("Now column number is: {}".format(data_n.shape[0]))index_low = np.arange(data_series.shape[0])[rule[0]]outliers = data_series.iloc[index_low]print("Description of data less than the lower bound is:")print(pd.Series(outliers).describe())index_up = np.arange(data_series.shape[0])[rule[1]]outliers = data_series.iloc[index_up]print("Description of data larger than the upper bound is:")print(pd.Series(outliers).describe())fig, ax = plt.subplots(1, 2, figsize=(10, 7))sns.boxplot(y=data[col_name], data=data, palette="Set1", ax=ax[0])sns.boxplot(y=data_n[col_name], data=data_n, palette="Set1", ax=ax[1])return data_n

调用上述函数，对数据集进行异常值处理，这里以power为例

Train_data = outliers_proc(Train_data, 'power', scale=3)

4.特征构造
将训练集和测试集放在一起，方便构造特征

Train_data['train']=1
Test_data['train']=0
data=pd.concat([Train_data,Test_data],ignore_index=True)

使用时间：data[‘createDate’]-data[‘regData’],反应汽车使用时间，一般来说价格与使用时间成反比，不过要注意，数据里有时间出错的格式，所以我们需要error =‘coerce’。这里构造了一个使用时间的特征：使用时间，依据的两个特征是制造时间-注册时间，再将时间变为天数

data['used_time']=pd.to_datetine(data['acreateDate'],format='%Y%m%d'),errors='coerce')-pd.to_datetime(data['regDate'],format='%Y%m%d',errors='corece')).dt.days

看一下空数据，有 15k 个样本的时间是有问题的，我们可以选择删除，也可以选择放着。但是这里不建议删除，因为删除缺失数据占总样本量过大，7.5%。我们可以先放着，因为如果我们 XGBoost 之类的决策树，其本身就能处理缺失值，所以可以不用管；

data['used_time'].isnull().sum()

从邮编中提取城市信息，相当于加入了先验知识

data['city']=data['regionCode'].apply(lambda x:str(x)[:-3])
data=data

比如计算某品牌的销售统计量，这里要以 train 的数据计算统计量

Train_gb = Train_data.groupby("brand")
all_info = {}
for kind, kind_data in Train_gb:info = {}kind_data = kind_data[kind_data['price'] > 0]info['brand_amount'] = len(kind_data)info['brand_price_max'] = kind_data.price.max()info['brand_price_median'] = kind_data.price.median()info['brand_price_min'] = kind_data.price.min()info['brand_price_sum'] = kind_data.price.sum()info['brand_price_std'] = kind_data.price.std()info['brand_price_average'] = round(kind_data.price.sum() / (len(kind_data) + 1), 2)all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "brand"})
data = data.merge(brand_fe, how='left', on='brand')

数据分桶以 power 为例。这时候我们的缺失值也进桶了，为什么要做数据分桶呢，原因有很多，= =

离散后稀疏向量内积乘法运算速度更快，计算结果也方便存储，容易扩展；
离散后的特征对异常值更具鲁棒性，如 age>30 为 1 否则为 0，对于年龄为 200 的也不会对模型造成很大的干扰；
LR 属于广义线性模型，表达能力有限，经过离散化后，每个变量有单独的权重，这相当于引入了非线性，能够提升模型的表达能力，加大拟合；
离散后特征可以进行特征交叉，提升表达能力，由 M+N 个变量编程 M*N 个变量，进一步引入非线形，提升了表达能力；
特征离散后模型更稳定，如用户年龄区间，不会因为用户年龄长了一岁就变化

bin = [i*10 for i in range(31)]
data['power_bin'] = pd.cut(data['power'], bin, labels=False)
data[['power_bin', 'power']].head()

删除不需要的数据

data=data.drop(['creatDate','regDate','regionCode'],axis=1)
print(data.shape)
data.colunmns

目前的数据其实已经可以给树模型使用了，所以导出一下

data.to_csv('data_for_tree.csv',index=0)

由于不同模型对数据集的要求不同，所以我们还可以构造易分特征给LR NN之类的模型用。先看一下数据分布：

data['power'].plot.hist()

power的直方图显示是为：

刚刚已经对train进行异常值处理，但是现在还有这么奇怪的分布是因为test中的power异常值，所以其实train中的异常值可以考虑删除，可以用长尾分布截断来代替

Train_data['power'].plot.hist()

我们对其去log,再做归一化

from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
data['power'] = np.log(data['power'] + 1)
data['power'] = ((data['power'] - np.min(data['power'])) / (np.max(data['power']) - np.min(data['power'])))
data['power'].plot.hist()

km的比较正常，应该是做过分桶了。由此我们可以直接做归一化

data['kilometer'].plot.hist()
data['kilometer'] = ((data['kilometer'] - np.min(data['kilometer'])) / (np.max(data['kilometer']) - np.min(data['kilometer'])))
data['kilometer'].plot.hist()

除此之外，还包括构造如下的统计量特征：
brand_amount，brand_price_average,brand_price_max,brand_price_median,brand_price_min,brand_price_srd,brand_price_sum,下面就将建造的这些特征变量进行归一化处理：

def max_min(x): return (x-np.min(x))/(np.max(x)-np.min(x))data['brand_amount'] = ((data['brand_amount'] - np.min(data['brand_amount'])) / (np.max(data['brand_amount']) - np.min(data['brand_amount'])))
data['brand_price_average'] = ((data['brand_price_average'] - np.min(data['brand_price_average'])) / (np.max(data['brand_price_average']) - np.min(data['brand_price_average'])))
data['brand_price_max'] = ((data['brand_price_max'] - np.min(data['brand_price_max'])) / (np.max(data['brand_price_max']) - np.min(data['brand_price_max'])))
data['brand_price_median'] = ((data['brand_price_median'] - np.min(data['brand_price_median'])) /(np.max(data['brand_price_median']) - np.min(data['brand_price_median'])))
data['brand_price_min'] = ((data['brand_price_min'] - np.min(data['brand_price_min'])) / (np.max(data['brand_price_min']) - np.min(data['brand_price_min'])))
data['brand_price_std'] = ((data['brand_price_std'] - np.min(data['brand_price_std'])) / (np.max(data['brand_price_std']) - np.min(data['brand_price_std'])))
data['brand_price_sum'] = ((data['brand_price_sum'] - np.min(data['brand_price_sum'])) / (np.max(data['brand_price_sum']) - np.min(data['brand_price_sum'])))

对类别特征进行ONE-HOT编码

data=pd.get_dummies(data,columns=['model','brand','bodyType','fuelType','gearbox','notrepairedDamage','power_bin'])
print(data.shape)
data.columns
data.to_csv('data_for_lr.csv',index=0)

5.特征筛选
1)过滤式
相关性分析

print(data['power'].corr(data['price'],method='spearman'))
print(data['kilometer'].corr(data['price'],method='spearman'))
print(data['brand_amount'].corr(data['price'],method='spearman'))
print(data['brand_price_average'].corr(data['price'],method='spearman'))
print(data['brand_price_max'].corr(data['price'],method='spearman'))
print(data['brand_price_median'].corr(data['price'],method='spearman'))

也可以画图显示，对特征之间的相关性可视化

data_numeric=data[['power','kilometer','brand_amount','brand_price_average', 'brand_price_max','brand_price_median']]
correlation=data_numeric.corr()
f,ax=plt.subplots(figsize=(7,7))
plt.title('Correlation of Numeric Features with Price',y=1,size=16)
sns.heatmap(correlation,square=True,vmax=0.8)

2）包裹式
需要注意下：下载速度可能会很慢（这种方法还没尝试过）

！pip install mlxtend

运行如下可能会报错，因此可以采取更改类型或者fillna(’’)

# k_feature 太大会很难跑，没服务器，所以提前 interrupt 了
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
sfs = SFS(LinearRegression(),k_features=10,forward=True,floating=False,scoring = 'r2',cv = 0)
x = data.drop(['price'], axis=1)
x = x.fillna(0)
y = data['price']
sfs.fit(x, y)
sfs.k_feature_names_

也可以通过作图观察边际收益

from mlxtend.plotting import plot_squential_feature_selection as plot_sfs
import maplotlib.pyplot as plt
fig1=plot_sfs(sfs.get_metric_dict(),kind='std_dev')
plt.grid()
plt.show()

3)嵌入式
Lasso回归和决策树可以完成嵌入式特征选择，大部分情况下都是用嵌入式做特征筛选。

3.4 经验总结

     1. 特征工程是比赛中最至关重要的的一块，特别的传统的比赛，大家的模型可能都差不多，调参带来的效果增幅是非常有限的，但特征工程的好坏往往会决定了最终的排名和成绩。2. 特征工程的主要目的还是在于将数据转换为能更好地表示潜在问题的特征，从而提高机器学习的性能。比如，异常值处理是为了去除噪声，填补缺失值可以加入先验知识等。3.特征构造也属于特征工程的一部分，其目的是为了增强数据的表达。4.有些比赛的特征是匿名特征，这导致我们并不清楚特征相互直接的关联性，这时我们就只有单纯基于特征进行处理，比如装箱，groupby，agg 等这样一些操作进行一些特征统计，此外还可以对特征进行进一步的 log，exp 等变换，或者对多个特征进行四则运算（如上面我们算出的使用时长），多项式组合等然后进行筛选。由于特性的匿名性其实限制了很多对于特征的处理，当然有些时候用 NN 去提取一些特征也会达到意想不到的良好效果。5.对于知道特征含义（非匿名）的特征工程，特别是在工业类型比赛中，会基于信号处理，频域提取，丰度，偏度等构建更为有实际意义的特征，这就是结合背景的特征构建，在推荐系统中也是这样的，各种类型点击率统计，各时段统计，加用户属性的统计等等，这样一种特征构建往往要深入分析背后的业务逻辑或者说物理原理，从而才能更好的找到 magic。6.当然特征工程其实是和模型结合在一起的，这就是为什么要为 LR NN 做分桶和特征归一化的原因，而对于特征的处理效果和特征重要性等往往要通过模型来验证。7.的来说，特征工程是一个入门简单，但想精通非常难的一件事。