1. Project Analysis and Design

This project trains a model on US census data to predict income level. The dataset contains 199,523 training records and 99,762 test records, each with 41 attributes. On inspection, the data covers demographic and financial information such as age, employment, nationality, and race. The attributes suffer from missing values, skewed distributions, and similar issues, so the workflow is:
1. Read the data and inspect the features and their distributions
2. Analyze and handle missing values
3. Handle outliers
4. Dummy-encode the categorical variables
5. Select important features with a random forest
6. Resample to address the class imbalance
7. Build an XGBoost model and use it for prediction and analysis

2. Data Exploration

In [1]:
import numpy as np
import pandas as pd

In [2]:
train_df=pd.read_csv('train.csv')
test_df=pd.read_csv('test.csv')
print ('train_df:%s,%s'%train_df.shape)
print ('test_df:%s,%s'%test_df.shape)

train_df:199523,41
test_df:99762,41

In [3]:
## check the target variable
train_df.income_level.unique()
test_df.income_level.unique()

Out[3]:
array(['-50000', '50000+.'], dtype=object)

In [4]:
# encode the target as 0/1 for easier analysis
train_df.loc[train_df['income_level']==-50000,'income_level']=0
train_df.loc[train_df['income_level']== 50000,'income_level']=1
test_df.loc[test_df['income_level']=='-50000','income_level']=0
test_df.loc[test_df['income_level']=='50000+.','income_level']=1

In [5]:
## check the degree of class imbalance
a=train_df['income_level'].sum()*100.0/train_df['income_level'].count()
b=test_df['income_level'].sum()*100.0/test_df['income_level'].count()
print ('train_df  (1,0):(%s,%s)'%(a,100-a))
print ('test_df  (1,0):(%s,%s)'%(b,100-b))

train_df  (1,0):(6.20580083499,93.794199165)
test_df  (1,0):(6.20075780357,93.7992421964)
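For reference, pandas' value_counts gives the same percentages directly; a one-line check, assuming income_level is already encoded as 0/1:

print train_df['income_level'].value_counts(normalize=True)*100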

Inspect the data

In [6]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199523 entries, 0 to 199522
Data columns (total 41 columns):
age                                 199523 non-null int64
class_of_worker                     199523 non-null object
industry_code                       199523 non-null int64
occupation_code                     199523 non-null int64
education                           199523 non-null object
wage_per_hour                       199523 non-null int64
enrolled_in_edu_inst_lastwk         199523 non-null object
marital_status                      199523 non-null object
major_industry_code                 199523 non-null object
major_occupation_code               199523 non-null object
race                                199523 non-null object
hispanic_origin                     198649 non-null object
sex                                 199523 non-null object
member_of_labor_union               199523 non-null object
reason_for_unemployment             199523 non-null object
full_parttime_employment_stat       199523 non-null object
capital_gains                       199523 non-null int64
capital_losses                      199523 non-null int64
dividend_from_Stocks                199523 non-null int64
tax_filer_status                    199523 non-null object
region_of_previous_residence        199523 non-null object
state_of_previous_residence         198815 non-null object
d_household_family_stat             199523 non-null object
d_household_summary                 199523 non-null object
migration_msa                       99827 non-null object
migration_reg                       99827 non-null object
migration_within_reg                99827 non-null object
live_1_year_ago                     199523 non-null object
migration_sunbelt                   99827 non-null object
num_person_Worked_employer          199523 non-null int64
family_members_under_18             199523 non-null object
country_father                      192810 non-null object
country_mother                      193404 non-null object
country_self                        196130 non-null object
citizenship                         199523 non-null object
business_or_self_employed           199523 non-null int64
fill_questionnaire_veteran_admin    199523 non-null object
veterans_benefits                   199523 non-null int64
weeks_worked_in_year                199523 non-null int64
year                                199523 non-null int64
income_level                        199523 non-null int64
dtypes: int64(13), object(28)
memory usage: 62.4+ MB

**Numeric feature exploration**

In [7]:
import matplotlib.pyplot as plt
def num_tr(filed,n):
    fig=plt.figure(figsize=(10,5))
    train_df[filed].hist(bins=n)
    plt.title('%s'%filed)
    plt.show()

1. age
1.1 Distribution
In [8]:
num_tr('age',100)

As the plot shows, ages range from 0 to 90, and the counts fall as age increases.

My guess is that people under 20 and those who have just entered the workforce are unlikely to earn >50K, though this is not certain.
Group ages into 0-22, 22-35, 35-60, and 60-90, coded 0, 1, 2, 3 (22 is the typical age at college graduation, 35 marks the end of the early career, roughly the first 10 years, and 60 is retirement age); see the sketch below.
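A minimal sketch of that grouping, assuming ages lie in [0, 90]; note that the disabled cell below used decile bins instead:

labels=[0,1,2,3]
train_df['age_class']=pd.cut(train_df['age'],bins=[-1,22,35,60,90],labels=labels)
test_df['age_class']=pd.cut(test_df['age'],bins=[-1,22,35,60,90],labels=labels)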
In [9]:
'''
# create an age-group field
labels=[0,1,2,3,4,5,6,7,8,9]
train_df['age_class']=pd.cut(train_df['age'],bins=[-1,10,20,30,40,50,60,70,80,90,100],labels=labels)
test_df['age_class']=pd.cut(test_df['age'],bins=[-1,10,20,30,40,50,60,70,80,90,100],labels=labels)
'''


1.2 Age vs. the target variable
Income level 1 is concentrated between ages 30 and 50, and its age distribution is close to normal with a mean around 50.
In [10]:
'''
fig=plt.figure(figsize=(12,6))
train_df.groupby(['age_class','income_level'])['income_level'].count().unstack().plot(kind='bar')
plt.title('income_level wrt age')
plt.show()

'''


In [11]:
# age distribution by income level (KDE)
fig=plt.figure(figsize=(12,6))
train_df.age[train_df.income_level==0].plot(kind='kde')
train_df.age[train_df.income_level==1].plot(kind='kde')
plt.legend(('0','1'))
plt.show()

2. capital_losses & capital_gains
Both are right-skewed; to be analyzed further below.
In [12]:
fig=plt.figure(figsize=(8,4))
plt.subplot2grid((1,2),(0,0))
train_df.capital_gains.plot(kind='box')
plt.subplot2grid((1,2),(0,1))
train_df.capital_losses.plot(kind='box')
plt.show()

3.weeks_worked_in_year

Level 0 is concentrated at 0 and 50 weeks, while level 1 is mostly at 50; level-1 records almost never take low values of this variable.

In [13]:
# weeks-worked distribution by income level
fig=plt.figure(figsize=(8,4))
plt.subplot2grid((1,2),(0,0))
train_df.weeks_worked_in_year[train_df.income_level==0].hist(bins=20)
plt.subplot2grid((1,2),(0,1))
train_df.weeks_worked_in_year[train_df.income_level==1].hist(bins=20,color='r')
plt.show()

4. dividend_from_Stocks
Right-skewed; to be analyzed further below.
In [14]:
fig=plt.figure(figsize=(12,6))
train_df.dividend_from_Stocks[train_df.income_level==0].hist(bins=100)
train_df.dividend_from_Stocks[train_df.income_level==1].hist(bins=100)
plt.legend(('0','1'))
plt.show()

5. num_person_Worked_employer
Level 0 is mostly 0, while level 1 is mostly 6.
In [15]:
fig=plt.figure(figsize=(12,6))
#train_df.num_person_Worked_employer[train_df.income_level==0].hist(bins=100)
#train_df.num_person_Worked_employer[train_df.income_level==1].hist(bins=100)
train_df.groupby(['num_person_Worked_employer','income_level'])['income_level'].count().unstack().plot(kind='bar')
plt.legend(('0','1'))
plt.show()


**Categorical feature exploration**

1. class_of_worker
No specific information is provided about the 'Not in universe' category; we assume this answer was given by respondents who, for whatever reason, were frustrated with filling in the census form.
The variable looks unbalanced, with only two categories dominating. A good practice in such cases is to merge the levels whose frequency is below 5% of the total; this is handled later.
In [16]:
fig=plt.figure(figsize=(18,12))
train_df.groupby(['class_of_worker','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=120)
plt.legend(('0','1'))
plt.show()


2. education
Bachelors degree has the most level-1 records.
In [17]:
fig=plt.figure(figsize=(18,8))
train_df.groupby(['education','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()


3. marital_status
'Married-civilian spouse present' has the most level=1 records.
In [18]:
fig=plt.figure(figsize=(18,8))
train_df.groupby(['marital_status','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()


4. race
Whites make up the majority of the sample and also have the most level=1 records.
In [19]:
fig=plt.figure(figsize=(18,8))
train_df.groupby(['race','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()


5. sex
Women are the larger share of the sample overall, but level-1 records are mostly men.
In [20]:
fig=plt.figure(figsize=(18,8))
train_df.groupby(['sex','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()


6. member_of_labor_union
Both classes are concentrated in 'Not in universe'.
In [21]:
fig=plt.figure(figsize=(18,8))
train_df.groupby(['member_of_labor_union','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()


7.full_parttime_employment_stat
In [22]:
fig=plt.figure(figsize=(18,8))
train_df.groupby(['full_parttime_employment_stat','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()


8.tax_filer_status
In [23]:
fig=plt.figure(figsize=(18,8))
train_df.groupby(['tax_filer_status','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()


9.business_or_self_employed
In [24]:
fig=plt.figure(figsize=(18,8))
train_df.groupby(['business_or_self_employed','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()


3. Data Preprocessing

3.1 Missing values

The test data has no NaNs as read, but it does contain '?' entries, so first convert '?' to missing values.
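The conversion itself is not shown in the notebook; a minimal sketch, assuming the literal string '?' marks missing cells (an equivalent option is na_values='?' in pd.read_csv at load time):

train_df=train_df.replace('?',np.nan)
test_df=test_df.replace('?',np.nan)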
In [25]:
s=pd.Series(train_df.isnull().sum())
print s
ss=pd.Series(test_df.isnull().sum())
print ss

age                                     0
class_of_worker                         0
industry_code                           0
occupation_code                         0
education                               0
wage_per_hour                           0
enrolled_in_edu_inst_lastwk             0
marital_status                          0
major_industry_code                     0
major_occupation_code                   0
race                                    0
hispanic_origin                       874
sex                                     0
member_of_labor_union                   0
reason_for_unemployment                 0
full_parttime_employment_stat           0
capital_gains                           0
capital_losses                          0
dividend_from_Stocks                    0
tax_filer_status                        0
region_of_previous_residence            0
state_of_previous_residence           708
d_household_family_stat                 0
d_household_summary                     0
migration_msa                       99696
migration_reg                       99696
migration_within_reg                99696
live_1_year_ago                         0
migration_sunbelt                   99696
num_person_Worked_employer              0
family_members_under_18                 0
country_father                       6713
country_mother                       6119
country_self                         3393
citizenship                             0
business_or_self_employed               0
fill_questionnaire_veteran_admin        0
veterans_benefits                       0
weeks_worked_in_year                    0
year                                    0
income_level                            0
dtype: int64
age                                     0
class_of_worker                         0
industry_code                           0
occupation_code                         0
education                               0
wage_per_hour                           0
enrolled_in_edu_inst_lastwk             0
marital_status                          0
major_industry_code                     0
major_occupation_code                   0
race                                    0
hispanic_origin                       405
sex                                     0
member_of_labor_union                   0
reason_for_unemployment                 0
full_parttime_employment_stat           0
capital_gains                           0
capital_losses                          0
dividend_from_Stocks                    0
tax_filer_status                        0
region_of_previous_residence            0
state_of_previous_residence           330
d_household_family_stat                 0
d_household_summary                     0
migration_msa                       49946
migration_reg                       49946
migration_within_reg                49946
live_1_year_ago                         0
migration_sunbelt                   49946
num_person_Worked_employer              0
family_members_under_18                 0
country_father                       3429
country_mother                       3072
country_self                         1764
citizenship                             0
business_or_self_employed               0
fill_questionnaire_veteran_admin        0
veterans_benefits                       0
weeks_worked_in_year                    0
year                                    0
income_level                            0
dtype: int64

Compute the missing percentages
In [26]:
## missing ratio in the training sample
m=train_df.shape[0]
for i,j in s.iteritems():
    if j>0:
        print i,j*100.0/m
print '----------------------------'
## missing ratio in the test sample
n=test_df.shape[0]
for i,j in ss.iteritems():
    if j>0:
        print i,j*100.0/n

hispanic_origin 0.438044736697
state_of_previous_residence 0.354846308446
migration_msa 49.9671717045
migration_reg 49.9671717045
migration_within_reg 49.9671717045
migration_sunbelt 49.9671717045
country_father 3.36452439067
country_mother 3.06681435223
country_self 1.70055582564
----------------------------
hispanic_origin 0.405966199555
state_of_previous_residence 0.330787273711
migration_msa 50.0651550691
migration_reg 50.0651550691
migration_within_reg 50.0651550691
migration_sunbelt 50.0651550691
country_father 3.43718048957
country_mother 3.07932880255
country_self 1.76820833584

Drop the columns with too many missing values (four columns are each missing for close to 50% of rows)
In [27]:
del train_df['migration_msa']
del train_df['migration_reg']
del train_df['migration_within_reg']
del train_df['migration_sunbelt']
del test_df['migration_msa']
del test_df['migration_reg']
del test_df['migration_within_reg']
del test_df['migration_sunbelt']
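An equivalent single drop per frame, a sketch with the same effect as the del statements above:

cols=['migration_msa','migration_reg','migration_within_reg','migration_sunbelt']
train_df.drop(cols,axis=1,inplace=True)
test_df.drop(cols,axis=1,inplace=True)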

The other five variables have very small missing ratios, so fill the gaps with a new 'others' category (with imbalanced data, we prefer to shape the data rather than delete scarce records).
In [28]:
train_df['hispanic_origin']= train_df['hispanic_origin'].fillna('others')
train_df['state_of_previous_residence']= train_df['state_of_previous_residence'].fillna('others')
train_df['country_father']= train_df['country_father'].fillna('others')
train_df['country_mother']= train_df['country_mother'].fillna('others')
train_df['country_self']= train_df['country_self'].fillna('others')
test_df['hispanic_origin']= test_df['hispanic_origin'].fillna('others')
test_df['state_of_previous_residence']= test_df['state_of_previous_residence'].fillna('others')
test_df['country_father']= test_df['country_father'].fillna('others')
test_df['country_mother']= test_df['country_mother'].fillna('others')
test_df['country_self']= test_df['country_self'].fillna('others')

3.2 Outlier handling

Transform the extremely right-skewed columns with a log transform
In [29]:
def outliner(df,filed):
    df[filed]=np.log(df[filed]+1)
    df[filed].plot(kind='kde')

In [30]:
# training data
fig=plt.figure(figsize=(15,5))
plt.subplot2grid((2,2),(0,0))
outliner(train_df,'capital_losses') #.capital_losses&capital_gains
plt.subplot2grid((2,2),(0,1))
outliner(train_df,'capital_gains')
plt.subplot2grid((2,2),(1,0))
outliner(train_df,'dividend_from_Stocks')
plt.show()

In [31]:
## test data
fig=plt.figure(figsize=(15,5))
plt.subplot2grid((2,2),(0,0))
outliner(test_df,'capital_losses') #.capital_losses&capital_gains
plt.subplot2grid((2,2),(0,1))
outliner(test_df,'capital_gains')
plt.subplot2grid((2,2),(1,0))
outliner(test_df,'dividend_from_Stocks')
plt.show()
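np.log1p is the numerically stabler built-in for log(x+1); a hypothetical variant of outliner using it would give the same result on these non-negative columns:

def outliner_log1p(df,filed):      # hypothetical variant, not in the original run
    df[filed]=np.log1p(df[filed])  # identical to np.log(df[filed]+1)
    df[filed].plot(kind='kde')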

3.3 Dummy encoding

In [32]:
def dummy_encode(df,filed,a):
    dummies=pd.get_dummies(df[filed],prefix=a)
    n=dummies.shape[1]-1
    a=dummies.iloc[:,0:n]       # keep all but the last dummy column
    b=pd.concat([df,a],axis=1)
    del b[filed]
    return b
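Newer pandas versions can produce a drop-one encoding directly; a sketch for a single column (drop_first drops the first rather than the last dummy, but serves the same purpose of avoiding the dummy-variable trap):

df_all=pd.get_dummies(df_all,columns=['race'],prefix='race',drop_first=True)   # illustrative only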

In [33]:
print train_df.shape,test_df.shape

(199523, 37) (99762, 37)

In [34]:
df_all=pd.concat([train_df,test_df])

In [35]:
# dummy-encode the combined train and test data
cat_cols=['fill_questionnaire_veteran_admin','citizenship','country_self','country_mother',
          'country_father','family_members_under_18','live_1_year_ago','d_household_summary',
          'class_of_worker','education','enrolled_in_edu_inst_lastwk','marital_status',
          'major_industry_code','major_occupation_code','race','hispanic_origin','sex',
          'member_of_labor_union','reason_for_unemployment','full_parttime_employment_stat',
          'tax_filer_status','region_of_previous_residence','state_of_previous_residence',
          'd_household_family_stat']
for col in cat_cols:
    df_all=dummy_encode(df_all,col,col)

In [36]:
train_df=df_all.iloc[0:199523,:]
test_df=df_all.iloc[199523:,:]
print train_df.shape,test_df.shape

(199523, 352) (99762, 352)

In [37]:
#test_df.to_csv('testooooo.csv')
#train_df.to_csv('trainooooo.csv')

4. Feature Selection

Feature selection with a random forest
In [38]:
## move the target variable to the last column
Y=train_df['income_level']
del train_df['income_level']
train_df['income_level']=Y
YT=test_df['income_level']
del test_df['income_level']
test_df['income_level']=YT

D:\Anaconda2\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
D:\Anaconda2\lib\site-packages\ipykernel_launcher.py:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
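The reordering can also be written without the del/re-assign dance, which sidesteps the warning above; a hypothetical helper:

def move_target_last(df,target='income_level'):   # hypothetical helper, not in the original
    cols=[c for c in df.columns if c!=target]+[target]
    return df[cols].copy()   # .copy() avoids the SettingWithCopyWarning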

In [39]:
y=Y
X=train_df.iloc[:,0:351]

In [73]:
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

selected_feat_names=set()
for i in range(10):                          # run 10 rounds and keep the union of selected features
    tmp=set()
    rfc=RandomForestClassifier(n_jobs=-1)
    rfc.fit(X,y)
    importances=rfc.feature_importances_
    indices=np.argsort(importances)[::-1]    # sort in descending order
    S={}
    for f in range(X.shape[1]):
        if importances[indices[f]]>=0.0001:
            tmp.add(X.columns[indices[f]])
            S[X.columns[indices[f]]]=importances[indices[f]]
    selected_feat_names|=tmp
imp_fea=pd.Series(S)
print(len(selected_feat_names), "features are selected")

(285, 'features are selected')
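sklearn's SelectFromModel wraps the same idea in a single fit; a sketch (the ten-run union above is more conservative, since random forests are stochastic):

from sklearn.feature_selection import SelectFromModel
sfm=SelectFromModel(RandomForestClassifier(n_jobs=-1),threshold=0.0001)
sfm.fit(X,y)
selected=X.columns[sfm.get_support()]
print len(selected),'features are selected'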

In [41]:
train_new=train_df[['income_level']]
test_new=test_df[['income_level']]
for i in selected_feat_names:
    train_new[i]=train_df[i]
    try:
        test_new[i]=test_df[i]
    except Exception:
        print '----------------'
        print i
        del train_new[i]
print train_new.shape,test_new.shape

D:\Anaconda2\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
D:\Anaconda2\lib\site-packages\ipykernel_launcher.py:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

(199523, 292) (99762, 292)
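An equivalent vectorized selection (a sketch) builds both frames in one step and avoids the warning; the guard on common columns mirrors the try/except above:

common=[c for c in selected_feat_names if c in test_df.columns]
train_new=train_df[common+['income_level']].copy()
test_new=test_df[common+['income_level']].copy()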

In [42]:
## move the target variable to the last column
Y=train_new['income_level']
del train_new['income_level']
train_new['income_level']=Y
YT=test_new['income_level']
del test_new['income_level']
test_new['income_level']=YT

D:\Anaconda2\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
D:\Anaconda2\lib\site-packages\ipykernel_launcher.py:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

In [43]:
#train_new.to_csv('train_new.csv')
#test_new.to_csv('test_new.csv')

5. Machine Learning

First: imbalance handling (undersampling and oversampling).
Next: model selection and training (XGBoost).
Finally: parameter tuning; the aim is to maximize AUC subject to keeping accuracy above 0.94.

5.1 Imbalance handling: undersampling and oversampling

5.1.1 Undersampling
train_df  (1,0):(6.20580083499,93.794199165)
test_df  (1,0):(6.20075780357,93.7992421964)
Positives: 12382; negatives: 187141
Sampling fraction: 25% of the negatives
Resulting positive rate: about 21%
In [44]:
def down_sample(df):
    df1=df[df['income_level']==1]   # positives
    df2=df[df['income_level']==0]   # negatives
    df3=df2.sample(frac=0.25)       # sample 25% of the negatives
    return pd.concat([df1,df3],ignore_index=True)

In [45]:
down_train_df=down_sample(train_df)
down_train_new=down_sample(train_new)
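A quick check of the ~21% positive rate claimed above (sketch):

print 'positive rate after undersampling: %.2f%%' % (down_train_new['income_level'].mean()*100)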

5.1.2 Oversampling
train_df  (1,0):(6.20580083499,93.794199165)
test_df  (1,0):(6.20075780357,93.7992421964)
Positives: 12382; negatives: 187141
Positives duplicated 5 times
Resulting positive rate: about 25%
In [46]:
def up_sample(df):
    df1=df[df['income_level']==1]   # positives
    df2=df[df['income_level']==0]   # negatives
    df3=pd.concat([df1,df1,df1,df1,df1],ignore_index=True)   # replicate the positives 5x
    return pd.concat([df2,df3],ignore_index=True)

In [47]:
up_train_df=up_sample(train_df)
up_train_new=up_sample(train_new)
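And the corresponding check of the ~25% rate after oversampling (sketch):

print 'positive rate after oversampling: %.2f%%' % (up_train_new['income_level'].mean()*100)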

5.2 Machine learning with XGBoost

In [48]:
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import metrics
from sklearn.cross_validation import train_test_split
# record the running time
import time

## define the model parameters
param = {}
# use logistic regression loss
param['objective'] = 'binary:logistic'
# scale weight of positive examples
param['scale_pos_weight'] = 1
param['bst:eta'] = 0.2
param['bst:max_depth'] = 6
param['eval_metric'] = 'logloss'
param['silent'] = 1
param['nthread'] = 10
Threshold=0.5

D:\Anaconda2\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.

In [74]:
def xgb_model(train,tests,list,pam,Threshold):
    # random_state has a big influence on val-auc
    train_xy,val=train_test_split(train,test_size=0.3,random_state=1)
    y=train_xy['income_level']
    X=train_xy.drop(['income_level'],axis=1)
    val_y=val['income_level']
    val_X=val.drop(['income_level'],axis=1)
    weight1=np.ones(len(y))
    weight2=np.ones(len(val_y))
    xgb_val=xgb.DMatrix(val_X,label=val_y,weight=weight2)
    xgb_train=xgb.DMatrix(X,label=y,weight=weight1)
    test_y=tests['income_level']
    test_X=tests.drop(['income_level'],axis=1)
    xgb_test=xgb.DMatrix(test_X)   # no label for the test matrix
    watchlist=[(xgb_train,'train'),(xgb_val,'val')]
    num_round=100   # number of boosting rounds
    print ("training xgboost")
    threads=list    # candidate values for the parameter being tuned
    for i in threads:
        param[pam]=i
        tmp=time.time()
        plst=param.items()+[('eval_metric','ams@0.15')]
        model=xgb.train(plst,xgb_train,num_round,watchlist,verbose_eval=False)
        preds=model.predict(xgb_test,ntree_limit=model.best_ntree_limit)
        print pam,i
        print ("XGBoost with %d thread costs: %s seconds" % (i,str(time.time()-tmp)))
        for j in range(len(preds)):
            if preds[j]>=Threshold:
                preds[j]=1
            else:
                preds[j]=0
        print 'AUC: %.4f' % metrics.roc_auc_score(test_y,preds)
        print 'ACC: %.4f' % metrics.accuracy_score(test_y,preds)
    return model
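One caveat: preds is binarized before metrics.roc_auc_score, so the reported AUC is the AUC of the hard 0/1 predictions rather than of the scores. Scoring the raw probabilities is the more conventional reading; a sketch of the two lines that would go inside the loop before thresholding:

raw=model.predict(xgb_test,ntree_limit=model.best_ntree_limit)   # raw probabilities
print 'AUC (raw scores): %.4f' % metrics.roc_auc_score(test_y,raw)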

**Tune the parameters separately on the feature-selected train_new and on its oversampled and undersampled versions, then pick the best model from these runs**

5.2.1 Tuning max_depth

Candidate depths: [4, 6, 8, 9, 10]
With oversampling, depths 9 and 10 are worth further tuning because ACC crosses 0.94 there.
With undersampling, depth 4 would be the choice.
In [54]:
### training set without resampling
list1=[4,6,8,9,10]
pam='max_depth'
xgb_model(train_new,test_new,list1,pam,Threshold)

training xgboost
max_depth 4
XGBoost with 4 thread costs: 56.1489999294 seconds
AUC: 0.7273
ACC: 0.9572
max_depth 6
XGBoost with 6 thread costs: 70.1800000668 seconds
AUC: 0.7364
ACC: 0.9581
max_depth 8
XGBoost with 8 thread costs: 90.8049998283 seconds
AUC: 0.7435
ACC: 0.9585
max_depth 9
XGBoost with 9 thread costs: 101.267000198 seconds
AUC: 0.7387
ACC: 0.9579
max_depth 10
XGBoost with 10 thread costs: 112.95600009 seconds
AUC: 0.7417
ACC: 0.9577

In [55]:
### undersampled training set
xgb_model(down_train_new,test_new,list1,pam,Threshold)

training xgboost
max_depth 4
XGBoost with 4 thread costs: 14.1289999485 seconds
AUC: 0.8398
ACC: 0.9395
max_depth 6
XGBoost with 6 thread costs: 19.8990001678 seconds
AUC: 0.8398
ACC: 0.9385
max_depth 8
XGBoost with 8 thread costs: 26.1499998569 seconds
AUC: 0.8410
ACC: 0.9376
max_depth 9
XGBoost with 9 thread costs: 29.2709999084 seconds
AUC: 0.8396
ACC: 0.9373
max_depth 10
XGBoost with 10 thread costs: 32.6649999619 seconds
AUC: 0.8380
ACC: 0.9368

In [56]:
### oversampled training set
xgb_model(up_train_new,test_new,list1,pam,Threshold)

training xgboost
max_depth 4
XGBoost with 4 thread costs: 66.6740000248 seconds
AUC: 0.8555
ACC: 0.9333
max_depth 6
XGBoost with 6 thread costs: 88.0920000076 seconds
AUC: 0.8545
ACC: 0.9352
max_depth 8
XGBoost with 8 thread costs: 113.984999895 seconds
AUC: 0.8507
ACC: 0.9380
max_depth 9
XGBoost with 9 thread costs: 127.167999983 seconds
AUC: 0.8485
ACC: 0.9398
max_depth 10
XGBoost with 10 thread costs: 142.092000008 seconds
AUC: 0.8416
ACC: 0.9410

5.2.2 Tuning scale_pos_weight

Candidates: [0.8, 0.9, 1.0, 1.1, 1.2]
After the previous step, both resampled sets show good AUC but insufficient ACC,
while the unsampled set's AUC peaks around 0.74, so that set is dropped;
undersampling cannot reach ACC 0.94.
Below we focus on the oversampled set at depths 9 and 10. Result: choose scale_pos_weight=1.0 with max_depth=9, giving AUC 0.8485 and ACC 0.9398.
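The XGBoost docs suggest sum(negatives)/sum(positives) as a starting point for scale_pos_weight; from the counts above that is roughly 187141/12382 ≈ 15 on the raw data and ≈ 3 after 5x oversampling, so the grid around 1.0 here is a deliberate departure. A quick computation (sketch):

pos=(up_train_new['income_level']==1).sum()
neg=(up_train_new['income_level']==0).sum()
print 'neg/pos ratio on the oversampled set: %.2f' % (float(neg)/pos)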
In [59]:
### oversampled set, max_depth=10
param['max_depth'] = 10
list2=[0.8,0.9,1.0,1.1,1.2]
pam='scale_pos_weight'
xgb_model(up_train_new,test_new,list2,pam,Threshold)

training xgboost
scale_pos_weight 0.8
XGBoost with 0 thread costs: 145.473999977 seconds
AUC: 0.8300
ACC: 0.9456
scale_pos_weight 0.9
XGBoost with 0 thread costs: 141.223999977 seconds
AUC: 0.8359
ACC: 0.9430
scale_pos_weight 1.0
XGBoost with 1 thread costs: 143.575999975 seconds
AUC: 0.8416
ACC: 0.9410
scale_pos_weight 1.1
XGBoost with 1 thread costs: 171.828999996 seconds
AUC: 0.8442
ACC: 0.9392
scale_pos_weight 1.2
XGBoost with 1 thread costs: 165.302000046 seconds
AUC: 0.8500
ACC: 0.9364

In [60]:
### oversampled set, max_depth=9
param['max_depth'] = 9
list2=[0.8,0.9,1.0,1.1,1.2]
pam='scale_pos_weight'
Threshold=0.5
xgb_model(up_train_new,test_new,list2,pam,Threshold)

training xgboost
scale_pos_weight 0.8
XGBoost with 0 thread costs: 142.86500001 seconds
AUC: 0.8362
ACC: 0.9445
scale_pos_weight 0.9
XGBoost with 0 thread costs: 145.427999973 seconds
AUC: 0.8412
ACC: 0.9413
scale_pos_weight 1.0
XGBoost with 1 thread costs: 141.220000029 seconds
AUC: 0.8485
ACC: 0.9398
scale_pos_weight 1.1
XGBoost with 1 thread costs: 144.178000212 seconds
AUC: 0.8519
ACC: 0.9363
scale_pos_weight 1.2
XGBoost with 1 thread costs: 143.682999849 seconds
AUC: 0.8547
ACC: 0.9349

5.2.3 Tuning the classification threshold

Threshold from 0.45 to 0.55 in steps of 0.01.
Final choice: Threshold = 0.51,
which gives:
AUC: 0.8466
ACC: 0.9409
In [64]:
for m in np.arange(0.45,0.55,0.01):
    param['scale_pos_weight']=1.0
    list3=[9]
    pam='max_depth'
    Threshold=m
    print 'Threshold=',m
    xgb_model(up_train_new,test_new,list3,pam,Threshold)

Threshold= 0.45
training xgboost
max_depth 9
XGBoost with 9 thread costs: 162.524000168 seconds
AUC: 0.8565
ACC: 0.9337
Threshold= 0.46
training xgboost
max_depth 9
XGBoost with 9 thread costs: 160.478999853 seconds
AUC: 0.8555
ACC: 0.9352
Threshold= 0.47
training xgboost
max_depth 9
XGBoost with 9 thread costs: 153.284000158 seconds
AUC: 0.8537
ACC: 0.9363
Threshold= 0.48
training xgboost
max_depth 9
XGBoost with 9 thread costs: 153.253999949 seconds
AUC: 0.8523
ACC: 0.9375
Threshold= 0.49
training xgboost
max_depth 9
XGBoost with 9 thread costs: 152.437000036 seconds
AUC: 0.8499
ACC: 0.9384
Threshold= 0.5
training xgboost
max_depth 9
XGBoost with 9 thread costs: 143.042000055 seconds
AUC: 0.8485
ACC: 0.9398
Threshold= 0.51
training xgboost
max_depth 9
XGBoost with 9 thread costs: 136.401000023 seconds
AUC: 0.8466
ACC: 0.9409
Threshold= 0.52
training xgboost
max_depth 9
XGBoost with 9 thread costs: 139.329999924 seconds
AUC: 0.8445
ACC: 0.9418
Threshold= 0.53
training xgboost
max_depth 9
XGBoost with 9 thread costs: 138.738999844 seconds
AUC: 0.8425
ACC: 0.9428
Threshold= 0.54
training xgboost
max_depth 9
XGBoost with 9 thread costs: 156.219000101 seconds
AUC: 0.8404
ACC: 0.9436
Threshold= 0.55
training xgboost
max_depth 9
XGBoost with 9 thread costs: 164.50999999 seconds
AUC: 0.8381
ACC: 0.9446

5.3 Visualizing the important features

In [72]:
from xgboost import plot_importance
import matplotlib.pyplot as plt
from graphviz import Digraph
import pydot

In [76]:
param['scale_pos_weight'] = 1.0
list3=[9]
pam='max_depth'
Threshold=0.51
model=xgb_model(up_train_new,test_new,list3,pam,Threshold)

training xgboost
max_depth 9
XGBoost with 9 thread costs: 156.400000095 seconds
AUC: 0.8466
ACC: 0.9409

In [102]:
imp_feat=imp_fea.sort_values()[::-1]
feat_imp=imp_feat[:30]

In [103]:
feat_imp.plot(kind='bar')

Out[103]:
<matplotlib.axes._subplots.AxesSubplot at 0xe8124a8>
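plot_importance was imported earlier but never called; it reads the same importances straight from the trained booster (a sketch; the max_num_features argument may require a newer xgboost than the one used here):

plot_importance(model,max_num_features=30)
plt.show()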
