1. Project Analysis and Design

This project trains a model on US census data to predict income level. The dataset contains 199,523 training records and 99,762 test records, each with 41 attributes. On inspection, the data covers demographic and financial information such as age, employment, nationality, and race. The attributes suffer from missing values, skewed distributions, and similar issues, so the workflow is:
1. Read the data and inspect the features and their distributions
2. Analyze and handle missing values
3. Handle outliers
4. Dummy-encode the categorical variables
5. Select important features with a random forest
6. Resample to address the class imbalance
7. Build an XGBoost model and use it for prediction and analysis

2. Data Exploration

In [1]:
import numpy as np
import pandas as pd

In [2]:
train_df=pd.read_csv('train.csv')
test_df=pd.read_csv('test.csv')
print ('train_df:%s,%s'%train_df.shape)
print ('test_df:%s,%s'%test_df.shape)

train_df:199523,41
test_df:99762,41

In [3]:
## check the target variable
train_df.income_level.unique()
test_df.income_level.unique()

Out[3]:
array(['-50000', '50000+.'], dtype=object)

In [4]:
# encode the target as 0/1 for easier analysis
train_df.loc[train_df['income_level']==-50000,'income_level']=0
train_df.loc[train_df['income_level']== 50000,'income_level']=1
test_df.loc[test_df['income_level']=='-50000','income_level']=0
test_df.loc[test_df['income_level']=='50000+.','income_level']=1

In [5]:
## check the degree of class imbalance
a=train_df['income_level'].sum()*100.0/train_df['income_level'].count()
b=test_df['income_level'].sum()*100.0/test_df['income_level'].count()
print ('train_df  (1,0):(%s,%s)'%(a,100-a))
print ('test_df  (1,0):(%s,%s)'%(b,100-b))

train_df  (1,0):(6.20580083499,93.794199165)
test_df  (1,0):(6.20075780357,93.7992421964)
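For reference, pandas' value_counts gives the same percentages directly; a one-line check, assuming income_level is already encoded as 0/1:

print train_df['income_level'].value_counts(normalize=True)*100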

Inspect the data

In [6]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199523 entries, 0 to 199522
Data columns (total 41 columns):
age                                 199523 non-null int64
class_of_worker                     199523 non-null object
industry_code                       199523 non-null int64
occupation_code                     199523 non-null int64
education                           199523 non-null object
wage_per_hour                       199523 non-null int64
enrolled_in_edu_inst_lastwk         199523 non-null object
marital_status                      199523 non-null object
major_industry_code                 199523 non-null object
major_occupation_code               199523 non-null object
race                                199523 non-null object
hispanic_origin                     198649 non-null object
sex                                 199523 non-null object
member_of_labor_union               199523 non-null object
reason_for_unemployment             199523 non-null object
full_parttime_employment_stat       199523 non-null object
capital_gains                       199523 non-null int64
capital_losses                      199523 non-null int64
dividend_from_Stocks                199523 non-null int64
tax_filer_status                    199523 non-null object
region_of_previous_residence        199523 non-null object
state_of_previous_residence         198815 non-null object
d_household_family_stat             199523 non-null object
d_household_summary                 199523 non-null object
migration_msa                       99827 non-null object
migration_reg                       99827 non-null object
migration_within_reg                99827 non-null object
live_1_year_ago                     199523 non-null object
migration_sunbelt                   99827 non-null object
num_person_Worked_employer          199523 non-null int64
family_members_under_18             199523 non-null object
country_father                      192810 non-null object
country_mother                      193404 non-null object
country_self                        196130 non-null object
citizenship                         199523 non-null object
business_or_self_employed           199523 non-null int64
fill_questionnaire_veteran_admin    199523 non-null object
veterans_benefits                   199523 non-null int64
weeks_worked_in_year                199523 non-null int64
year                                199523 non-null int64
income_level                        199523 non-null int64
dtypes: int64(13), object(28)
memory usage: 62.4+ MB

**Numeric feature exploration**

In [7]:
import matplotlib.pyplot as plt
def num_tr(filed,n):
    fig=plt.figure(figsize=(10,5))
    train_df[filed].hist(bins=n)
    plt.title('%s'%filed)
    plt.show()

1. age
1.1 Distribution
In [8]:
num_tr('age',100)

As the plot shows, ages range from 0 to 90, and the counts fall as age increases.

My guess is that people under 20 and those who have just entered the workforce are unlikely to earn >50K, though this is not certain.
Group ages into 0-22, 22-35, 35-60, and 60-90, coded 0, 1, 2, 3 (22 is the typical age at college graduation, 35 marks the end of the early career, roughly the first 10 years, and 60 is retirement age); see the sketch below.
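A minimal sketch of that grouping, assuming ages lie in [0, 90]; note that the disabled cell below used decile bins instead:

labels=[0,1,2,3]
train_df['age_class']=pd.cut(train_df['age'],bins=[-1,22,35,60,90],labels=labels)
test_df['age_class']=pd.cut(test_df['age'],bins=[-1,22,35,60,90],labels=labels)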
In [9]:
'''
# create an age-group field
labels=[0,1,2,3,4,5,6,7,8,9]
train_df['age_class']=pd.cut(train_df['age'],bins=[-1,10,20,30,40,50,60,70,80,90,100],labels=labels)
test_df['age_class']=pd.cut(test_df['age'],bins=[-1,10,20,30,40,50,60,70,80,90,100],labels=labels)
'''


1.2 Age vs. the target variable
Income level 1 is concentrated between ages 30 and 50, and its age distribution is close to normal with a mean around 50.
In [10]:
'''
fig=plt.figure(figsize=(12,6))
train_df.groupby(['age_class','income_level'])['income_level'].count().unstack().plot(kind='bar')
plt.title('income_level wrt age')
plt.show()

'''


In [11]:
# age distribution by income level (KDE)
fig=plt.figure(figsize=(12,6))
train_df.age[train_df.income_level==0].plot(kind='kde')
train_df.age[train_df.income_level==1].plot(kind='kde')
plt.legend(('0','1'))
plt.show()

2. capital_losses & capital_gains
Both are right-skewed; to be analyzed further below.
In [12]:
fig=plt.figure(figsize=(8,4))
plt.subplot2grid((1,2),(0,0))
train_df.capital_gains.plot(kind='box')
plt.subplot2grid((1,2),(0,1))
train_df.capital_losses.plot(kind='box')
plt.show()

3.weeks_worked_in_year

Level 0 is concentrated at 0 and 50 weeks, while level 1 is mostly at 50; level-1 records almost never take low values of this variable.

In [13]:
# weeks-worked distribution by income level
fig=plt.figure(figsize=(8,4))
plt.subplot2grid((1,2),(0,0))
train_df.weeks_worked_in_year[train_df.income_level==0].hist(bins=20)
plt.subplot2grid((1,2),(0,1))
train_df.weeks_worked_in_year[train_df.income_level==1].hist(bins=20,color='r')
plt.show()

4. dividend_from_Stocks
Right-skewed; to be analyzed further below.
In [14]:
fig=plt.figure(figsize=(12,6))
train_df.dividend_from_Stocks[train_df.income_level==0].hist(bins=100)
train_df.dividend_from_Stocks[train_df.income_level==1].hist(bins=100)
plt.legend(('0','1'))
plt.show()

5. num_person_Worked_employer
Level 0 is mostly 0, while level 1 is mostly 6.
In [15]:
fig=plt.figure(figsize=(12,6))
#train_df.num_person_Worked_employer[train_df.income_level==0].hist(bins=100)
#train_df.num_person_Worked_employer[train_df.income_level==1].hist(bins=100)
train_df.groupby(['num_person_Worked_employer','income_level'])['income_level'].count().unstack().plot(kind='bar')
plt.legend(('0','1'))
plt.show()


**Categorical feature exploration**

1. class_of_worker
No specific information is provided about the 'Not in universe' category; we assume this answer was given by respondents who, for whatever reason, were frustrated with filling in the census form.
The variable looks unbalanced, with only two categories dominating. A good practice in such cases is to merge the levels whose frequency is below 5% of the total; this is handled later.
In [16]:
fig=plt.figure(figsize=(18,12))
train_df.groupby(['class_of_worker','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=120)
plt.legend(('0','1'))
plt.show()


2. education
Bachelors degree has the most level-1 records.
In [17]:
fig=plt.figure(figsize=(18,8))
train_df.groupby(['education','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()


3. marital_status
'Married-civilian spouse present' has the most level=1 records.
In [18]:
fig=plt.figure(figsize=(18,8))
train_df.groupby(['marital_status','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()


4. race
Whites make up the majority of the sample and also have the most level=1 records.
In [19]:
fig=plt.figure(figsize=(18,8))
train_df.groupby(['race','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()


5. sex
Women are the larger share of the sample overall, but level-1 records are mostly men.
In [20]:
fig=plt.figure(figsize=(18,8))
train_df.groupby(['sex','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()


6. member_of_labor_union
Both classes are concentrated in 'Not in universe'.
In [21]:
fig=plt.figure(figsize=(18,8))
train_df.groupby(['member_of_labor_union','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()


7.full_parttime_employment_stat
In [22]:
fig=plt.figure(figsize=(18,8))
train_df.groupby(['full_parttime_employment_stat','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()


8.tax_filer_status
In [23]:
fig=plt.figure(figsize=(18,8))
train_df.groupby(['tax_filer_status','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()


9.business_or_self_employed
In [24]:
fig=plt.figure(figsize=(18,8))
train_df.groupby(['business_or_self_employed','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()


3. Data Preprocessing

3.1 Missing values

The test data has no NaNs as read, but it does contain '?' entries, so first convert '?' to missing values.
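The conversion itself is not shown in the notebook; a minimal sketch, assuming the literal string '?' marks missing cells (an equivalent option is na_values='?' in pd.read_csv at load time):

train_df=train_df.replace('?',np.nan)
test_df=test_df.replace('?',np.nan)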
In [25]:
s=pd.Series(train_df.isnull().sum())
print s
ss=pd.Series(test_df.isnull().sum())
print ss

age                                     0
class_of_worker                         0
industry_code                           0
occupation_code                         0
education                               0
wage_per_hour                           0
enrolled_in_edu_inst_lastwk             0
marital_status                          0
major_industry_code                     0
major_occupation_code                   0
race                                    0
hispanic_origin                       874
sex                                     0
member_of_labor_union                   0
reason_for_unemployment                 0
full_parttime_employment_stat           0
capital_gains                           0
capital_losses                          0
dividend_from_Stocks                    0
tax_filer_status                        0
region_of_previous_residence            0
state_of_previous_residence           708
d_household_family_stat                 0
d_household_summary                     0
migration_msa                       99696
migration_reg                       99696
migration_within_reg                99696
live_1_year_ago                         0
migration_sunbelt                   99696
num_person_Worked_employer              0
family_members_under_18                 0
country_father                       6713
country_mother                       6119
country_self                         3393
citizenship                             0
business_or_self_employed               0
fill_questionnaire_veteran_admin        0
veterans_benefits                       0
weeks_worked_in_year                    0
year                                    0
income_level                            0
dtype: int64
age                                     0
class_of_worker                         0
industry_code                           0
occupation_code                         0
education                               0
wage_per_hour                           0
enrolled_in_edu_inst_lastwk             0
marital_status                          0
major_industry_code                     0
major_occupation_code                   0
race                                    0
hispanic_origin                       405
sex                                     0
member_of_labor_union                   0
reason_for_unemployment                 0
full_parttime_employment_stat           0
capital_gains                           0
capital_losses                          0
dividend_from_Stocks                    0
tax_filer_status                        0
region_of_previous_residence            0
state_of_previous_residence           330
d_household_family_stat                 0
d_household_summary                     0
migration_msa                       49946
migration_reg                       49946
migration_within_reg                49946
live_1_year_ago                         0
migration_sunbelt                   49946
num_person_Worked_employer              0
family_members_under_18                 0
country_father                       3429
country_mother                       3072
country_self                         1764
citizenship                             0
business_or_self_employed               0
fill_questionnaire_veteran_admin        0
veterans_benefits                       0
weeks_worked_in_year                    0
year                                    0
income_level                            0
dtype: int64

Compute the missing percentages
In [26]:
## missing ratio in the training sample
m=train_df.shape[0]
for i,j in s.iteritems():
    if j>0:
        print i,j*100.0/m
print '----------------------------'
## missing ratio in the test sample
n=test_df.shape[0]
for i,j in ss.iteritems():
    if j>0:
        print i,j*100.0/n

hispanic_origin 0.438044736697
state_of_previous_residence 0.354846308446
migration_msa 49.9671717045
migration_reg 49.9671717045
migration_within_reg 49.9671717045
migration_sunbelt 49.9671717045
country_father 3.36452439067
country_mother 3.06681435223
country_self 1.70055582564
----------------------------
hispanic_origin 0.405966199555
state_of_previous_residence 0.330787273711
migration_msa 50.0651550691
migration_reg 50.0651550691
migration_within_reg 50.0651550691
migration_sunbelt 50.0651550691
country_father 3.43718048957
country_mother 3.07932880255
country_self 1.76820833584

Drop the columns with too many missing values (four columns are each missing for close to 50% of rows)
In [27]:
del train_df['migration_msa']
del train_df['migration_reg']
del train_df['migration_within_reg']
del train_df['migration_sunbelt']
del test_df['migration_msa']
del test_df['migration_reg']
del test_df['migration_within_reg']
del test_df['migration_sunbelt']
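An equivalent single drop per frame, a sketch with the same effect as the del statements above:

cols=['migration_msa','migration_reg','migration_within_reg','migration_sunbelt']
train_df.drop(cols,axis=1,inplace=True)
test_df.drop(cols,axis=1,inplace=True)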

The other five variables have very small missing ratios, so fill the gaps with a new 'others' category (with imbalanced data, we prefer to shape the data rather than delete scarce records).
In [28]:
train_df['hispanic_origin']= train_df['hispanic_origin'].fillna('others')
train_df['state_of_previous_residence']= train_df['state_of_previous_residence'].fillna('others')
train_df['country_father']= train_df['country_father'].fillna('others')
train_df['country_mother']= train_df['country_mother'].fillna('others')
train_df['country_self']= train_df['country_self'].fillna('others')
test_df['hispanic_origin']= test_df['hispanic_origin'].fillna('others')
test_df['state_of_previous_residence']= test_df['state_of_previous_residence'].fillna('others')
test_df['country_father']= test_df['country_father'].fillna('others')
test_df['country_mother']= test_df['country_mother'].fillna('others')
test_df['country_self']= test_df['country_self'].fillna('others')

3.2 Outlier handling

Transform the extremely right-skewed columns with a log transform
In [29]:
def outliner(df,filed):
    df[filed]=np.log(df[filed]+1)
    df[filed].plot(kind='kde')

In [30]:
# training data
fig=plt.figure(figsize=(15,5))
plt.subplot2grid((2,2),(0,0))
outliner(train_df,'capital_losses') #.capital_losses&capital_gains
plt.subplot2grid((2,2),(0,1))
outliner(train_df,'capital_gains')
plt.subplot2grid((2,2),(1,0))
outliner(train_df,'dividend_from_Stocks')
plt.show()

In [31]:
## test data
fig=plt.figure(figsize=(15,5))
plt.subplot2grid((2,2),(0,0))
outliner(test_df,'capital_losses') #.capital_losses&capital_gains
plt.subplot2grid((2,2),(0,1))
outliner(test_df,'capital_gains')
plt.subplot2grid((2,2),(1,0))
outliner(test_df,'dividend_from_Stocks')
plt.show()
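np.log1p is the numerically stabler built-in for log(x+1); a hypothetical variant of outliner using it would give the same result on these non-negative columns:

def outliner_log1p(df,filed):      # hypothetical variant, not in the original run
    df[filed]=np.log1p(df[filed])  # identical to np.log(df[filed]+1)
    df[filed].plot(kind='kde')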

3.3 Dummy encoding

In [32]:
def dummy_encode(df,filed,a):
    dummies=pd.get_dummies(df[filed],prefix=a)
    n=dummies.shape[1]-1
    a=dummies.iloc[:,0:n]       # keep all but the last dummy column
    b=pd.concat([df,a],axis=1)
    del b[filed]
    return b
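Newer pandas versions can produce a drop-one encoding directly; a sketch for a single column (drop_first drops the first rather than the last dummy, but serves the same purpose of avoiding the dummy-variable trap):

df_all=pd.get_dummies(df_all,columns=['race'],prefix='race',drop_first=True)   # illustrative only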

In [33]:
print train_df.shape,test_df.shape

(199523, 37) (99762, 37)

In [34]:
df_all=pd.concat([train_df,test_df])

In [35]:
# dummy-encode the combined train and test data
cat_cols=['fill_questionnaire_veteran_admin','citizenship','country_self','country_mother',
          'country_father','family_members_under_18','live_1_year_ago','d_household_summary',
          'class_of_worker','education','enrolled_in_edu_inst_lastwk','marital_status',
          'major_industry_code','major_occupation_code','race','hispanic_origin','sex',
          'member_of_labor_union','reason_for_unemployment','full_parttime_employment_stat',
          'tax_filer_status','region_of_previous_residence','state_of_previous_residence',
          'd_household_family_stat']
for col in cat_cols:
    df_all=dummy_encode(df_all,col,col)

In [36]:
train_df=df_all.iloc[0:199523,:]
test_df=df_all.iloc[199523:,:]
print train_df.shape,test_df.shape

(199523, 352) (99762, 352)

In [37]:
#test_df.to_csv('testooooo.csv')
#train_df.to_csv('trainooooo.csv')

4. Feature Selection

Feature selection with a random forest
In [38]:
## move the target variable to the last column
Y=train_df['income_level']
del train_df['income_level']
train_df['income_level']=Y
YT=test_df['income_level']
del test_df['income_level']
test_df['income_level']=YT

D:\Anaconda2\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
D:\Anaconda2\lib\site-packages\ipykernel_launcher.py:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
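The reordering can also be written without the del/re-assign dance, which sidesteps the warning above; a hypothetical helper:

def move_target_last(df,target='income_level'):   # hypothetical helper, not in the original
    cols=[c for c in df.columns if c!=target]+[target]
    return df[cols].copy()   # .copy() avoids the SettingWithCopyWarning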

In [39]:
y=Y
X=train_df.iloc[:,0:351]

In [73]:
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

selected_feat_names=set()
for i in range(10):                          # run 10 rounds and keep the union of selected features
    tmp=set()
    rfc=RandomForestClassifier(n_jobs=-1)
    rfc.fit(X,y)
    importances=rfc.feature_importances_
    indices=np.argsort(importances)[::-1]    # sort in descending order
    S={}
    for f in range(X.shape[1]):
        if importances[indices[f]]>=0.0001:
            tmp.add(X.columns[indices[f]])
            S[X.columns[indices[f]]]=importances[indices[f]]
    selected_feat_names|=tmp
imp_fea=pd.Series(S)
print(len(selected_feat_names), "features are selected")

(285, 'features are selected')
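sklearn's SelectFromModel wraps the same idea in a single fit; a sketch (the ten-run union above is more conservative, since random forests are stochastic):

from sklearn.feature_selection import SelectFromModel
sfm=SelectFromModel(RandomForestClassifier(n_jobs=-1),threshold=0.0001)
sfm.fit(X,y)
selected=X.columns[sfm.get_support()]
print len(selected),'features are selected'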

In [41]:
train_new=train_df[['income_level']]
test_new=test_df[['income_level']]
for i in selected_feat_names:
    train_new[i]=train_df[i]
    try:
        test_new[i]=test_df[i]
    except Exception:
        print '----------------'
        print i
        del train_new[i]
print train_new.shape,test_new.shape

D:\Anaconda2\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
D:\Anaconda2\lib\site-packages\ipykernel_launcher.py:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

(199523, 292) (99762, 292)
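An equivalent vectorized selection (a sketch) builds both frames in one step and avoids the warning; the guard on common columns mirrors the try/except above:

common=[c for c in selected_feat_names if c in test_df.columns]
train_new=train_df[common+['income_level']].copy()
test_new=test_df[common+['income_level']].copy()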

In [42]:
## move the target variable to the last column
Y=train_new['income_level']
del train_new['income_level']
train_new['income_level']=Y
YT=test_new['income_level']
del test_new['income_level']
test_new['income_level']=YT

D:\Anaconda2\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
D:\Anaconda2\lib\site-packages\ipykernel_launcher.py:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

In [43]:
#train_new.to_csv('train_new.csv')
#test_new.to_csv('test_new.csv')

5. Machine Learning

First: imbalance handling (undersampling and oversampling).
Next: model selection and training (XGBoost).
Finally: parameter tuning; the aim is to maximize AUC subject to keeping accuracy above 0.94.

5.1 Imbalance handling: undersampling and oversampling

5.1.1 Undersampling
train_df  (1,0):(6.20580083499,93.794199165)
test_df  (1,0):(6.20075780357,93.7992421964)
Positives: 12382; negatives: 187141
Sampling fraction: 25% of the negatives
Resulting positive rate: about 21%
In [44]:
def down_sample(df):
    df1=df[df['income_level']==1]   # positives
    df2=df[df['income_level']==0]   # negatives
    df3=df2.sample(frac=0.25)       # sample 25% of the negatives
    return pd.concat([df1,df3],ignore_index=True)

In [45]:
down_train_df=down_sample(train_df)
down_train_new=down_sample(train_new)
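A quick check of the ~21% positive rate claimed above (sketch):

print 'positive rate after undersampling: %.2f%%' % (down_train_new['income_level'].mean()*100)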

5.1.2 Oversampling
train_df  (1,0):(6.20580083499,93.794199165)
test_df  (1,0):(6.20075780357,93.7992421964)
Positives: 12382; negatives: 187141
Positives duplicated 5 times
Resulting positive rate: about 25%
In [46]:
def up_sample(df):
    df1=df[df['income_level']==1]   # positives
    df2=df[df['income_level']==0]   # negatives
    df3=pd.concat([df1,df1,df1,df1,df1],ignore_index=True)   # replicate the positives 5x
    return pd.concat([df2,df3],ignore_index=True)

In [47]:
up_train_df=up_sample(train_df)
up_train_new=up_sample(train_new)
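And the corresponding check of the ~25% rate after oversampling (sketch):

print 'positive rate after oversampling: %.2f%%' % (up_train_new['income_level'].mean()*100)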

5.2 Machine learning with XGBoost

In [48]:
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import metrics
from sklearn.cross_validation import train_test_split
# record the running time
import time

## define the model parameters
param = {}
# use logistic regression loss
param['objective'] = 'binary:logistic'
# scale weight of positive examples
param['scale_pos_weight'] = 1
param['bst:eta'] = 0.2
param['bst:max_depth'] = 6
param['eval_metric'] = 'logloss'
param['silent'] = 1
param['nthread'] = 10
Threshold=0.5

D:\Anaconda2\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.

In [74]:
def xgb_model(train,tests,list,pam,Threshold):
    # random_state has a big influence on val-auc
    train_xy,val=train_test_split(train,test_size=0.3,random_state=1)
    y=train_xy['income_level']
    X=train_xy.drop(['income_level'],axis=1)
    val_y=val['income_level']
    val_X=val.drop(['income_level'],axis=1)
    weight1=np.ones(len(y))
    weight2=np.ones(len(val_y))
    xgb_val=xgb.DMatrix(val_X,label=val_y,weight=weight2)
    xgb_train=xgb.DMatrix(X,label=y,weight=weight1)
    test_y=tests['income_level']
    test_X=tests.drop(['income_level'],axis=1)
    xgb_test=xgb.DMatrix(test_X)   # no label for the test matrix
    watchlist=[(xgb_train,'train'),(xgb_val,'val')]
    num_round=100   # number of boosting rounds
    print ("training xgboost")
    threads=list    # candidate values for the parameter being tuned
    for i in threads:
        param[pam]=i
        tmp=time.time()
        plst=param.items()+[('eval_metric','ams@0.15')]
        model=xgb.train(plst,xgb_train,num_round,watchlist,verbose_eval=False)
        preds=model.predict(xgb_test,ntree_limit=model.best_ntree_limit)
        print pam,i
        print ("XGBoost with %d thread costs: %s seconds" % (i,str(time.time()-tmp)))
        for j in range(len(preds)):
            if preds[j]>=Threshold:
                preds[j]=1
            else:
                preds[j]=0
        print 'AUC: %.4f' % metrics.roc_auc_score(test_y,preds)
        print 'ACC: %.4f' % metrics.accuracy_score(test_y,preds)
    return model
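One caveat: preds is binarized before metrics.roc_auc_score, so the reported AUC is the AUC of the hard 0/1 predictions rather than of the scores. Scoring the raw probabilities is the more conventional reading; a sketch of the two lines that would go inside the loop before thresholding:

raw=model.predict(xgb_test,ntree_limit=model.best_ntree_limit)   # raw probabilities
print 'AUC (raw scores): %.4f' % metrics.roc_auc_score(test_y,raw)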

**Tune the parameters separately on the feature-selected train_new and on its oversampled and undersampled versions, then pick the best model from these runs**

5.2.1 Tuning max_depth

Candidate depths: [4, 6, 8, 9, 10]
With oversampling, depths 9 and 10 are worth further tuning because ACC crosses 0.94 there.
With undersampling, depth 4 would be the choice.
In [54]:
### training set without resampling
list1=[4,6,8,9,10]
pam='max_depth'
xgb_model(train_new,test_new,list1,pam,Threshold)

training xgboost
max_depth 4
XGBoost with 4 thread costs: 56.1489999294 seconds
AUC: 0.7273
ACC: 0.9572
max_depth 6
XGBoost with 6 thread costs: 70.1800000668 seconds
AUC: 0.7364
ACC: 0.9581
max_depth 8
XGBoost with 8 thread costs: 90.8049998283 seconds
AUC: 0.7435
ACC: 0.9585
max_depth 9
XGBoost with 9 thread costs: 101.267000198 seconds
AUC: 0.7387
ACC: 0.9579
max_depth 10
XGBoost with 10 thread costs: 112.95600009 seconds
AUC: 0.7417
ACC: 0.9577

In [55]:
### undersampled training set
xgb_model(down_train_new,test_new,list1,pam,Threshold)

training xgboost
max_depth 4
XGBoost with 4 thread costs: 14.1289999485 seconds
AUC: 0.8398
ACC: 0.9395
max_depth 6
XGBoost with 6 thread costs: 19.8990001678 seconds
AUC: 0.8398
ACC: 0.9385
max_depth 8
XGBoost with 8 thread costs: 26.1499998569 seconds
AUC: 0.8410
ACC: 0.9376
max_depth 9
XGBoost with 9 thread costs: 29.2709999084 seconds
AUC: 0.8396
ACC: 0.9373
max_depth 10
XGBoost with 10 thread costs: 32.6649999619 seconds
AUC: 0.8380
ACC: 0.9368

In [56]:
### oversampled training set
xgb_model(up_train_new,test_new,list1,pam,Threshold)

training xgboost
max_depth 4
XGBoost with 4 thread costs: 66.6740000248 seconds
AUC: 0.8555
ACC: 0.9333
max_depth 6
XGBoost with 6 thread costs: 88.0920000076 seconds
AUC: 0.8545
ACC: 0.9352
max_depth 8
XGBoost with 8 thread costs: 113.984999895 seconds
AUC: 0.8507
ACC: 0.9380
max_depth 9
XGBoost with 9 thread costs: 127.167999983 seconds
AUC: 0.8485
ACC: 0.9398
max_depth 10
XGBoost with 10 thread costs: 142.092000008 seconds
AUC: 0.8416
ACC: 0.9410

5.2.2 Tuning scale_pos_weight

Candidates: [0.8, 0.9, 1.0, 1.1, 1.2]
After the previous step, both resampled sets show good AUC but insufficient ACC,
while the unsampled set's AUC peaks around 0.74, so that set is dropped;
undersampling cannot reach ACC 0.94.
Below we focus on the oversampled set at depths 9 and 10. Result: choose scale_pos_weight=1.0 with max_depth=9, giving AUC 0.8485 and ACC 0.9398.
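The XGBoost docs suggest sum(negatives)/sum(positives) as a starting point for scale_pos_weight; from the counts above that is roughly 187141/12382 ≈ 15 on the raw data and ≈ 3 after 5x oversampling, so the grid around 1.0 here is a deliberate departure. A quick computation (sketch):

pos=(up_train_new['income_level']==1).sum()
neg=(up_train_new['income_level']==0).sum()
print 'neg/pos ratio on the oversampled set: %.2f' % (float(neg)/pos)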
In [59]:
### oversampled set, max_depth=10
param['max_depth'] = 10
list2=[0.8,0.9,1.0,1.1,1.2]
pam='scale_pos_weight'
xgb_model(up_train_new,test_new,list2,pam,Threshold)

training xgboost
scale_pos_weight 0.8
XGBoost with 0 thread costs: 145.473999977 seconds
AUC: 0.8300
ACC: 0.9456
scale_pos_weight 0.9
XGBoost with 0 thread costs: 141.223999977 seconds
AUC: 0.8359
ACC: 0.9430
scale_pos_weight 1.0
XGBoost with 1 thread costs: 143.575999975 seconds
AUC: 0.8416
ACC: 0.9410
scale_pos_weight 1.1
XGBoost with 1 thread costs: 171.828999996 seconds
AUC: 0.8442
ACC: 0.9392
scale_pos_weight 1.2
XGBoost with 1 thread costs: 165.302000046 seconds
AUC: 0.8500
ACC: 0.9364

In [60]:
### oversampled set, max_depth=9
param['max_depth'] = 9
list2=[0.8,0.9,1.0,1.1,1.2]
pam='scale_pos_weight'
Threshold=0.5
xgb_model(up_train_new,test_new,list2,pam,Threshold)

training xgboost
scale_pos_weight 0.8
XGBoost with 0 thread costs: 142.86500001 seconds
AUC: 0.8362
ACC: 0.9445
scale_pos_weight 0.9
XGBoost with 0 thread costs: 145.427999973 seconds
AUC: 0.8412
ACC: 0.9413
scale_pos_weight 1.0
XGBoost with 1 thread costs: 141.220000029 seconds
AUC: 0.8485
ACC: 0.9398
scale_pos_weight 1.1
XGBoost with 1 thread costs: 144.178000212 seconds
AUC: 0.8519
ACC: 0.9363
scale_pos_weight 1.2
XGBoost with 1 thread costs: 143.682999849 seconds
AUC: 0.8547
ACC: 0.9349

5.2.3 Tuning the classification threshold

Threshold from 0.45 to 0.55 in steps of 0.01.
Final choice: Threshold = 0.51,
which gives:
AUC: 0.8466
ACC: 0.9409
In [64]:
for m in np.arange(0.45,0.55,0.01):
    param['scale_pos_weight']=1.0
    list3=[9]
    pam='max_depth'
    Threshold=m
    print 'Threshold=',m
    xgb_model(up_train_new,test_new,list3,pam,Threshold)

Threshold= 0.45
training xgboost
max_depth 9
XGBoost with 9 thread costs: 162.524000168 seconds
AUC: 0.8565
ACC: 0.9337
Threshold= 0.46
training xgboost
max_depth 9
XGBoost with 9 thread costs: 160.478999853 seconds
AUC: 0.8555
ACC: 0.9352
Threshold= 0.47
training xgboost
max_depth 9
XGBoost with 9 thread costs: 153.284000158 seconds
AUC: 0.8537
ACC: 0.9363
Threshold= 0.48
training xgboost
max_depth 9
XGBoost with 9 thread costs: 153.253999949 seconds
AUC: 0.8523
ACC: 0.9375
Threshold= 0.49
training xgboost
max_depth 9
XGBoost with 9 thread costs: 152.437000036 seconds
AUC: 0.8499
ACC: 0.9384
Threshold= 0.5
training xgboost
max_depth 9
XGBoost with 9 thread costs: 143.042000055 seconds
AUC: 0.8485
ACC: 0.9398
Threshold= 0.51
training xgboost
max_depth 9
XGBoost with 9 thread costs: 136.401000023 seconds
AUC: 0.8466
ACC: 0.9409
Threshold= 0.52
training xgboost
max_depth 9
XGBoost with 9 thread costs: 139.329999924 seconds
AUC: 0.8445
ACC: 0.9418
Threshold= 0.53
training xgboost
max_depth 9
XGBoost with 9 thread costs: 138.738999844 seconds
AUC: 0.8425
ACC: 0.9428
Threshold= 0.54
training xgboost
max_depth 9
XGBoost with 9 thread costs: 156.219000101 seconds
AUC: 0.8404
ACC: 0.9436
Threshold= 0.55
training xgboost
max_depth 9
XGBoost with 9 thread costs: 164.50999999 seconds
AUC: 0.8381
ACC: 0.9446

5.3 Visualizing the important features

In [72]:
from xgboost import plot_importance
import matplotlib.pyplot as plt
from graphviz import Digraph
import pydot

In [76]:
param['scale_pos_weight'] = 1.0
list3=[9]
pam='max_depth'
Threshold=0.51
model=xgb_model(up_train_new,test_new,list3,pam,Threshold)

training xgboost
max_depth 9
XGBoost with 9 thread costs: 156.400000095 seconds
AUC: 0.8466
ACC: 0.9409

In [102]:
imp_feat=imp_fea.sort_values()[::-1]
feat_imp=imp_feat[:30]

In [103]:
feat_imp.plot(kind='bar')

Out[103]:
<matplotlib.axes._subplots.AxesSubplot at 0xe8124a8>
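plot_importance was imported earlier but never called; it reads the same importances straight from the trained booster (a sketch; the max_num_features argument may require a newer xgboost than the one used here):

plot_importance(model,max_num_features=30)
plt.show()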
