Imbalanced Data Classification with XGBoost
1. Project Analysis and Design
This project trains a model on US census data to predict income level. The dataset contains 199,523 training records and 99,762 test records, each with 41 attributes. The data covers demographic information: age, loan information, nationality, race, and so on. The attributes include missing values and skewed distributions, so the processing plan is:
1. Read the data and inspect the features and their distributions
2. Analyze and handle missing values
3. Handle outliers
4. Dummy-encode the categorical variables
5. Select important features with a random forest
6. Resample to address the class imbalance
7. Build an XGBoost model, then train, analyze, and predict
2. 数据探索 (Data Exploration)
import numpy as np
import pandas as pd

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
print('train_df:%s,%s' % train_df.shape)
print('test_df:%s,%s' % test_df.shape)

train_df:199523,41
test_df:99762,41
# Check the target variable
train_df.income_level.unique()
test_df.income_level.unique()

array(['-50000', '50000+.'], dtype=object)

# Recode the target as 0/1 for convenience (the raw labels are strings)
train_df.loc[train_df['income_level'] == '-50000', 'income_level'] = 0
train_df.loc[train_df['income_level'] == '50000+.', 'income_level'] = 1
test_df.loc[test_df['income_level'] == '-50000', 'income_level'] = 0
test_df.loc[test_df['income_level'] == '50000+.', 'income_level'] = 1

# Check the degree of class imbalance
a = train_df['income_level'].sum() * 100.0 / train_df['income_level'].count()
b = test_df['income_level'].sum() * 100.0 / test_df['income_level'].count()
print('train_df (1,0):(%s,%s)' % (a, 100 - a))
print('test_df (1,0):(%s,%s)' % (b, 100 - b))

train_df (1,0):(6.20580083499,93.794199165)
test_df (1,0):(6.20075780357,93.7992421964)
Inspecting the data
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199523 entries, 0 to 199522
Data columns (total 41 columns):
age                                 199523 non-null int64
class_of_worker                     199523 non-null object
industry_code                       199523 non-null int64
occupation_code                     199523 non-null int64
education                           199523 non-null object
wage_per_hour                       199523 non-null int64
enrolled_in_edu_inst_lastwk         199523 non-null object
marital_status                      199523 non-null object
major_industry_code                 199523 non-null object
major_occupation_code               199523 non-null object
race                                199523 non-null object
hispanic_origin                     198649 non-null object
sex                                 199523 non-null object
member_of_labor_union               199523 non-null object
reason_for_unemployment             199523 non-null object
full_parttime_employment_stat       199523 non-null object
capital_gains                       199523 non-null int64
capital_losses                      199523 non-null int64
dividend_from_Stocks                199523 non-null int64
tax_filer_status                    199523 non-null object
region_of_previous_residence        199523 non-null object
state_of_previous_residence         198815 non-null object
d_household_family_stat             199523 non-null object
d_household_summary                 199523 non-null object
migration_msa                       99827 non-null object
migration_reg                       99827 non-null object
migration_within_reg                99827 non-null object
live_1_year_ago                     199523 non-null object
migration_sunbelt                   99827 non-null object
num_person_Worked_employer          199523 non-null int64
family_members_under_18             199523 non-null object
country_father                      192810 non-null object
country_mother                      193404 non-null object
country_self                        196130 non-null object
citizenship                         199523 non-null object
business_or_self_employed           199523 non-null int64
fill_questionnaire_veteran_admin    199523 non-null object
veterans_benefits                   199523 non-null int64
weeks_worked_in_year                199523 non-null int64
year                                199523 non-null int64
income_level                        199523 non-null int64
dtypes: int64(13), object(28)
memory usage: 62.4+ MB
**Inspecting the numeric features**

import matplotlib.pyplot as plt

def num_tr(filed, n):
    fig = plt.figure(figsize=(10, 5))
    train_df[filed].hist(bins=n)
    plt.title('%s' % filed)
    plt.show()
1. age
1.1 Distribution
num_tr('age',100)
The histogram shows that ages range from 0 to 90 and that counts fall as age rises.
I suspect that people under 20, or those who have only recently started working, are unlikely to earn >50K, though this is not certain.
Now group age into 0-22, 22-35, 35-60, and 60-90, coded 0, 1, 2, 3 (22 is the average age of college graduation, 35 marks the end of the early career (first ~10 years), and 60 is retirement age).
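As a minimal sketch of the grouping just described (toy data, not the project's frame; `pd.cut` with right-closed bins and a left edge of -1 so that age 0 is included):

```python
import pandas as pd

# Toy illustration (hypothetical data) of the 0-22 / 22-35 / 35-60 / 60-90 grouping.
ages = pd.DataFrame({'age': [5, 21, 22, 34, 35, 59, 60, 90]})
ages['age_class'] = pd.cut(ages['age'], bins=[-1, 22, 35, 60, 90], labels=[0, 1, 2, 3])
print(ages['age_class'].tolist())  # -> [0, 0, 0, 1, 1, 2, 2, 3]
```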
'''
# Create an age-group feature
labels = [0,1,2,3,4,5,6,7,8,9]
train_df['age_class'] = pd.cut(train_df['age'], bins=[-1,10,20,30,40,50,60,70,80,90,100], labels=labels)
test_df['age_class'] = pd.cut(test_df['age'], bins=[-1,10,20,30,40,50,60,70,80,90,100], labels=labels)
'''
1.2 Age vs. the target variable
Income level 1 is concentrated between ages 30 and 50, and its age distribution is roughly normal, with a mean of about 50.

'''
fig = plt.figure(figsize=(12,6))
train_df.groupby(['age_class','income_level'])['income_level'].count().unstack().plot(kind='bar')
plt.title('income_level wrt age')
plt.show()
'''

# Age distribution by income level
fig = plt.figure(figsize=(12,6))
train_df.age[train_df.income_level==0].plot(kind='kde')
train_df.age[train_df.income_level==1].plot(kind='kde')
plt.legend(('0','1'))
plt.show()
2. capital_losses & capital_gains
Both are right-skewed; further handling comes later.

fig = plt.figure(figsize=(8,4))
plt.subplot2grid((1,2),(0,0))
train_df.capital_gains.plot(kind='box')
plt.subplot2grid((1,2),(0,1))
train_df.capital_losses.plot(kind='box')
plt.show()
3. weeks_worked_in_year
Level 0 is concentrated at 0 and 50 weeks, while level 1 is concentrated at 50; almost none of the level-1 records take low values for this variable.

# Distribution of weeks worked per year, by income level
fig = plt.figure(figsize=(8,4))
plt.subplot2grid((1,2),(0,0))
train_df.weeks_worked_in_year[train_df.income_level==0].hist(bins=20)
plt.subplot2grid((1,2),(0,1))
train_df.weeks_worked_in_year[train_df.income_level==1].hist(bins=20,color='r')
plt.show()
4. dividend_from_Stocks
Right-skewed; further handling comes later.

fig = plt.figure(figsize=(12,6))
train_df.dividend_from_Stocks[train_df.income_level==0].hist(bins=100)
train_df.dividend_from_Stocks[train_df.income_level==1].hist(bins=100)
plt.legend(('0','1'))
plt.show()
5. num_person_Worked_employer
Level 0 is mostly 0, while level 1 is mostly 6.

fig = plt.figure(figsize=(12,6))
#train_df.num_person_Worked_employer[train_df.income_level==0].hist(bins=100)
#train_df.num_person_Worked_employer[train_df.income_level==1].hist(bins=100)
train_df.groupby(['num_person_Worked_employer','income_level'])['income_level'].count().unstack().plot(kind='bar')
plt.legend(('0','1'))
plt.show()
**Inspecting the categorical features**
1. class_of_worker
Although no specific information is given about the Not in universe category, we assume these answers came from people who, for whatever reason, were frustrated with filling out the census form.
This variable looks unbalanced: only two categories dominate. In such cases a good practice is to combine the levels whose frequency is below 5% of the total; this is handled later.
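The "combine levels below 5%" idea can be sketched like this (a hypothetical helper on toy data, not code from the project):

```python
import pandas as pd

# Hypothetical helper: merge category levels whose relative frequency is
# below the threshold into a single 'Other' level.
def combine_rare_levels(s, threshold=0.05, other='Other'):
    freq = s.value_counts(normalize=True)
    rare = freq[freq < threshold].index
    return s.where(~s.isin(rare), other)

s = pd.Series(['A'] * 90 + ['B'] * 7 + ['C'] * 2 + ['D'] * 1)
combined = combine_rare_levels(s)
print(combined.value_counts().to_dict())  # -> {'A': 90, 'B': 7, 'Other': 3}
```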
fig = plt.figure(figsize=(18,12))
train_df.groupby(['class_of_worker','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=120)
plt.legend(('0','1'))
plt.show()
2. education
Bachelors degree holders have the most level-1 records.

fig = plt.figure(figsize=(18,8))
train_df.groupby(['education','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()
3. marital_status
Married-civilian spouse present has the most level-1 records.

fig = plt.figure(figsize=(18,8))
train_df.groupby(['marital_status','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()
4. race
Whites make up the majority of the sample, and they also have the most level-1 records.

fig = plt.figure(figsize=(18,8))
train_df.groupby(['race','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()
5. sex
Women form the larger share of the sample overall, but the level-1 records are mostly men.

fig = plt.figure(figsize=(18,8))
train_df.groupby(['sex','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()
6. member_of_labor_union
Both classes are concentrated in Not in universe.

fig = plt.figure(figsize=(18,8))
train_df.groupby(['member_of_labor_union','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()
7. full_parttime_employment_stat

fig = plt.figure(figsize=(18,8))
train_df.groupby(['full_parttime_employment_stat','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()
8. tax_filer_status

fig = plt.figure(figsize=(18,8))
train_df.groupby(['tax_filer_status','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()
9. business_or_self_employed

fig = plt.figure(figsize=(18,8))
train_df.groupby(['business_or_self_employed','income_level'])['income_level'].count().unstack().plot(kind='bar')
#plt.xticks(rotation=30)
plt.legend(('0','1'))
plt.show()
2. 数据预处理 (Data Preprocessing)
2.1 Missing values
The test data has no literal missing values, but it does contain '?'; first replace '?' with NaN.
s = pd.Series(train_df.isnull().sum())
print s
ss = pd.Series(test_df.isnull().sum())
print ss

Missing-value counts (all columns not listed have 0 missing):

                              train    test
hispanic_origin                 874     405
state_of_previous_residence     708     330
migration_msa                 99696   49946
migration_reg                 99696   49946
migration_within_reg          99696   49946
migration_sunbelt             99696   49946
country_father                 6713    3429
country_mother                 6119    3072
country_self                   3393    1764
Compute the percentage of missing values.
# Missing ratio in the training set
m = train_df.shape[0]
for i, j in s.iteritems():
    if j > 0:
        print i, j * 100.0 / m
print '----------------------------'
# Missing ratio in the test set
n = test_df.shape[0]
for i, j in ss.iteritems():
    if j > 0:
        print i, j * 100.0 / n

hispanic_origin 0.438044736697
state_of_previous_residence 0.354846308446
migration_msa 49.9671717045
migration_reg 49.9671717045
migration_within_reg 49.9671717045
migration_sunbelt 49.9671717045
country_father 3.36452439067
country_mother 3.06681435223
country_self 1.70055582564
----------------------------
hispanic_origin 0.405966199555
state_of_previous_residence 0.330787273711
migration_msa 50.0651550691
migration_reg 50.0651550691
migration_within_reg 50.0651550691
migration_sunbelt 50.0651550691
country_father 3.43718048957
country_mother 3.07932880255
country_self 1.76820833584
Drop the columns with too many missing values (four columns are missing close to 50%).

del train_df['migration_msa']
del train_df['migration_reg']
del train_df['migration_within_reg']
del train_df['migration_sunbelt']
del test_df['migration_msa']
del test_df['migration_reg']
del test_df['migration_within_reg']
del test_df['migration_sunbelt']

The other five variables have tiny missing fractions, so we fill them with a new 'others' level: with imbalanced data it is better to shape the data than to throw precious records away.

train_df['hispanic_origin'] = train_df['hispanic_origin'].fillna('others')
train_df['state_of_previous_residence'] = train_df['state_of_previous_residence'].fillna('others')
train_df['country_father'] = train_df['country_father'].fillna('others')
train_df['country_mother'] = train_df['country_mother'].fillna('others')
train_df['country_self'] = train_df['country_self'].fillna('others')
test_df['hispanic_origin'] = test_df['hispanic_origin'].fillna('others')
test_df['state_of_previous_residence'] = test_df['state_of_previous_residence'].fillna('others')
test_df['country_father'] = test_df['country_father'].fillna('others')
test_df['country_mother'] = test_df['country_mother'].fillna('others')
test_df['country_self'] = test_df['country_self'].fillna('others')
2.2 Outlier handling
*Transform the heavily right-skewed variables with a log transform*

def outliner(df, filed):
    df[filed] = np.log(df[filed] + 1)
    df[filed].plot(kind='kde')

# Training data: capital_losses, capital_gains, dividend_from_Stocks
fig = plt.figure(figsize=(15,5))
plt.subplot2grid((2,2),(0,0))
outliner(train_df,'capital_losses')
plt.subplot2grid((2,2),(0,1))
outliner(train_df,'capital_gains')
plt.subplot2grid((2,2),(1,0))
outliner(train_df,'dividend_from_Stocks')
plt.show()

# Test data
fig = plt.figure(figsize=(15,5))
plt.subplot2grid((2,2),(0,0))
outliner(test_df,'capital_losses')
plt.subplot2grid((2,2),(0,1))
outliner(test_df,'capital_gains')
plt.subplot2grid((2,2),(1,0))
outliner(test_df,'dividend_from_Stocks')
plt.show()
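A quick synthetic check of why the log transform helps here (simulated lognormal data, not the census columns): the sample skewness drops sharply after log(1+x).

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (lognormal), standing in for capital_gains etc.
rng = np.random.RandomState(0)
x = pd.Series(rng.lognormal(mean=3.0, sigma=1.5, size=10000))

skew_before = x.skew()
skew_after = np.log1p(x).skew()   # log1p(x) == log(x + 1)
print(skew_before, skew_after)    # the transform shrinks the skewness dramatically
```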
2.3 Dummy encoding

def dummy_encode(df, filed, prefix):
    dummies = pd.get_dummies(df[filed], prefix=prefix)
    kept = dummies.iloc[:, 0:dummies.shape[1] - 1]  # drop the last dummy level
    out = pd.concat([df, kept], axis=1)
    del out[filed]
    return out
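To see what `dummy_encode` does, here is a self-contained toy run of the same steps (a hypothetical 'color' column; `get_dummies` sorts the levels alphabetically, and one dummy is dropped to avoid perfect collinearity):

```python
import pandas as pd

# Toy frame with one categorical column (hypothetical data).
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})

dummies = pd.get_dummies(df['color'], prefix='color')   # color_blue, color_green, color_red
kept = dummies.iloc[:, 0:dummies.shape[1] - 1]          # drop the last dummy level
encoded = pd.concat([df.drop(columns='color'), kept], axis=1)
print(list(encoded.columns))  # -> ['color_blue', 'color_green']
```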
print train_df.shape,test_df.shape
(199523, 37) (99762, 37)
df_all=pd.concat([train_df,test_df])
# Dummy-encode the categorical columns of the combined train + test frame
cat_cols = ['fill_questionnaire_veteran_admin', 'citizenship', 'country_self',
            'country_mother', 'country_father', 'family_members_under_18',
            'live_1_year_ago', 'd_household_summary', 'class_of_worker',
            'education', 'enrolled_in_edu_inst_lastwk', 'marital_status',
            'major_industry_code', 'major_occupation_code', 'race',
            'hispanic_origin', 'sex', 'member_of_labor_union',
            'reason_for_unemployment', 'full_parttime_employment_stat',
            'tax_filer_status', 'region_of_previous_residence',
            'state_of_previous_residence', 'd_household_family_stat']
for col in cat_cols:
    df_all = dummy_encode(df_all, col, col)
train_df = df_all.iloc[0:199523, :]
test_df = df_all.iloc[199523:, :]
print train_df.shape, test_df.shape
(199523, 352) (99762, 352)
#test_df.to_csv('testooooo.csv')
#train_df.to_csv('trainooooo.csv')
3. 特征选择 (Feature Selection)
Feature selection with a random forest.

# Move the target variable to the last column
Y = train_df['income_level']
del train_df['income_level']
train_df['income_level'] = Y
YT = test_df['income_level']
del test_df['income_level']
test_df['income_level'] = YT
y = Y
X = train_df.iloc[:, 0:351]

from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

selected_feat_names = set()
for i in range(10):  # run 10 rounds and keep the union of the selected features
    tmp = set()
    rfc = RandomForestClassifier(n_jobs=-1)
    rfc.fit(X, y)
    importances = rfc.feature_importances_
    indices = np.argsort(importances)[::-1]  # descending by importance
    S = {}
    for f in range(X.shape[1]):
        if importances[indices[f]] >= 0.0001:
            tmp.add(X.columns[indices[f]])
            S[X.columns[indices[f]]] = importances[indices[f]]
    selected_feat_names |= tmp
imp_fea = pd.Series(S)
print(len(selected_feat_names), "features are selected")
(285, 'features are selected')
train_new = train_df[['income_level']]
test_new = test_df[['income_level']]
for i in selected_feat_names:
    train_new[i] = train_df[i]
    try:
        test_new[i] = test_df[i]
    except Exception:
        print '----------------'
        print i
        del train_new[i]
print train_new.shape, test_new.shape
(199523, 292) (99762, 292)
# Move the target variable to the last column
Y = train_new['income_level']
del train_new['income_level']
train_new['income_level'] = Y
YT = test_new['income_level']
del test_new['income_level']
test_new['income_level'] = YT
#train_new.to_csv('train_new.csv')
#test_new.to_csv('test_new.csv')
4. 机器学习 (Machine Learning)
First: handle the imbalance (undersampling and oversampling).
Next: model selection and training (XGBoost).
Finally: tuning. The goal is the largest AUC possible while keeping accuracy ≥ 0.94.
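That selection rule, maximize AUC subject to ACC ≥ 0.94, can be written down directly; the candidate numbers below are taken from the max_depth sweep later in this section:

```python
# Pick the best result subject to the accuracy constraint.
candidates = [
    {'params': 'depth=8',  'auc': 0.8507, 'acc': 0.9380},
    {'params': 'depth=9',  'auc': 0.8485, 'acc': 0.9398},
    {'params': 'depth=10', 'auc': 0.8416, 'acc': 0.9410},
]
feasible = [c for c in candidates if c['acc'] >= 0.94]   # ACC constraint first
best = max(feasible, key=lambda c: c['auc'])             # then maximize AUC
print(best['params'])  # -> depth=10
```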
4.1 Imbalance techniques: undersampling and oversampling
4.1.1 Undersampling
train_df (1,0):(6.20580083499,93.794199165)
test_df (1,0):(6.20075780357,93.7992421964)
Positives: 12382; negatives: 187141
Sampling fraction: 25% of the negatives
Resulting event rate: about 21%
def down_sample(df):
    df1 = df[df['income_level'] == 1]   # positives
    df2 = df[df['income_level'] == 0]   # negatives
    df3 = df2.sample(frac=0.25)         # keep 25% of the negatives
    return pd.concat([df1, df3], ignore_index=True)
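A quick arithmetic check of the ~21% figure (counts taken from above):

```python
# Event rate after keeping 25% of the negatives.
pos, neg = 12382, 187141
rate = pos / (pos + 0.25 * neg)
print(round(rate, 3))  # -> 0.209, i.e. about 21%
```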
down_train_df = down_sample(train_df)
down_train_new = down_sample(train_new)
4.1.2 Oversampling
train_df (1,0):(6.20580083499,93.794199165)
test_df (1,0):(6.20075780357,93.7992421964)
Positives: 12382; negatives: 187141
Positives replicated to 5 copies
Resulting event rate: about 25%
def up_sample(df):
    df1 = df[df['income_level'] == 1]   # positives
    df2 = df[df['income_level'] == 0]   # negatives
    df3 = pd.concat([df1, df1, df1, df1, df1], ignore_index=True)  # 5 copies of the positives
    return pd.concat([df2, df3], ignore_index=True)
up_train_df = up_sample(train_df)
up_train_new = up_sample(train_new)
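A quick arithmetic check of the ~25% figure (counts from above; the positives end up as 5 copies in total):

```python
# Event rate after replicating the positives to 5 copies.
pos, neg = 12382, 187141
rate = (5 * pos) / (5 * pos + neg)
print(round(rate, 3))  # -> 0.249, i.e. about 25%
```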
4.2 Machine learning with XGBoost
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import metrics
from sklearn.cross_validation import train_test_split
import time  # to time the training runs

# Model parameters
param = {}
param['objective'] = 'binary:logistic'  # logistic regression loss
param['scale_pos_weight'] = 1           # scale weight of positive examples
param['bst:eta'] = 0.2
param['bst:max_depth'] = 6
param['eval_metric'] = 'logloss'
param['silent'] = 1
param['nthread'] = 10

Threshold = 0.5
def xgb_model(train, tests, values, pam, Threshold):
    # random_state has a big influence on the validation AUC
    train_xy, val = train_test_split(train, test_size=0.3, random_state=1)
    y = train_xy['income_level']
    X = train_xy.drop(['income_level'], axis=1)
    val_y = val['income_level']
    val_X = val.drop(['income_level'], axis=1)
    weight1 = np.ones(len(y))
    weight2 = np.ones(len(val_y))
    xgb_val = xgb.DMatrix(val_X, label=val_y, weight=weight2)
    xgb_train = xgb.DMatrix(X, label=y, weight=weight1)
    test_y = tests['income_level']
    test_X = tests.drop(['income_level'], axis=1)
    xgb_test = xgb.DMatrix(test_X)  # no label for the test matrix
    watchlist = [(xgb_train, 'train'), (xgb_val, 'val')]
    num_round = 100  # number of boosting rounds
    print ("training xgboost")
    for i in values:  # sweep the parameter being tuned
        param[pam] = i
        tmp = time.time()
        plst = param.items() + [('eval_metric', 'ams@0.15')]
        model = xgb.train(plst, xgb_train, num_round, watchlist, verbose_eval=False)
        preds = model.predict(xgb_test, ntree_limit=model.best_ntree_limit)
        print pam, i
        print ("XGBoost with %d thread costs: %s seconds" % (i, str(time.time() - tmp)))
        for j in range(len(preds)):
            if preds[j] >= Threshold:
                preds[j] = 1
            else:
                preds[j] = 0
        print 'AUC: %.4f' % metrics.roc_auc_score(test_y, preds)
        print 'ACC: %.4f' % metrics.accuracy_score(test_y, preds)
    return model
**Tune on the feature-selected train_new and on its undersampled and oversampled versions, then pick the best model**
4.2.1 Tuning max_depth
list1 = [4, 6, 8, 9, 10]
With oversampling, depths 9 and 10 are worth tuning further, since their ACC crosses 0.94.
With undersampling, depth 4 is chosen.
### Training set without resampling
list1 = [4, 6, 8, 9, 10]
pam = 'max_depth'
xgb_model(train_new, test_new, list1, pam, Threshold)
max_depth  time (s)        AUC     ACC
4          56.1489999294   0.7273  0.9572
6          70.1800000668   0.7364  0.9581
8          90.8049998283   0.7435  0.9585
9          101.267000198   0.7387  0.9579
10         112.95600009    0.7417  0.9577
### Undersampled training set
xgb_model(down_train_new, test_new, list1, pam, Threshold)
max_depth  time (s)        AUC     ACC
4          14.1289999485   0.8398  0.9395
6          19.8990001678   0.8398  0.9385
8          26.1499998569   0.8410  0.9376
9          29.2709999084   0.8396  0.9373
10         32.6649999619   0.8380  0.9368
### Oversampled training set
xgb_model(up_train_new, test_new, list1, pam, Threshold)
max_depth  time (s)        AUC     ACC
4          66.6740000248   0.8555  0.9333
6          88.0920000076   0.8545  0.9352
8          113.984999895   0.8507  0.9380
9          127.167999983   0.8485  0.9398
10         142.092000008   0.8416  0.9410
4.2.2 Tuning scale_pos_weight
list2 = [0.8, 0.9, 1.0, 1.1, 1.2]
After the previous step, both the over- and undersampled sets show decent AUC, but their ACC falls short.
The unresampled set peaks at an AUC of only about 0.74, so it is discarded.
The undersampled set cannot reach ACC 0.94.
Below we focus on improving the ACC of the oversampled set at depths 9 and 10. Result: for oversampling, choose scale_pos_weight = 1.0 with depth 9, giving AUC 0.8485 and ACC 0.9398.
### Oversampled, depth 10
param['max_depth'] = 10
list2 = [0.8, 0.9, 1.0, 1.1, 1.2]
pam = 'scale_pos_weight'
xgb_model(up_train_new, test_new, list2, pam, Threshold)
scale_pos_weight  time (s)        AUC     ACC
0.8               145.473999977   0.8300  0.9456
0.9               141.223999977   0.8359  0.9430
1.0               143.575999975   0.8416  0.9410
1.1               171.828999996   0.8442  0.9392
1.2               165.302000046   0.8500  0.9364
### Oversampled, depth 9
param['max_depth'] = 9
list2 = [0.8, 0.9, 1.0, 1.1, 1.2]
pam = 'scale_pos_weight'
Threshold = 0.5
xgb_model(up_train_new, test_new, list2, pam, Threshold)
scale_pos_weight  time (s)        AUC     ACC
0.8               142.86500001    0.8362  0.9445
0.9               145.427999973   0.8412  0.9413
1.0               141.220000029   0.8485  0.9398
1.1               144.178000212   0.8519  0.9363
1.2               143.682999849   0.8547  0.9349
4.2.3 Tuning the decision threshold
Sweep Threshold from 0.45 to 0.55 in steps of 0.01.
Final choice: Threshold = 0.51
which gives:
AUC: 0.8466
ACC: 0.9409
for m in np.arange(0.45, 0.55, 0.01):
    param['scale_pos_weight'] = 1.0
    list3 = [9]
    pam = 'max_depth'
    Threshold = m
    print 'Threshold=', m
    xgb_model(up_train_new, test_new, list3, pam, Threshold)
Threshold  time (s)        AUC     ACC
0.45       162.524000168   0.8565  0.9337
0.46       160.478999853   0.8555  0.9352
0.47       153.284000158   0.8537  0.9363
0.48       153.253999949   0.8523  0.9375
0.49       152.437000036   0.8499  0.9384
0.50       143.042000055   0.8485  0.9398
0.51       136.401000023   0.8466  0.9409
0.52       139.329999924   0.8445  0.9418
0.53       138.738999844   0.8425  0.9428
0.54       156.219000101   0.8404  0.9436
0.55       164.50999999    0.8381  0.9446
4.3 Visualizing the important features
from xgboost import plot_importance
import matplotlib.pyplot as plt
from graphviz import Digraph
import pydot
param['scale_pos_weight'] = 1.0
list3 = [9]
pam = 'max_depth'
Threshold = 0.51
model = xgb_model(up_train_new, test_new, list3, pam, Threshold)
training xgboost
max_depth 9
XGBoost with 9 thread costs: 156.400000095 seconds
AUC: 0.8466
ACC: 0.9409
imp_feat = imp_fea.sort_values()[::-1]
feat_imp = imp_feat[:30]
feat_imp.plot(kind='bar')