Tianchi: Mining Happiness Together!

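Happiness is an ancient and profound topic, a goal that people have pursued for generations. Countless factors affect it, and they differ from person to person: from matters as large as the national economy and people's livelihoods to something as small as a roasted sweet potato from a street stall. Among these tangled factors, can we find what they have in common and catch a glimpse of what happiness really is?
Tianchi newcomer competitions are hands-on practice events for people new to data work. They use classic competition problems as learning scenarios and provide detailed introductory tutorials that walk you through data mining step by step. Tianchi hopes these competitions will become a popular practical data course at universities and help more students acquire data skills.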
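Competition background
In the social sciences, research on happiness holds an important place. The topic spans philosophy, psychology, sociology, economics and other disciplines. It is complex and interesting, and at the same time closely tied to everyday life, since everyone has their own yardstick for happiness. If we could find what the factors behind happiness have in common, life might hold a little more joy; if we could identify the policy factors that affect happiness, resources could be allocated better to raise national well-being. Current social science research emphasizes interpretable variables and policies that can actually be implemented, so it relies mainly on linear regression and logistic regression. It has produced a series of conjectures and findings on individual economic and demographic factors such as income, health, occupation, social relationships and leisure, and on macro-level factors such as government public services, the macroeconomic environment and the tax burden.
This competition takes on the classic problem of happiness prediction. It hopes to see algorithmic approaches from angles beyond existing social science research, to combine the strengths of several disciplines, to uncover potential influencing factors, and to find more relationships that are interpretable and understandable.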
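Task description
The competition uses the results of a public questionnaire survey and selects several groups of variables from it: individual variables (gender, age, region, occupation, health, marital status, political affiliation, and so on), family variables (parents, spouse, children, family capital, and so on), and social attitudes (fairness, trust, public services, and so on). These are used to predict each respondent's rating of their own happiness.
Prediction accuracy is not the only goal. Participants are also encouraged to explore the relationships between variables and the meaning of groups of variables.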
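Data description
Because there are many variables and some of them are related in complex ways, the data comes in a complete version and an abbreviated version. You can start with the abbreviated version to get familiar with the task and then use the complete version to mine more information. The complete files contain the full set of variables; the abbr files contain the reduced set.

The index file lists the questionnaire item behind each variable and the meaning of each variable's values.

The survey file is the original questionnaire from the data source, included as a supplement to help understand the context of the questions.

Data source: the data comes from the Chinese General Social Survey (CGSS) project run by the National Survey Research Center at Renmin University of China (中国人民大学中国调查与数据中心). The competition thanks the institution and its staff for their help with the data. CGSS is a cross-sectional, face-to-face survey based on multi-stage stratified sampling.

External data: since the competition is about data mining and analysis, external data is not restricted. Public data such as macroeconomic indicators or government redistribution policies may be used, and participants are welcome to share and discuss.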
Evaluation metric
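The metric definition itself did not survive in this copy. The modeling code further down scores models with `mean_squared_error`, so the sketch below assumes the metric is the mean squared error (MSE) between predicted and true happiness scores. The `mse` helper and the sample values are made up for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up labels and made-up predictions (not competition data).
y_true = [4, 5, 3, 4]
y_pred = [4.0, 4.0, 3.5, 3.0]
score = mse(y_true, y_pred)
print(score)
```

Because the error is squared, one prediction that is off by 2 costs as much as four predictions that are each off by 1.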

The code follows.
First, import the required packages and libraries:

import os
import time
import numpy as np
import pandas as pd
import lightgbm as lgb
import seaborn as sns
from sklearn import metrics
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
# Plotting (matplotlib); the %matplotlib magic only works inside Jupyter/IPython
%matplotlib inline
import matplotlib.pyplot as plt
from scipy.special import jn
from IPython.display import display, clear_output

Configure pandas display options

# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)

Read the file that explains each variable:

happiness_index = pd.read_excel('happiness_index.xlsx')
happiness_index.head(50)

Read the training and test sets

train = pd.read_csv("happiness_train_complete.csv", parse_dates=["survey_time"], encoding='latin-1')
test = pd.read_csv("happiness_test_complete.csv", parse_dates=["survey_time"], encoding='latin-1')
train.head()

First five rows of the training set

train.shape

Check missing values in the training set

train.isnull().sum().sort_values(ascending=False)

Check missing values in the test set

Summary of the training set
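No code survived for the two steps above. A minimal sketch of what they likely looked like, run on tiny stand-in frames so it is self-contained (the `train_demo` and `test_demo` names and all their values are illustrative, not competition data):

```python
import numpy as np
import pandas as pd

# Stand-ins for the real train/test frames.
train_demo = pd.DataFrame({
    "happiness": [4, 5, 3, 4],
    "income": [20000, np.nan, 5000, 8000],
})
test_demo = pd.DataFrame({
    "income": [np.nan, np.nan, 3000],
    "edu": [3, np.nan, 4],
})

# Missing values per column in the test set, most-missing first.
test_missing = test_demo.isnull().sum().sort_values(ascending=False)

# Summary statistics (count, mean, std, quartiles) for the training columns.
train_summary = train_demo.describe()

print(test_missing)
print(train_summary)
```

On the real data the same two calls would be `test.isnull().sum().sort_values(ascending=False)` and `train.describe()`.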

Remove training rows with an invalid label

# Drop rows whose happiness label is -8 (an invalid answer)
train = train.loc[train['happiness'] != -8]

Look at the class distribution; the classes are clearly imbalanced

# Pie chart and bar chart of the label distribution
f, ax = plt.subplots(1, 2, figsize=(18, 8))
train['happiness'].value_counts().plot.pie(autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('happiness')
ax[0].set_ylabel('')
train['happiness'].value_counts().plot.bar(ax=ax[1])
ax[1].set_title('happiness')
plt.show()

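Explore how happiness is distributed by gender

# Pass the column as x= (recent seaborn versions reject a positional argument)
# and set the title on the axes that countplot returns
g = sns.countplot(x='gender', hue='happiness', data=train)
g.set_title('Sex:happiness')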

Explore the relationship between age and happiness

# Age at survey time = survey year - birth year
train['survey_time'] = train['survey_time'].dt.year
test['survey_time'] = test['survey_time'].dt.year
train['Age'] = train['survey_time'] - train['birth']
test['Age'] = test['survey_time'] - test['birth']
# Columns that are no longer needed once Age exists
del_list = ['survey_time', 'birth']
figure, ax = plt.subplots(1, 1)
train['Age'].plot.hist(ax=ax, color='blue')

Bin the ages to reduce the effect of noise and outliers

# Ages are usually binned to reduce the effect of noise and outliers
combine = [train, test]
for dataset in combine:
    dataset.loc[dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 64) & (dataset['Age'] <= 80), 'Age'] = 4
    dataset.loc[dataset['Age'] > 80, 'Age'] = 5
sns.countplot(x='Age', hue='happiness', data=train)
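The same binning can be written more compactly with `pd.cut`. This is an alternative sketch rather than the code used above; the `ages` series is made up, and the bin edges mirror the thresholds in the loop.

```python
import numpy as np
import pandas as pd

# Made-up ages, chosen to sit on and around each threshold.
ages = pd.Series([15, 16, 17, 32, 33, 48, 64, 65, 80, 81, 95])

# pd.cut uses right-closed intervals by default:
# (-inf, 16], (16, 32], (32, 48], (48, 64], (64, 80], (80, inf)
edges = [-np.inf, 16, 32, 48, 64, 80, np.inf]
age_bins = pd.cut(ages, bins=edges, labels=[0, 1, 2, 3, 4, 5]).astype(int)

print(age_bins.tolist())
```

Keeping the edges in one list makes it harder to leave a gap or an overlap between neighbouring bins than with a chain of `.loc` assignments.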

Happiness distribution within each age group

figure1,ax1 = plt.subplots(1,5,figsize=(18,4))
train['happiness'][train['Age']==1].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[0],shadow=True)
train['happiness'][train['Age']==2].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[1],shadow=True)
train['happiness'][train['Age']==3].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[2],shadow=True)
train['happiness'][train['Age']==4].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[3],shadow=True)
train['happiness'][train['Age']==5].value_counts().plot.pie(autopct='%1.1f%%',ax=ax1[4],shadow=True)

Correlation analysis
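No code survived under this heading. One common approach is to compute Pearson correlations between the numeric features and the label. The sketch below does that on a made-up frame; the column names and values are illustrative only.

```python
import pandas as pd

# Made-up data: one feature that rises with the label, one that falls,
# and one that is only weakly related.
demo = pd.DataFrame({
    "happiness":  [1, 2, 3, 4, 5],
    "health":     [1, 2, 3, 4, 5],
    "depression": [5, 4, 3, 2, 1],
    "height_cm":  [170, 160, 175, 165, 170],
})

# Pearson correlation of every feature with the label.
corr = demo.corr()["happiness"].drop("happiness")

# Order by absolute strength, keeping the sign of each correlation.
order = corr.abs().sort_values(ascending=False).index
corr_sorted = corr[order]
print(corr_sorted)
```

Correlation only captures linear association, so a feature with a low value here can still be useful to a tree-based model.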

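Missing values

# `data` is presumably the training and test sets combined; the step that builds it is not shown
pd.DataFrame(data.isnull().sum()).tail(50)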

First, handle the time features

# Derive calendar features from the survey timestamp
data['survey_time'] = pd.to_datetime(data['survey_time'], format='%Y-%m-%d %H:%M:%S')
data["weekday"] = data["survey_time"].dt.weekday
data["year"] = data["survey_time"].dt.year
data["quarter"] = data["survey_time"].dt.quarter
data["hour"] = data["survey_time"].dt.hour
data["month"] = data["survey_time"].dt.month

Bucket the hour of day at which the questionnaire was taken

def hour_cut(x):
    if 0 <= x < 6:
        return 0
    elif 6 <= x < 8:
        return 1
    elif 8 <= x < 12:
        return 2
    elif 12 <= x < 14:
        return 3
    elif 14 <= x < 18:
        return 4
    elif 18 <= x < 21:
        return 5
    elif 21 <= x < 24:
        return 6

data["hour_cut"] = data["hour"].map(hour_cut)

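Compute the respondent's age at the time of the survey

data["survey_age"] = data["year"] - data["birth"]

Whether the respondent has joined the Party

# A missing join_party value means not a member (0); any value means a member (1)
data["join_party"] = data["join_party"].map(lambda x: 0 if pd.isnull(x) else 1)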

Birth decade

def birth_split(x):
    if 1920 <= x <= 1930:
        return 0
    elif 1930 < x <= 1940:
        return 1
    elif 1940 < x <= 1950:
        return 2
    elif 1950 < x <= 1960:
        return 3
    elif 1960 < x <= 1970:
        return 4
    elif 1970 < x <= 1980:
        return 5
    elif 1980 < x <= 1990:
        return 6
    elif 1990 < x <= 2000:
        return 7

data["birth_s"] = data["birth"].map(birth_split)

# Create an income category feature from the income column
def income_cut(income):
    if 0 <= income < 200000:
        return 1
    elif 200000 <= income < 350000:
        return 2
    elif 350000 <= income < 600000:
        return 3
    elif 600000 <= income < 800000:
        return 4
    elif 800000 <= income < 2000000:
        return 5
    elif 2000000 <= income < 5000000:
        return 6
    elif 5000000 <= income:
        return 7
    # Negative or missing values fall through and become NaN, which keeps
    # every row aligned (appending to a list would silently skip such rows)

data['income'] = data['income'].map(income_cut)

Text data processing
Three of the features, edu_other, property_other and invest_other, are string data and need to be converted to ordinal encoding.

First, look at how edu_other was filled in.

data_origin[data_origin['edu_other'] != -1]['edu_other'].to_frame()

All of the filled-in edu_other values are 夜校 (night school), so convert the strings to ordinal codes.

data_origin['edu_other'] = data_origin['edu_other'].astype('category').values.codes + 1
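To see what this encoding does, here is a small sketch on a made-up column. `astype('category')` assigns integer codes to the distinct values in sorted order, and missing values get code -1, so adding 1 maps them to 0. The sample values are invented, and the sketch assumes blanks are stored as NaN.

```python
import numpy as np
import pandas as pd

# Made-up free-text answers with some blanks.
col = pd.Series(["night school", np.nan, "night school", "correspondence", np.nan])

# Categories are sorted alphabetically: "correspondence" -> 0, "night school" -> 1.
# NaN belongs to no category and gets code -1, so +1 shifts it to 0.
codes = col.astype("category").cat.codes + 1

print(codes.tolist())
```

Note that the resulting integers follow alphabetical order rather than any meaningful ranking, so the encoding is ordinal in form only.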

Next, property_other records who holds the title to the home. Start by checking how it was filled in.

data_origin[data_origin['property_other'] != -1]['property_other'].to_frame()

Many of the answers mean the same thing, for example 家庭共同所有 (jointly owned by the family) and 全家所有 (owned by the whole family), but in Python they have to be normalized by hand, one by one.

# Labels: 无产权 = no title, 未过户 = title not transferred, 全家拥有 = owned by the whole family,
# 多人拥有 = multiple owners, 小产权 = limited property rights
#data_origin.loc[[8009, 9212, 9759, 10517], 'property_other'] = '多人拥有'
#data_origin.loc[[8014, 8056, 10264], 'property_other'] = '未过户'
#data_origin.loc[[8471, 8825, 9597, 9810, 9842, 9967, 10069, 10166, 10203, 10469], 'property_other'] = '全家拥有'
#data_origin.loc[[8553, 8596, 9605, 10421, 10814], 'property_other'] = '无产权'

data_origin.loc[[76, 132, 455, 495, 1415, 2511, 2792, 2956, 3647, 4147, 4193, 4589, 5023, 5382, 5492, 6102, 6272, 6339, 6507, 7184, 7239], 'property_other'] = '无产权'
data_origin.loc[[92, 1888, 2703, 3381, 5654], 'property_other'] = '未过户'
data_origin.loc[[99, 619, 2728, 3062, 3222, 3251, 3696, 5283, 6191, 7295, 7376, 7746, 7821, 7917], 'property_other'] = '全家拥有'
data_origin.loc[[1597, 4993, 5398, 5899, 7240, 7776], 'property_other'] = '多人拥有'
data_origin.loc[[6469, 6891], 'property_other'] = '小产权'
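Hard-coded row indices stop working as soon as the row order changes. A possible alternative is to normalize by the answer text itself with a lookup table. The sketch below is illustrative only: the `SYNONYMS` table and the sample answers are made up (apart from the 家庭共同所有 / 全家所有 pair mentioned above), and a real table would have to be built by reading through the actual responses.

```python
import pandas as pd

# Hypothetical lookup from a raw answer to its canonical label.
SYNONYMS = {
    "家庭共同所有": "全家拥有",
    "全家所有": "全家拥有",
    "没有产权": "无产权",
}

# Made-up raw answers.
answers = pd.Series(["家庭共同所有", "全家所有", "没有产权", "未过户"])

# Replace known variants; leave anything not in the table unchanged.
normalized = answers.map(lambda s: SYNONYMS.get(s, s))

print(normalized.tolist())
```

The same table can then be applied to both the training and the test set, which the index-based version cannot do.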

Encode the strings as integer ordinal codes.

data_origin['property_other'] = data_origin['property_other'].astype('category').values.codes + 1

Next, invest_other records the investment activities the respondent takes part in. Check how it was filled in.

pd.DataFrame(data_origin[data_origin['invest_other'] != -1]['invest_other'].unique())

In the same way, convert it to integer ordinal codes.

data_origin['invest_other'] = data_origin['invest_other'].astype('category').values.codes + 1

data.drop(['survey_time','survey_type','province','city','county','marital_1st','s_birth','marital_now'],axis=1,inplace=True)

Use a random forest model to fill in the missing data

# Random forest imputation: fill columns in order, starting with the one that has the fewest missing values
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

datas_pre['inc_ability'] = pd.to_numeric(datas_pre['inc_ability'])
y_full = data_train['happiness']
data_pre_reg = datas_pre.copy()
sort_index = np.argsort(data_pre_reg.isnull().sum(axis=0)).values
data_pre_reg.columns = [x for x in range(len(data_pre_reg.columns))]

%%time
for i in sort_index:
    df = data_pre_reg
    # Target: the column being filled
    fillc = df.iloc[:, i]
    # Skip columns with nothing to fill (predict() fails on an empty matrix)
    if not fillc.isnull().any():
        continue
    # Features: every other column plus the label
    df = pd.concat([df.iloc[:, df.columns != i], pd.DataFrame(y_full)], axis=1)
    # Fill the remaining gaps in the feature matrix with 0
    imp_0 = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
    df_0 = pd.DataFrame(imp_0.fit_transform(df))
    # Rows where the target is present are used for training
    Ytrain = fillc[fillc.notnull()]
    # Rows where the target is missing; only their index is needed
    Ytest = fillc[fillc.isnull()]
    Xtrain = df_0.iloc[Ytrain.index, :]
    Xtest = df_0.iloc[Ytest.index, :]
    # Fit a random forest regressor and predict the missing values
    rfc = RandomForestRegressor(n_estimators=100, n_jobs=-1)
    rfc = rfc.fit(Xtrain, Ytrain)
    Ypredict = rfc.predict(Xtest)
    data_pre_reg.loc[data_pre_reg.iloc[:, i].isnull(), i] = Ypredict
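To make the idea concrete, here is a self-contained sketch of the same approach on a small synthetic frame. Columns are processed from fewest to most missing values, gaps in the other columns are temporarily filled with 0, and a random forest predicts the missing entries. The `rf_impute` helper and the synthetic data are assumptions for illustration; unlike the loop above, the sketch does not add the label as a feature.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def rf_impute(df, n_estimators=50, random_state=0):
    """Fill NaNs column by column with a random forest regressor."""
    out = df.copy()
    # Column names ordered from fewest to most missing values.
    order = out.isnull().sum().sort_values().index
    for col in order:
        missing = out[col].isnull()
        if not missing.any():
            continue  # nothing to fill in this column
        # Features: every other column, with remaining gaps set to 0.
        features = out.drop(columns=col).fillna(0)
        model = RandomForestRegressor(
            n_estimators=n_estimators, random_state=random_state
        )
        model.fit(features[~missing], out.loc[~missing, col])
        out.loc[missing, col] = model.predict(features[missing])
    return out

# Synthetic data: b depends on a, c is independent noise.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(size=60)})
demo["b"] = 2 * demo["a"] + rng.normal(scale=0.1, size=60)
demo["c"] = rng.normal(size=60)
demo.loc[[3, 10, 25], "b"] = np.nan
demo.loc[[5, 40], "c"] = np.nan

filled = rf_impute(demo)
print(filled.isnull().sum())
```

Filling the least-damaged columns first means that later models train on features that contain fewer placeholder zeros.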

Check for remaining missing values

data_pre_reg.isnull().sum()

Model building and hyperparameter tuning

# Split the dataset
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X_train, y_full, test_size=0.3, random_state=1227)

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor

model_names = [
    # 'linear_reg',
    # 'RandomForestRegressor',
    'GradientBoostingRegressors',
    'svr',
    'KNeighborsRegressors',
    'AdaBoostRegressors',
    # 'XGBoost',
    # 'lightGBM'
]
models = [
    # LinearRegression(),
    # RandomForestRegressor(random_state=666),
    GradientBoostingRegressor(random_state=666),
    SVR(),
    KNeighborsRegressor(),
    AdaBoostRegressor(random_state=666),
]

parm_grids = [
    # {'RandomForestRegressor__max_depth': [3, 6, 9],
    #  'RandomForestRegressor__n_estimators': [10, 50, 70],
    #  'RandomForestRegressor__min_samples_split': [3, 5, 7, 9],
    #  'RandomForestRegressor__min_samples_leaf': range(2, 5)},
    {'GradientBoostingRegressors__n_estimators': [10, 50, 100],
     'GradientBoostingRegressors__max_depth': range(2, 9),
     'GradientBoostingRegressors__min_samples_split': [3, 5, 7, 9],
     'GradientBoostingRegressors__min_samples_leaf': range(2, 5)},
    # degree only affects SVR when kernel='poly'; the default kernel is 'rbf'
    {'svr__degree': [2, 3, 4]},
    {'KNeighborsRegressors__n_neighbors': [2, 5, 7, 9, 10]},
    {'AdaBoostRegressors__n_estimators': [2, 9, 10, 13, 15, 20, 50, 100]},
]

def Grid(pipeline, train_x, train_y, test_x, test_y, param_grid):
    response = {}
    gridsearch = GridSearchCV(pipeline, param_grid=param_grid, cv=3)
    search = gridsearch.fit(train_x, train_y)
    print('Best parameters:', search.best_params_)
    print('Best CV score (R^2): %0.4f' % search.best_score_)
    predict_y = gridsearch.predict(test_x)
    # Score against the test_y argument rather than the global ytest
    mse = mean_squared_error(test_y, predict_y)
    response['mse'] = mse
    return response

%%time
for model, model_name, parm_grid in zip(models, model_names, parm_grids):
    # print(model_name, model)
    pipeline = Pipeline([
        # ('sta', StandardScaler()),
        # ('pca', PCA()),
        (model_name, model),
    ])
    result = Grid(pipeline, Xtrain, ytrain, Xtest, ytest, parm_grid)
    print(result)
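The keys in the parameter grids follow scikit-learn's `<step name>__<parameter>` convention, which is why each grid is prefixed with the name its model is given in the pipeline. A minimal sketch on synthetic data (the step name `knn` and the data are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

# Synthetic regression data: y depends mostly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 3))
y = X[:, 0] + 0.1 * rng.normal(size=90)

pipeline = Pipeline([("knn", KNeighborsRegressor())])

# "knn__n_neighbors" targets the n_neighbors parameter of the step named "knn".
param_grid = {"knn__n_neighbors": [2, 5, 9]}

search = GridSearchCV(pipeline, param_grid=param_grid, cv=3)
search.fit(X, y)
best_k = search.best_params_["knn__n_neighbors"]
print(best_k, search.best_score_)
```

If the prefix does not match a step name, `GridSearchCV` raises an error about an invalid parameter when fitting, so the model names and grid keys have to be kept in sync.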

Building and tuning XGBoost and LightGBM

# Try the two heavy hitters
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor

# XGBoost with default parameters
xgboo = XGBRegressor().fit(Xtrain, ytrain)
predict = xgboo.predict(Xtest)
print('XGBoost MSE:', mean_squared_error(ytest, predict))
print('XGBoost R^2:', r2_score(ytest, predict))

# LightGBM with default parameters
lgbm = LGBMRegressor().fit(Xtrain, ytrain)
predict = lgbm.predict(Xtest)
print('LightGBM MSE:', mean_squared_error(ytest, predict))
print('LightGBM R^2:', r2_score(ytest, predict))

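# Best parameters found
xgb1 = XGBRegressor(
    max_depth=6,
    learning_rate=0.01,
    n_estimators=3000,
    verbosity=1,  # the original used silent=False, which newer XGBoost versions no longer accept
    objective='reg:squarederror',
    booster='gbtree',
    n_jobs=-1,
    gamma=5.4,
    min_child_weight=6,
    subsample=0.8,
    colsample_bytree=1,
    reg_lambda=1.39,
    random_state=7,  # the original used seed=7; random_state is the scikit-learn style name
)
# Train the model, predict on the held-out split, and report MSE and R^2
xgb1_best2 = xgb1.fit(Xtrain, ytrain)
predicts = xgb1_best2.predict(Xtest)
print('Best model MSE:', mean_squared_error(ytest, predicts))
print('Best model R^2:', r2_score(ytest, predicts))

model = xgb1.fit(Xtrain, ytrain)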

Saving the model

model.save_model(r'The Best Model')

Predicting with the model

pre = model.predict(X_test)
