Alibaba Cloud "快来一起挖掘幸福感!" (Mining for Happiness) Competition: A Hands-On Project
Project link

1. Data Preparation

The questionnaire data contains around 40 features covering personal information such as occupation, marital status, income, and education; the label is the happiness score.

First, preprocess the data:

(1) The second column is the label (the happiness score); it is extracted on its own first.

(2) The seventh column is the survey time. It has little bearing on the result, and since it is a string it is dropped for now.

(3) The data contains 20197 missing cells, which need to be filled in (a quick way to check this is sketched below).
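Before filling anything, it helps to see where the gaps are. This is a minimal sketch of my own (not part of the original code) that counts missing cells with pandas:

import pandas as pd

datatrain = pd.read_csv('happiness_train_complete.csv', encoding="gb2312")
print(datatrain.isnull().sum().sum())   # total missing cells in the raw file
print(datatrain.isnull().sum().sort_values(ascending=False).head(10))  # columns with the most gaps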

2. Training and Prediction

The given data comes in a train part and a test part. The train part is first split in two so the models can be trained and scored, and the final predictions are then made on test.
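A minimal sketch of that split, assuming the x_train / y_train matrices produced in section 3.1:

from sklearn.model_selection import train_test_split

# hold out 30% of train for validation, matching the split used in section 3.3
x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train, test_size=0.3)
# fit models on (x_tr, y_tr), score them on (x_val, y_val), then predict on test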

3. Code Walkthrough

3.1 Data Preprocessing

import pandas as pd
import numpy as np
from scipy import stats

datatrain = pd.read_csv('happiness_train_complete.csv', encoding="gb2312")
datatest = pd.read_csv('happiness_test_complete.csv', encoding="gb2312")
dataplot = datatrain.copy()

# drop rows whose label is the "invalid answer" code -8
datatrain = datatrain[datatrain["happiness"] != -8].reset_index(drop=True)
dataplot = dataplot[dataplot["happiness"] != -8].reset_index(drop=True)

target_col = "happiness"
target = datatrain[target_col]
del datatrain['id']
del datatest['id']
label = datatrain['happiness']
del datatrain['happiness']

# process train and test together
dataproc = pd.concat([datatrain, datatest], ignore_index=True)
dataproc['survey_type'] = dataproc['survey_type'].map(lambda x: x - 1)  # map to 0/1

# mean happiness per province, used to bucket provinces into three regions
count = []
for i in range(1, 32):
    count.append(dataplot.loc[dataplot['province'] == i, 'happiness'].mean())
count = [i if not pd.isnull(i) else 3 for i in count]  # provinces with no data default to 3
# plt.scatter(range(1, 32), count)
reg1 = [i for i in range(1, 32) if count[i-1] < 3.2]
reg2 = [i for i in range(1, 32) if 3.2 < count[i-1] < 3.9]
reg3 = [i for i in range(1, 32) if count[i-1] >= 3.9]

def spl(x):
    # hand-picked province list
    if x in [2, 3, 8, 13, 14, 20, 23, 25, 26, 30]:
        return 0
    else:
        return 1

def spl1(x):
    if x in reg1:
        return 0
    elif x in reg2:
        return 1
    elif x in reg3:
        return 2

dataproc['province_1'] = dataproc['province'].map(spl)   # two new province features
dataproc['province_2'] = dataproc['province'].map(spl1)
dataproc['gender'] = dataproc['gender'].map(lambda x: x - 1)  # map to 0/1
# survey year minus birth year
dataproc['age'] = dataproc['survey_time'].map(lambda x: int(x[:4])) - dataproc['birth']

dataproc.loc[dataproc['nationality'] < 0, 'nationality'] = 1
dataproc = dataproc.join(pd.get_dummies(dataproc["nationality"], prefix="nationality"))

def nation(x):
    return 1 if x == 1 else 0

dataproc['nationality1'] = dataproc['nationality'].map(nation)  # new feature: Han ethnicity or not
del dataproc['nationality']

def relfreq(x):
    if x < 2:
        return 0
    elif x < 5:
        return 1
    else:
        return 2

dataproc['religion_freq'] = dataproc['religion_freq'].map(relfreq)

# note: the [0][0] indexing on stats.mode follows the older scipy API
dataproc.loc[dataproc['edu'] < 0, 'edu'] = stats.mode(dataproc['edu'])[0][0]
del dataproc['edu_other']
dataproc = dataproc.join(pd.get_dummies(dataproc["edu_status"], prefix="edu_status"))
del dataproc["edu_status"]

def eduyr(x):
    return x if (x > 0) and (not pd.isnull(x)) else 0

dataproc['edu_yr'] = dataproc['edu_yr'].map(eduyr)
# graduation year minus birth year = age when finishing education
dataproc['edu_yr'] = dataproc['edu_yr'] - dataproc['birth']

def eduyr1(x):
    return x if x > 0 else 0

dataproc['edu_yr'] = dataproc['edu_yr'].map(eduyr1)

dataproc.loc[dataproc['income'] < 0, 'income'] = stats.mode(dataproc['income'])[0][0]
dataproc['income'] = dataproc['income'].map(lambda x: np.log(x + 1))  # log-transform income

dataproc.loc[dataproc['political'] < 0, 'political'] = 1
dataproc = dataproc.join(pd.get_dummies(dataproc["political"], prefix="political"))
del dataproc['political']

def joinparty(x):
    if pd.isnull(x) or x < 0:
        return 0
    return x

dataproc['join_party'] = (dataproc['join_party'] - dataproc['birth']).map(joinparty)
del dataproc['property_other']

# correct implausibly low weights (weight_jin is in jin, i.e. half-kilograms)
dataproc.loc[(dataproc['weight_jin'] <= 80) & (dataproc['height_cm'] >= 160), 'weight_jin'] = dataproc['weight_jin'] * 2
dataproc.loc[dataproc['weight_jin'] <= 60, 'weight_jin'] = dataproc['weight_jin'] * 2
dataproc['bmi'] = dataproc['weight_jin'].map(lambda x: x / 2) / dataproc['height_cm'].map(lambda x: (x / 100) ** 2)

# fill negative codes with the column mode
for col in ['health', 'health_problem', 'depression'] + ['media_' + str(i) for i in range(1, 7)]:
    dataproc.loc[dataproc[col] < 0, col] = stats.mode(dataproc[col])[0][0]
dataproc['media'] = sum(dataproc['media_' + str(i)] for i in range(1, 7)) / 6

for i in range(1, 13):
    col = 'leisure_' + str(i)
    dataproc.loc[dataproc[col] < 0, col] = stats.mode(dataproc[col])[0][0]
dataproc['leisure'] = sum(dataproc['leisure_' + str(i)] for i in range(1, 13)) / 12

for col in ['socialize', 'relax', 'learn']:
    dataproc.loc[dataproc[col] < 0, col] = stats.mode(dataproc[col])[0][0]

socialneimode = stats.mode(dataproc['social_neighbor'])[0][0]
dataproc['social_neighbor'] = dataproc['social_neighbor'].map(
    lambda x: socialneimode if pd.isnull(x) or x < 0 else x)
socialfrimode = stats.mode(dataproc['social_friend'])[0][0]
dataproc['social_friend'] = dataproc['social_friend'].map(
    lambda x: socialfrimode if pd.isnull(x) or x < 0 else x)

for col in ['socia_outing', 'equity', 'class', 'class_10_before', 'class_10_after']:
    dataproc.loc[dataproc[col] < 0, col] = stats.mode(dataproc[col])[0][0]
dataproc['class_new_1'] = dataproc['class'] - dataproc['class_10_before']  # construct new features
dataproc['class_new_2'] = dataproc['class'] - dataproc['class_10_after']
dataproc.loc[dataproc['class_14'] < 0, 'class_14'] = stats.mode(dataproc['class_14'])[0][0]

dataproc = dataproc.join(pd.get_dummies(dataproc["work_exper"], prefix="work_exper"))

def workstat(x):
    if pd.isnull(x) or x < 0:
        return 9
    return x

dataproc['work_status'] = dataproc['work_status'].map(workstat)
dataproc = dataproc.join(pd.get_dummies(dataproc["work_status"], prefix="work_status"))

data = dataproc

# work_type
data['work_type'] = data['work_type'].fillna(100)
data = pd.concat([data, pd.get_dummies(data['work_type'], prefix='work_type')], axis=1)
del data['work_type']

# work_manage
data['work_manage'] = data['work_manage'].fillna('b')
data = pd.concat([data, pd.get_dummies(data['work_manage'], prefix='work_manage')], axis=1)
del data['work_manage']

# insur_1 .. insur_4: keep codes 1/2, map everything else to 'other', then one-hot
for k in range(1, 5):
    col = 'insur_' + str(k)
    data[col] = ['other' if i != 1 and i != 2 else i for i in list(data[col])]
    data = pd.concat([data, pd.get_dummies(data[col], prefix=col)], axis=1)
    del data[col]

# family income: replace negative codes with the median of the valid values
median = np.median(data[data['family_income'] >= 0]['family_income'])
data['family_income'] = [i if i >= 0 else median for i in data['family_income']]
median = np.median(data[data['income'] >= 0]['income'])
data['income'] = [i if i >= 0 else median for i in data['income']]
data['income_family_income'] = data['income'] / data['family_income']

# car
data = pd.concat([data, pd.get_dummies(data['car'], prefix='car')], axis=1)
del data['car']

# invest_other
del data['invest_other']

# marital
data = pd.concat([data, pd.get_dummies(data['marital'], prefix='marital')], axis=1)
del data['marital']
del data['marital_1st']

# s_political
data = pd.concat([data, pd.get_dummies(data['s_political'], prefix='s_political')], axis=1)
del data['s_political']

# s_hukou
data = pd.concat([data, pd.get_dummies(data['s_hukou'], prefix='s_hukou')], axis=1)
del data['s_hukou']

# s_income
median = np.median(data[data['s_income'] >= 0]['s_income'])
data['s_income'] = [i if i >= 0 else median for i in data['s_income']]

# s_work_exper
data = pd.concat([data, pd.get_dummies(data['s_work_exper'], prefix='s_work_exper')], axis=1)
del data['s_work_exper']

# s_work_status
data = pd.concat([data, pd.get_dummies(data['s_work_status'], prefix='s_work_status')], axis=1)
del data['s_work_status']

# s_work_type
data = pd.concat([data, pd.get_dummies(data['s_work_type'], prefix='s_work_type')], axis=1)
del data['s_work_type']

# f_political
data = pd.concat([data, pd.get_dummies(data['f_political'], prefix='f_political')], axis=1)
del data['f_political']

# view
data = pd.concat([data, pd.get_dummies(data['view'], prefix='view')], axis=1)
del data['view']

# inc_exp
median = np.median(data[data['inc_exp'] >= 0]['inc_exp'])
data['inc_exp'] = [median if i < 0 else i for i in data['inc_exp']]
data['inc_exp_cha'] = data['inc_exp'] - data['income']  # gap between expected and (log) actual income

colnames = list(data.columns)

# minor_child: fill negative codes with the mode (colnames[77] indexes the column whose mode is used)
mode = data[colnames[77]].mode().values[0]
data['minor_child'] = [i if i >= 0 else mode for i in data['minor_child']]

# s_birth
del data['s_birth']
# marital_now
del data['marital_now']

# s_edu
mode = data[colnames[80]].mode().values[0]
data['s_edu'] = [i if i >= 0 else mode for i in data['s_edu']]

# work_yr
del data['work_yr']
# hukou_loc
del data['hukou_loc']

# income_family_income: fill NaN with the mean, then zero out inf from division by zero
data['income_family_income'] = data['income_family_income'].fillna(np.mean(data['income_family_income']))

del data['birth']
del data['province']
del data['city']
del data['county']
del data['f_birth']

def ff(x):
    return 0 if x == np.inf else x

data['income_family_income'] = list(map(ff, data['income_family_income']))

# split back into train/test and keep only numeric, non-time features
train_shape = datatrain.shape[0]
use_fea = [clo for clo in data.columns if clo != 'survey_time' and data[clo].dtype != object]
x_train = data[:train_shape][use_fea].values
y_train = target
x_test = data[train_shape:][use_fea].values

pd.DataFrame(x_train, columns=use_fea).to_csv('x_train.csv', index=False)
pd.DataFrame(x_test, columns=use_fea).to_csv('x_test.csv', index=False)
pd.DataFrame(list(y_train), columns=['target']).to_csv('y_train.csv', index=False)
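A quick sanity check (my addition, not in the original) to confirm the saved matrices line up before moving on:

import pandas as pd

x_tr = pd.read_csv('x_train.csv')
x_te = pd.read_csv('x_test.csv')
y_tr = pd.read_csv('y_train.csv')
assert list(x_tr.columns) == list(x_te.columns)   # train/test share the same features
assert len(x_tr) == len(y_tr)                     # one label per training row
print(x_tr.shape, x_te.shape)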

3.2 Defining the Models to Use

from hyperopt import hp
import numpy as np

############
## Config ##
############

debug = False

## xgboost
xgb_random_seed = 2019
xgb_nthread = 2
xgb_dmatrix_silent = True

## sklearn
skl_random_seed = 2019
skl_n_jobs = 2

if debug:
    # tiny search ranges and a single evaluation per model for quick smoke tests
    xgb_nthread = 1
    skl_n_jobs = 1
    xgb_min_num_round, xgb_max_num_round, xgb_num_round_step = 5, 10, 5
    skl_min_n_estimators, skl_max_n_estimators, skl_n_estimators_step = 5, 10, 5
    libfm_min_iter, libfm_max_iter, iter_step = 5, 10, 5
    max_evals = 1
else:
    xgb_min_num_round, xgb_max_num_round, xgb_num_round_step = 10, 500, 10
    skl_min_n_estimators, skl_max_n_estimators, skl_n_estimators_step = 100, 1000, 20
    libfm_min_iter, libfm_max_iter, iter_step = 10, 500, 10
    max_evals = 200

# every model family gets the same hyperopt budget
hyperopt_param = {name + "_max_evals": max_evals
                  for name in ["xgb", "rf", "etr", "gbm", "lr", "ridge",
                               "lasso", "svr", "dnn", "libfm", "rgf"]}

########################################
## Parameter Space for XGBoost models ##
########################################

## regression with the linear booster
param_space_xgb_reg = {
    'task': 'xgb_reg',
    'eta': hp.quniform('eta', 0.01, 1, 0.01),
    'lambda': hp.quniform('lambda', 0, 5, 0.05),
    'alpha': hp.quniform('alpha', 0, 0.5, 0.005),
    'lambda_bias': hp.quniform('lambda_bias', 0, 3, 0.1),
    'num_round': hp.quniform('num_round', xgb_min_num_round, xgb_max_num_round, xgb_num_round_step),
    'nthread': xgb_nthread,
    'silent': 1,
    'seed': xgb_random_seed,
    "max_evals": hyperopt_param["xgb_max_evals"],
}

########################################
## Parameter Space for Sklearn Models ##
########################################

## random forest regressor
param_space_reg_skl_rf = {
    'task': 'reg_skl_rf',
    'n_estimators': hp.quniform("n_estimators", skl_min_n_estimators, skl_max_n_estimators, skl_n_estimators_step),
    'max_features': hp.quniform("max_features", 0.05, 1.0, 0.05),
    'max_depth': hp.quniform('max_depth', 1, 30, 1),
    'min_samples_split': hp.quniform('min_samples_split', 2, 10, 1),
    'min_samples_leaf': hp.quniform('min_samples_leaf', 1, 10, 1),
    'n_jobs': skl_n_jobs,
    'random_state': skl_random_seed,
    "max_evals": hyperopt_param["rf_max_evals"],
}

## extra trees regressor
param_space_reg_skl_etr = {
    'task': 'reg_skl_etr',
    'n_estimators': hp.quniform("n_estimators", skl_min_n_estimators, skl_max_n_estimators, skl_n_estimators_step),
    'max_features': hp.quniform("max_features", 0.05, 1.0, 0.05),
    'n_jobs': skl_n_jobs,
    'random_state': skl_random_seed,
    "max_evals": hyperopt_param["etr_max_evals"],
}

## gradient boosting regressor
param_space_reg_skl_gbm = {
    'task': 'reg_skl_gbm',
    'n_estimators': hp.quniform("n_estimators", skl_min_n_estimators, skl_max_n_estimators, skl_n_estimators_step),
    'learning_rate': hp.quniform("learning_rate", 0.01, 0.5, 0.01),
    'max_features': hp.quniform("max_features", 0.05, 1.0, 0.05),
    'max_depth': hp.quniform('max_depth', 1, 15, 1),
    'subsample': hp.quniform('subsample', 0.5, 1, 0.1),
    'min_samples_split': hp.quniform('min_samples_split', 2, 10, 1),
    'min_samples_leaf': hp.quniform('min_samples_leaf', 1, 10, 1),
    'random_state': skl_random_seed,
    "max_evals": hyperopt_param["gbm_max_evals"],
}

## support vector regression
param_space_reg_skl_svr = {
    'task': 'reg_skl_svr',
    'C': hp.loguniform("C", np.log(1), np.log(100)),
    'gamma': hp.loguniform("gamma", np.log(0.001), np.log(0.1)),
    'degree': hp.quniform('degree', 1, 5, 1),
    'epsilon': hp.loguniform("epsilon", np.log(0.001), np.log(0.1)),
    'kernel': hp.choice('kernel', ['rbf', 'poly']),
    "max_evals": hyperopt_param["svr_max_evals"],
}

## ridge regression
param_space_reg_skl_ridge = {
    'task': 'reg_skl_ridge',
    'alpha': hp.loguniform("alpha", np.log(0.01), np.log(20)),
    'random_state': skl_random_seed,
    "max_evals": hyperopt_param["ridge_max_evals"],
}

## lasso
param_space_reg_skl_lasso = {
    'task': 'reg_skl_lasso',
    'alpha': hp.loguniform("alpha", np.log(0.00001), np.log(0.1)),
    'random_state': skl_random_seed,
    "max_evals": hyperopt_param["lasso_max_evals"],
}

## integer-valued hyperparameters that must be cast back to int after sampling;
## the keras/libfm/rgf names are leftovers from the original template and unused here
int_feat = ["num_round", "n_estimators", "max_depth", "degree",
            "min_samples_split", "min_samples_leaf",
            "hidden_units", "hidden_layers", "batch_size", "nb_epoch",
            "dim", "iter", "max_leaf_forest", "num_iteration_opt",
            "num_tree_search", "min_pop", "opt_interval"]

####################
## All the Models ##
####################
feat_names = []
param_spaces = {}

## register xgboost regression plus the six sklearn regressors
for feat_name, space in [("xgb_reg", param_space_xgb_reg),
                         ("reg_skl_etr", param_space_reg_skl_etr),
                         ("reg_skl_rf", param_space_reg_skl_rf),
                         ("reg_skl_gbm", param_space_reg_skl_gbm),
                         ("reg_skl_svr", param_space_reg_skl_svr),
                         ("reg_skl_ridge", param_space_reg_skl_ridge),
                         ("reg_skl_lasso", param_space_reg_skl_lasso)]:
    feat_names.append(feat_name)
    param_spaces[feat_name] = space
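To see what one draw from a search space looks like, hyperopt can sample it directly. The snippet below is my addition for illustration, assuming the file above is saved as model_library.py (which the imports in section 3.3 suggest):

from hyperopt.pyll.stochastic import sample
from model_library import param_space_reg_skl_ridge

# one random draw; constant entries like 'task' and 'max_evals' pass through unchanged
print(sample(param_space_reg_skl_ridge))
# e.g. {'alpha': 0.8..., 'max_evals': 200, 'random_state': 2019, 'task': 'reg_skl_ridge'}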

3.3 Tuning Each Model to Its Best Hyperparameters

import pandas as pd
import numpy as np

from model_param_opt import hyperopt_wrapper, hyperopt_train_test
from model_library import feat_names, param_spaces
from hyperopt import fmin, Trials, tpe
from sklearn.model_selection import train_test_split


def data_prepare():
    x_train = pd.read_csv('x_train.csv').values
    y_train = pd.read_csv('y_train.csv').values
    # x_test = pd.read_csv('x_test.csv').values
    y_train = [i[0] for i in y_train]
    # 70/30 train/validation split used to score each candidate parameter set
    data_used = list(train_test_split(x_train, y_train, test_size=0.3))
    return data_used


if __name__ == '__main__':
    data_used = data_prepare()
    with open('model_log_best_params.txt', 'w') as f:
        for i in range(len(feat_names)):
            param_space = param_spaces[feat_names[i]]
            trials = Trials()
            objective = lambda p: hyperopt_wrapper(p, data_used)
            # TPE search over the space, bounded by the per-model max_evals budget
            best_params = fmin(objective, param_space, algo=tpe.suggest,
                               trials=trials, max_evals=param_space["max_evals"])
            print(best_params)
            f.write('%s;%s' % (feat_names[i], str(best_params)))
            f.write('\n')
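The model_param_opt module itself is not shown in the post. The sketch below is only my guess at its shape: the names hyperopt_wrapper and data_used come from the import above, but the body is an assumption, not the author's code.

from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def hyperopt_wrapper(param, data_used):
    # assumption: data_used = [x_tr, x_val, y_tr, y_val] from train_test_split in data_prepare()
    x_tr, x_val, y_tr, y_val = data_used
    # the real module presumably dispatches on param['task'] to build each model;
    # a single branch is shown here as an example
    if param['task'] == 'reg_skl_ridge':
        model = Ridge(alpha=param['alpha'], random_state=param['random_state'])
    else:
        raise NotImplementedError(param['task'])
    model.fit(x_tr, y_tr)
    # fmin minimizes the returned value, so return the validation MSE
    return mean_squared_error(y_val, model.predict(x_val))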

3.4 Ensembling the Tuned Models for the Final Result

from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.linear_model import Ridge, Lasso, BayesianRidge
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, RepeatedKFold
from model_library import int_feat


def int_feat_f(param):
    # hyperopt returns floats; cast the integer-valued parameters back to int
    for f in int_feat:
        if f in param:
            param[f] = int(param[f])
    return param


def cv_predict(model_cls, params, X_train, y_train, X_test, name):
    # 5-fold CV: out-of-fold predictions on train, fold-averaged predictions on test
    print(name)
    folds = KFold(n_splits=5, shuffle=True, random_state=2019)
    oof = np.zeros(len(y_train))
    predictions = np.zeros(len(X_test))
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):
        print("fold n°{}".format(fold_ + 1))
        model = model_cls(**params).fit(X_train[trn_idx], y_train[trn_idx])
        oof[val_idx] = model.predict(X_train[val_idx])
        predictions += model.predict(X_test) / folds.n_splits
    print("CV score: {:<8.8f}".format(mean_squared_error(oof, y_train)))
    return oof, predictions


def ensemble_run():
    x_train = pd.read_csv('x_train.csv').values
    x_test = pd.read_csv('x_test.csv').values
    target = pd.read_csv('y_train.csv')
    y_train = np.array([i[0] for i in target.values])
    X_train, X_test = x_train, x_test

    # eval() reconstructs the best-params dicts written by section 3.3
    with open('model_log_best_params.txt', 'r') as f:
        a = f.readlines()
    model_param_dict = {i.split(";")[0]: eval(i.split(";")[1][:-1]) for i in a}

    xgb_params = int_feat_f(model_param_dict['xgb_reg'])
    oof_xgb, predictions_xgb = cv_predict(xgb.XGBRegressor, xgb_params,
                                          X_train, y_train, X_test, 'xgb')

    etr_params = int_feat_f(model_param_dict['reg_skl_etr'])
    oof_etr, predictions_etr = cv_predict(ExtraTreesRegressor, etr_params,
                                          X_train, y_train, X_test, 'ExtraTreesRegressor')

    rfr_params = int_feat_f(model_param_dict['reg_skl_rf'])
    oof_rfr, predictions_rfr = cv_predict(RandomForestRegressor, rfr_params,
                                          X_train, y_train, X_test, 'RandomForestRegressor')

    gbr_params = int_feat_f(model_param_dict['reg_skl_gbm'])
    oof_gbr, predictions_gbr = cv_predict(GradientBoostingRegressor, gbr_params,
                                          X_train, y_train, X_test, 'GradientBoostingRegressor')

    svr_params = int_feat_f(model_param_dict['reg_skl_svr'])
    # hp.choice stores the index of the chosen kernel; map it back to its name
    svr_params['kernel'] = ['rbf', 'poly'][svr_params['kernel']]
    oof_svr, predictions_svr = cv_predict(SVR, svr_params,
                                          X_train, y_train, X_test, 'SVR')

    ridge_params = int_feat_f(model_param_dict['reg_skl_ridge'])
    oof_ridge, predictions_ridge = cv_predict(Ridge, ridge_params,
                                              X_train, y_train, X_test, 'Ridge')

    lasso_params = int_feat_f(model_param_dict['reg_skl_lasso'])
    oof_lasso, predictions_lasso = cv_predict(Lasso, lasso_params,
                                              X_train, y_train, X_test, 'Lasso')

    # stacking: feed the seven out-of-fold prediction columns to a BayesianRidge meta-model
    print('stacking')
    train_stack = np.vstack([oof_xgb, oof_etr, oof_gbr, oof_lasso,
                             oof_rfr, oof_ridge, oof_svr]).transpose()
    test_stack = np.vstack([predictions_xgb, predictions_etr, predictions_gbr,
                            predictions_lasso, predictions_rfr,
                            predictions_ridge, predictions_svr]).transpose()
    folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=2019)
    oof_stack = np.zeros(train_stack.shape[0])
    predictions = np.zeros(test_stack.shape[0])
    for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack, target)):
        print("fold {}".format(fold_))
        trn_data, trn_y = train_stack[trn_idx], target.iloc[trn_idx].values
        val_data, val_y = train_stack[val_idx], target.iloc[val_idx].values
        clf_3 = BayesianRidge()
        clf_3.fit(trn_data, trn_y.ravel())
        oof_stack[val_idx] = clf_3.predict(val_data)
        predictions += clf_3.predict(test_stack) / 10  # 5 splits x 2 repeats

    print("stacking CV score: {:<8.8f}".format(mean_squared_error(target.values, oof_stack)))

    submit_example = pd.read_csv('happiness_submit.csv', encoding="gb2312")
    submit_example['happiness'] = predictions
    submit_example.to_csv('result.csv', index=False)


if __name__ == '__main__':
    ensemble_run()
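The three scripts run in order: preprocessing writes x_train.csv / x_test.csv / y_train.csv, tuning writes model_log_best_params.txt, and the ensemble reads both to produce result.csv. Stacking the out-of-fold predictions, rather than predictions from models refit on all of train, keeps the BayesianRidge meta-model from seeing labels its input columns were trained on.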
