Case Study 2: Steam Volume Prediction

1. Basic Information

Data

  • Training data (train.txt)
  • Test data (test.txt)
  • Feature columns: V0-V37
  • Target column: target

Goal: train a model on the training data and use it to predict the target variable of the test data.

Evaluation Metric

Mean squared error (MSE):

$$Score = \frac{1}{n}\sum_{i=1}^{n}(y_i - y^*)^2$$
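The metric can be checked directly with NumPy; a minimal sketch with made-up numbers (not the competition data):

```python
import numpy as np

# Hypothetical true values and predictions, for illustration only
y_true = np.array([0.5, -1.2, 0.3, 2.0])
y_pred = np.array([0.4, -1.0, 0.5, 1.7])

# Score = (1/n) * sum((y_i - y*)^2)
score = np.mean((y_true - y_pred) ** 2)
print(score)  # 0.045
```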

2. Loading the Data

import numpy as np
import pandas as pd
data_train = pd.read_csv('train.txt', sep='\t')
data_test = pd.read_csv('test.txt', sep='\t')
data_train.head(1)
V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V29 V30 V31 V32 V33 V34 V35 V36 V37 target
0 0.566 0.016 -0.143 0.407 0.452 -0.901 -1.812 -2.36 -0.436 -2.114 ... 0.136 0.109 -0.615 0.327 -4.627 -4.789 -5.101 -2.608 -3.508 0.175

1 rows × 39 columns

data_test.head(1)
V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V28 V29 V30 V31 V32 V33 V34 V35 V36 V37
0 0.368 0.38 -0.225 -0.049 0.379 0.092 0.55 0.551 0.244 0.904 ... -0.449 0.047 0.057 -0.042 0.847 0.534 -0.009 -0.19 -0.567 0.388

1 rows × 38 columns

# Merge the training and test data, marking the source of each row
data_train['oringin'] = "train"
data_test['oringin'] = 'test'
data_all = pd.concat([data_train, data_test], axis=0, ignore_index=True)
data_all.head(2)
V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V30 V31 V32 V33 V34 V35 V36 V37 target oringin
0 0.566 0.016 -0.143 0.407 0.452 -0.901 -1.812 -2.36 -0.436 -2.114 ... 0.109 -0.615 0.327 -4.627 -4.789 -5.101 -2.608 -3.508 0.175 train
1 0.968 0.437 0.066 0.566 0.194 -0.893 -1.566 -2.36 0.332 -2.114 ... 0.124 0.032 0.600 -0.843 0.160 0.364 -0.335 -0.730 0.676 train

2 rows × 40 columns

3. Exploring the Data

3.1 Feature Distributions

import seaborn as sns
import matplotlib.pyplot as plt
# Use kdeplot (kernel density estimation) to compare the distribution of each feature in the training and test sets
for column in data_all.columns[0:-2]:
    g = sns.kdeplot(data_all[column][data_all['oringin']=='train'], color='Red', shade=True)
    g = sns.kdeplot(data_all[column][data_all['oringin']=='test'], color='Blue', ax=g, shade=True)
    g.set_xlabel(column)
    g.set_ylabel('Frequency')
    g = g.legend(['train', 'test'])
    plt.show()










3.2 Feature Correlations

# Take the training rows and drop the oringin column to study relationships between the features
data_train1 = data_all[data_all['oringin']=='train'].drop(['oringin'], axis=1)
# Plot
plt.figure(figsize=(20,16))
# Column names
col = data_train1.columns.tolist()
# Spearman correlation matrix
mcorr = data_train1[col].corr(method='spearman')
# Boolean mask with the same shape as mcorr
mask = np.zeros_like(mcorr, dtype=bool)
# Mask the upper triangle; np.triu_indices_from() returns the indices of a square matrix's upper triangle
mask[np.triu_indices_from(mask)] = True
# Diverging matplotlib colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap
g = sns.heatmap(mcorr, mask=mask, cmap=cmap, square=True, annot=True, fmt='0.2f')
plt.show()
plt.show()

Dimensionality reduction: drop the features whose absolute correlation with the target is below a threshold.

threshold = 0.1
corr_matrix = data_train1.corr(method='spearman').abs()
drop_col = corr_matrix[corr_matrix['target']<threshold].index
data_all.drop(drop_col, axis=1, inplace=True)
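On a toy frame the same pattern looks like this; the columns `a` and `b` are invented for illustration, not part of the competition data:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
t = rng.randn(200)
df = pd.DataFrame({
    'a': t + 0.1 * rng.randn(200),   # strongly related to the target
    'b': rng.randn(200),             # pure noise
    'target': t,
})

# Features whose |Spearman correlation| with the target is below the threshold
corr = df.corr(method='spearman').abs()
drop_col = corr[corr['target'] < 0.1].index
print(list(drop_col))
```

Note that `target` always survives the filter, since its correlation with itself is 1.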

Apply min-max normalization.

cols_numeric = list(data_all.columns)
cols_numeric.remove('oringin')

def scale_minmax(col):
    return (col - col.min()) / (col.max() - col.min())

scale_cols = [col for col in cols_numeric if col != 'target']
data_all[scale_cols] = data_all[scale_cols].apply(scale_minmax, axis=0)
data_all[scale_cols].describe()
V0 V1 V2 V3 V4 V5 V6 V7 V8 V10 ... V19 V20 V22 V24 V27 V29 V30 V31 V36 V37
count 4813.000000 4813.000000 4813.000000 4813.000000 4813.000000 4813.000000 4813.000000 4813.000000 4813.000000 4813.000000 ... 4813.000000 4813.000000 4813.000000 4813.000000 4813.000000 4813.000000 4813.000000 4813.000000 4813.000000 4813.000000
mean 0.694172 0.721357 0.602300 0.603139 0.523743 0.407246 0.748823 0.745740 0.715607 0.348518 ... 0.519158 0.456147 0.409789 0.356712 0.881401 0.388683 0.589459 0.792709 0.332385 0.545795
std 0.144198 0.131443 0.140628 0.152462 0.106430 0.186636 0.132560 0.132577 0.118105 0.134882 ... 0.140166 0.134083 0.163525 0.265512 0.128221 0.133475 0.130786 0.102976 0.127456 0.150356
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.626676 0.679416 0.514414 0.503888 0.478182 0.298432 0.683324 0.696938 0.664934 0.284327 ... 0.414436 0.370475 0.276206 0.040616 0.888575 0.292445 0.550092 0.761816 0.270584 0.445647
50% 0.729488 0.752497 0.617072 0.614270 0.535866 0.382419 0.774125 0.771974 0.742884 0.366469 ... 0.540294 0.447305 0.431562 0.381736 0.916015 0.375734 0.594428 0.815055 0.347056 0.539317
75% 0.790195 0.799553 0.700464 0.710474 0.585036 0.460246 0.842259 0.836405 0.790835 0.432965 ... 0.623125 0.522660 0.493213 0.574728 0.932555 0.471837 0.650798 0.852229 0.414861 0.643061
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 26 columns

4. Feature Engineering

import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import warnings
warnings.filterwarnings("ignore")

fcols = 6
frows = len(cols_numeric) - 1
plt.figure(figsize=(4*fcols, 4*frows))
i = 0
for var in cols_numeric:
    if var != 'target':
        dat = data_all[[var, 'target']].dropna()
        # Original distribution
        i += 1
        plt.subplot(frows, fcols, i)
        sns.distplot(dat[var], fit=stats.norm)
        plt.title(var + ' Original')
        plt.xlabel('')
        # Q-Q plot and skewness of the original variable
        i += 1
        plt.subplot(frows, fcols, i)
        _ = stats.probplot(dat[var], plot=plt)
        plt.title('skew=' + '{:.4f}'.format(stats.skew(dat[var])))
        plt.xlabel('')
        plt.ylabel('')
        # Scatter against the target
        i += 1
        plt.subplot(frows, fcols, i)
        plt.plot(dat[var], dat['target'], '.', alpha=0.5)
        plt.title('corr=' + '{:.2f}'.format(np.corrcoef(dat[var], dat['target'])[0][1]))
        # Box-Cox transformed distribution
        i += 1
        plt.subplot(frows, fcols, i)
        trans_var, lambda_var = stats.boxcox(dat[var].dropna() + 1)
        trans_var = scale_minmax(trans_var)
        sns.distplot(trans_var, fit=stats.norm)
        plt.title(var + ' Transformed')
        plt.xlabel('')
        # Q-Q plot and skewness after transformation
        i += 1
        plt.subplot(frows, fcols, i)
        _ = stats.probplot(trans_var, plot=plt)
        plt.title('skew=' + '{:.4f}'.format(stats.skew(trans_var)))
        plt.xlabel('')
        plt.ylabel('')
        # Scatter of the transformed variable against the target
        i += 1
        plt.subplot(frows, fcols, i)
        plt.plot(trans_var, dat['target'], '.', alpha=0.5)
        plt.title('corr=' + '{:.2f}'.format(np.corrcoef(trans_var, dat['target'])[0][1]))

# Apply the Box-Cox transformation to every feature column
cols_transform = data_all.columns[0:-2]
for col in cols_transform:
    # transform column
    data_all.loc[:, col], _ = stats.boxcox(data_all.loc[:, col] + 1)
print(data_all.target.describe())
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
sns.distplot(data_all.target.dropna() , fit=stats.norm);
plt.subplot(1,2,2)
_=stats.probplot(data_all.target.dropna(), plot=plt)
count    2888.000000
mean        0.126353
std         0.983966
min        -3.044000
25%        -0.350250
50%         0.313000
75%         0.793250
max         2.538000
Name: target, dtype: float64
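For reference, the Box-Cox transform applied above is, for a fixed lambda: y = (x^lambda - 1)/lambda when lambda != 0, and y = ln(x) when lambda = 0. The "+1" shift in the code is needed because Box-Cox requires strictly positive input. A minimal NumPy sketch of the formula itself (unlike `stats.boxcox`, which also estimates lambda by maximum likelihood):

```python
import numpy as np

def boxcox_fixed(x, lmbda):
    # Box-Cox transform for a given lambda; x must be strictly positive
    x = np.asarray(x, dtype=float)
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1) / lmbda

x = np.array([0.5, 1.0, 2.0, 4.0])
print(boxcox_fixed(x, 0))   # log transform
print(boxcox_fixed(x, 1))   # lambda=1 reduces to x - 1
```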

Exponentiate the target (1.5 raised to the target) to obtain a strictly positive transformed target, target1:

sp = data_train.target
data_train['target1'] = np.power(1.5, sp)
print(data_train.target1.describe())

plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
sns.distplot(data_train.target1.dropna(),fit=stats.norm);
plt.subplot(1,2,2)
_=stats.probplot(data_train.target1.dropna(), plot=plt)
count    2888.000000
mean        1.129957
std         0.394110
min         0.291057
25%         0.867609
50%         1.135315
75%         1.379382
max         2.798463
Name: target, dtype: float64

5. Model Building

Build the training and validation sets

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score,cross_val_predict,KFold
from sklearn.metrics import make_scorer,mean_squared_error
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.svm import LinearSVR, SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor,AdaBoostRegressor
from xgboost import XGBRegressor
from sklearn.preprocessing import PolynomialFeatures,MinMaxScaler,StandardScaler
def get_training_data():
    # extract training samples
    df_train = data_all[data_all["oringin"]=="train"]
    df_train["label"] = data_train.target1
    # split target and features
    y = df_train.target
    X = df_train.drop(["oringin","target","label"], axis=1)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=100)
    return X_train, X_valid, y_train, y_valid

# extract test data (without the target)
def get_test_data():
    df_test = data_all[data_all["oringin"]=="test"].reset_index(drop=True)
    return df_test.drop(["oringin","target"], axis=1)

RMSE and MSE scoring functions

def rmse(y_true, y_pred):
    diff = y_pred - y_true
    sum_sq = sum(diff**2)
    n = len(y_pred)
    return np.sqrt(sum_sq/n)

def mse(y_true, y_pred):
    return mean_squared_error(y_true, y_pred)

# make_scorer: greater_is_better is True (the default) for score functions and
# False for loss functions, in which case the scorer negates the value
rmse_scorer = make_scorer(rmse, greater_is_better=False)
mse_scorer = make_scorer(mse, greater_is_better=False)

Find and remove the outliers

def find_outliers(model, X, y, sigma=3):
    # predict y values using the model
    model.fit(X, y)
    y_pred = pd.Series(model.predict(X), index=y.index)
    # calculate residuals between the model prediction and true y values
    resid = y - y_pred
    mean_resid = resid.mean()
    std_resid = resid.std()
    # calculate z statistic, define outliers to be where |z|>sigma
    z = (resid - mean_resid) / std_resid
    outliers = z[abs(z) > sigma].index
    # print and plot the results
    print('R2=', model.score(X, y))
    print('rmse=', rmse(y, y_pred))
    print('mse=', mean_squared_error(y, y_pred))
    print('---------------------------------------')
    print('mean of residuals:', mean_resid)
    print('std of residuals:', std_resid)
    print('---------------------------------------')
    print(len(outliers), 'outliers:')
    print(outliers.tolist())

    plt.figure(figsize=(15,5))
    ax_131 = plt.subplot(1,3,1)
    plt.plot(y, y_pred, '.')
    plt.plot(y.loc[outliers], y_pred.loc[outliers], 'ro')
    plt.legend(['Accepted','Outlier'])
    plt.xlabel('y')
    plt.ylabel('y_pred')

    ax_132 = plt.subplot(1,3,2)
    plt.plot(y, y - y_pred, '.')
    plt.plot(y.loc[outliers], y.loc[outliers] - y_pred.loc[outliers], 'ro')
    plt.legend(['Accepted','Outlier'])
    plt.xlabel('y')
    plt.ylabel('y - y_pred')

    ax_133 = plt.subplot(1,3,3)
    z.plot.hist(bins=50, ax=ax_133)
    z.loc[outliers].plot.hist(color='r', bins=50, ax=ax_133)
    plt.legend(['Accepted','Outlier'])
    plt.xlabel('z')

    return outliers
X_train, X_valid, y_train, y_valid = get_training_data()
test = get_test_data()
outliers = find_outliers(Ridge(), X_train, y_train)
X_outliers=X_train.loc[outliers]
y_outliers=y_train.loc[outliers]
X_t=X_train.drop(outliers)
y_t=y_train.drop(outliers)
R2= 0.8782328985818861
rmse= 0.34678913877896644
mse= 0.12026270677505727
---------------------------------------
mean of residuals: -3.541067598581771e-16
std of residuals: 0.34687496705370685
---------------------------------------
25 outliers:
[2655, 1164, 2863, 1145, 2697, 2528, 1882, 2645, 691, 875, 1085, 1905, 1874, 2647, 2625, 884, 2696, 2668, 1310, 1901, 1458, 2769, 2002, 2669, 1972]
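Stripped of the plotting, the rule above is a z-score test on the residuals. A minimal sketch on synthetic residuals (invented numbers, not the run above, and using a looser |z| > 2 cutoff so the toy outlier is flagged):

```python
import numpy as np

resid = np.array([0.1, -0.2, 0.05, 5.0, -0.1, 0.15])  # one obvious outlier
z = (resid - resid.mean()) / resid.std(ddof=1)
outlier_idx = np.where(np.abs(z) > 2)[0]  # the text uses sigma=3 on real residuals
print(outlier_idx)  # [3]
```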

Model training

def get_trainning_data_omitoutliers():
    # training data with the outliers removed
    y = y_t.copy()
    X = X_t.copy()
    return X, y

def train_model(model, param_grid=[], X=[], y=[], splits=5, repeats=5):
    # get the data
    if len(y) == 0:
        X, y = get_trainning_data_omitoutliers()
    # cross-validation
    rkfold = RepeatedKFold(n_splits=splits, n_repeats=repeats)
    # grid search for the best parameters
    if len(param_grid) > 0:
        gsearch = GridSearchCV(model, param_grid, cv=rkfold,
                               scoring="neg_mean_squared_error",
                               verbose=1, return_train_score=True)
        # fit
        gsearch.fit(X, y)
        # best model
        model = gsearch.best_estimator_
        best_idx = gsearch.best_index_
        # cross-validation metrics
        grid_results = pd.DataFrame(gsearch.cv_results_)
        cv_mean = abs(grid_results.loc[best_idx, 'mean_test_score'])
        cv_std = grid_results.loc[best_idx, 'std_test_score']
    # no grid search
    else:
        grid_results = []
        cv_results = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=rkfold)
        cv_mean = abs(np.mean(cv_results))
        cv_std = np.std(cv_results)
    # combine the scores
    cv_score = pd.Series({'mean': cv_mean, 'std': cv_std})
    # predict
    y_pred = model.predict(X)
    # training-set statistics
    print('----------------------')
    print(model)
    print('----------------------')
    print('score=', model.score(X, y))
    print('rmse=', rmse(y, y_pred))
    print('mse=', mse(y, y_pred))
    print('cross_val: mean=', cv_mean, ', std=', cv_std)
    # residual analysis
    y_pred = pd.Series(y_pred, index=y.index)
    resid = y - y_pred
    mean_resid = resid.mean()
    std_resid = resid.std()
    z = (resid - mean_resid) / std_resid
    n_outliers = sum(abs(z) > 3)
    outliers = z[abs(z) > 3].index
    return model, cv_score, grid_results
# Containers for the trained models and their scores
opt_models = dict()
score_models = pd.DataFrame(columns=['mean','std'])
splits = 5
repeats = 5

model = 'Ridge'              # replaceable, see the models in case study 1
opt_models[model] = Ridge()  # replaceable, see the models in case study 1
alph_range = np.arange(0.25, 6, 0.25)
param_grid = {'alpha': alph_range}
opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid=param_grid,
                                                        splits=splits, repeats=repeats)
cv_score.name = model
score_models = pd.concat([score_models, cv_score.to_frame().T])

plt.figure()
plt.errorbar(alph_range, abs(grid_results['mean_test_score']),
             abs(grid_results['std_test_score'])/np.sqrt(splits*repeats))
plt.xlabel('alpha')
plt.ylabel('score')
Fitting 25 folds for each of 23 candidates, totalling 575 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
----------------------
Ridge(alpha=0.25)
----------------------
score= 0.8953736437747573
rmse= 0.31954612469809013
mse= 0.1021097258095675
cross_val: mean= 0.10592084354573693 , std= 0.006784203608364568
[Parallel(n_jobs=1)]: Done 575 out of 575 | elapsed:    4.2s finished
Text(0, 0.5, 'score')
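The effect of the `alpha` values scanned above can be seen from the ridge closed form w = (XᵀX + αI)⁻¹Xᵀy: larger alpha shrinks the coefficient vector toward zero. A small NumPy sketch on toy data (not the competition set):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.randn(50)

def ridge_fit(X, y, alpha):
    # closed-form ridge solution (no intercept, for simplicity)
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_small = ridge_fit(X, y, 0.25)
w_large = ridge_fit(X, y, 100.0)
print(np.linalg.norm(w_small), np.linalg.norm(w_large))  # the norm shrinks as alpha grows
```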

# Prediction function: average the predictions of the trained models
def model_predict(test_data, test_y=[]):
    i = 0
    y_predict_total = np.zeros((test_data.shape[0],))
    for model in opt_models.keys():
        if model != "LinearSVR" and model != "KNeighbors":
            y_predict = opt_models[model].predict(test_data)
            y_predict_total += y_predict
            i += 1
            if len(test_y) > 0:
                print("{}_mse:".format(model), mean_squared_error(y_predict, test_y))
    y_predict_mean = np.round(y_predict_total / i, 6)
    if len(test_y) > 0:
        print("mean_mse:", mean_squared_error(y_predict_mean, test_y))
    else:
        y_predict_mean = pd.Series(y_predict_mean)
    return y_predict_mean
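Stripped of the bookkeeping, `model_predict` is a simple mean ensemble; a sketch with made-up per-model predictions:

```python
import numpy as np

# Hypothetical predictions from three models on four test rows
preds = np.array([
    [0.2, 1.0, -0.5, 0.8],
    [0.3, 0.9, -0.4, 1.0],
    [0.1, 1.1, -0.6, 0.9],
])
ensemble = np.round(preds.mean(axis=0), 6)
print(ensemble)  # [ 0.2  1.  -0.5  0.9]
```

Averaging tends to cancel the uncorrelated errors of the individual models, which is why the ensemble often beats any single member.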

Predicting and saving the results

y_ = model_predict(test)
y_.to_csv('predict.txt',header = None,index = False)
