The previous post, "Tianchi Learning Competition: Industrial Steam Prediction 3 — Model Training", trained several machine learning models; this post covers methods for validating and evaluating them.

Contents

  • 1 Model evaluation methods
  • 2 Model tuning
  • 3 Validation and tuning for the competition
    • 3.1 Overfitting and underfitting
    • 3.2 Regularization
    • 3.3 Cross-validation
    • 3.4 Hyperparameter search space and tuning
    • 3.5 Learning curves and validation curves

1 Model evaluation methods

1 Underfitting and overfitting

2 Generalization and regularization

3 Regression metrics and how to call them

(1) Mean absolute error (MAE)

from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test,y_pred)

(2) Mean squared error (MSE)

from sklearn.metrics import mean_squared_error
mean_squared_error(y_test,y_pred)

(3) Root mean squared error (RMSE)

import numpy as np
from sklearn.metrics import mean_squared_error
Pred_Error=mean_squared_error(y_test,y_pred)
np.sqrt(Pred_Error)

(4) R-squared (R²)

from sklearn.metrics import r2_score
r2_score(y_test,y_pred)
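As a quick check of what each of the four metric calls above returns, they can be run together on a small hand-made example (the arrays here are illustrative, not competition data):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_test = np.array([3.0, -0.5, 2.0, 7.0])   # made-up "true" values
y_pred = np.array([2.5, 0.0, 2.0, 8.0])    # made-up predictions

mae = mean_absolute_error(y_test, y_pred)   # mean of |y - y_hat|
mse = mean_squared_error(y_test, y_pred)    # mean of (y - y_hat)^2
rmse = np.sqrt(mse)                         # same units as the target
r2 = r2_score(y_test, y_pred)               # 1 - SS_res / SS_tot

print(mae, mse)   # 0.5 0.375
```

MAE and RMSE are in the units of the target, while R² is unitless and approaches 1 for a good fit.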

4 Cross-validation

(1) Simple cross-validation (hold-out):

from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(iris.data,iris.target,test_size=.4,random_state=0)

(2) K-fold cross-validation:

from sklearn.model_selection import KFold
kf=KFold(n_splits=10)

(3) Leave-one-out cross-validation:

from sklearn.model_selection import LeaveOneOut
loo=LeaveOneOut()

(4) Leave-P-out cross-validation:

from sklearn.model_selection import LeavePOut
lpo=LeavePOut(p=5)
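The difference between these splitters is easiest to see by counting how many train/test splits each one generates on a toy dataset (the 6-sample array below is illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, LeavePOut

X = np.arange(12).reshape(6, 2)   # 6 samples, 2 features

kf = KFold(n_splits=3)    # 3 folds of 2 samples each
loo = LeaveOneOut()       # one split per held-out sample
lpo = LeavePOut(p=2)      # one split per held-out pair

print(kf.get_n_splits(X))    # 3
print(loo.get_n_splits(X))   # 6
print(lpo.get_n_splits(X))   # C(6, 2) = 15
```

Leave-P-out grows combinatorially with the sample count, which is why it is usually truncated to a few splits in practice.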

2 Model tuning

1 Tuning

2 Grid search

Train a model for every possible combination of parameter values, then keep the combination that scores best.

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

iris=load_iris()
X_train,X_test,Y_train,Y_test=train_test_split(iris.data,iris.target,random_state=0)
print("size of training set:{} size of testing set:{}".format(X_train.shape[0],X_test.shape[0]))

best_score=0
for gamma in [0.001,0.01,0.1,1,10,100]:
    for C in [0.001,0.01,0.1,1,10,100]:
        svm=SVC(gamma=gamma,C=C)
        svm.fit(X_train,Y_train)
        score=svm.score(X_test,Y_test)
        if score>best_score:
            best_score=score
            best_parameters={'gamma':gamma,'C':C}
print("Best score:{:.2f}".format(best_score))
print("Best parameters:{}".format(best_parameters))

The output shows that the combination gamma=0.001 and C=100 scores best:

size of training set:112 size of testing set:38
Best score:0.97
Best parameters:{'gamma': 0.001, 'C': 100}
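The hand-written double loop above is exactly what sklearn's GridSearchCV automates (it reappears in section 3.4); a minimal sketch on the same iris split:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

param_grid = {'gamma': [0.001, 0.01, 0.1, 1, 10, 100],
              'C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(SVC(), param_grid, cv=5)   # scores candidates by 5-fold CV
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)   # mean cross-validated accuracy of the best combination
```

Unlike the loop above, GridSearchCV selects parameters by cross-validation on the training set, so X_test stays untouched until a final evaluation; this avoids tuning to the test split.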

3 Learning curves

3 Validation and tuning for the competition

3.1 Overfitting and underfitting

1 Base code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn import preprocessing
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split   # data splitting
from sklearn.metrics import mean_squared_error         # evaluation metric
from sklearn.linear_model import LinearRegression      # linear model
from sklearn.neighbors import KNeighborsRegressor      # k-nearest-neighbor regressor
from sklearn.tree import DecisionTreeRegressor         # decision tree regressor
from sklearn.ensemble import RandomForestRegressor     # random forest regressor
from lightgbm import LGBMRegressor                     # LightGBM regressor
from sklearn.svm import SVR                            # support vector regressor
from sklearn.linear_model import SGDRegressor

# read the data
train_data_file = "./zhengqi_train.txt"
test_data_file = "./zhengqi_test.txt"
train_data=pd.read_csv(train_data_file,sep='\t',encoding='utf-8')
test_data=pd.read_csv(test_data_file,sep='\t',encoding='utf-8')

# min-max normalization, fitted on the training features only
features_columns=[col for col in train_data.columns if col not in ['target']]
min_max_scaler=preprocessing.MinMaxScaler()
min_max_scaler=min_max_scaler.fit(train_data[features_columns])
train_data_scaler=min_max_scaler.transform(train_data[features_columns])
test_data_scaler=min_max_scaler.transform(test_data[features_columns])
train_data_scaler=pd.DataFrame(train_data_scaler)
train_data_scaler.columns=features_columns
test_data_scaler=pd.DataFrame(test_data_scaler)
test_data_scaler.columns=features_columns
train_data_scaler['target']=train_data['target']

# PCA, keeping enough components to explain 90% of the variance (16 here)
pca=PCA(n_components=0.9)
new_train_pca_16=pca.fit_transform(train_data_scaler.iloc[:,0:-1])
new_test_pca_16=pca.transform(test_data_scaler)
new_train_pca_16=pd.DataFrame(new_train_pca_16)
new_test_pca_16=pd.DataFrame(new_test_pca_16)
new_train_pca_16['target']=train_data_scaler['target']
new_train_pca_16=new_train_pca_16.fillna(0)

# split the data: 80% training, 20% validation
train=new_train_pca_16.drop(columns=['target'])   # features only, so the label is not leaked into the inputs
target=new_train_pca_16['target']
train_data,test_data,train_target,test_target=train_test_split(train,target,test_size=0.2,random_state=0)

2 Underfitting

clf=SGDRegressor(max_iter=500,tol=1e-2)
clf.fit(train_data,train_target)
score_train=mean_squared_error(train_target,clf.predict(train_data))
score_test=mean_squared_error(test_target,clf.predict(test_data))
print("SGDRegressor train MSE:",score_train)
print("SGDRegressor test MSE:",score_test)

3 Overfitting

from sklearn.preprocessing import PolynomialFeatures
poly=PolynomialFeatures(5)
train_data_poly=poly.fit_transform(train_data)
test_data_poly=poly.transform(test_data)   # transform only; do not refit on the test set
clf=SGDRegressor(max_iter=1000,tol=1e-3)
clf.fit(train_data_poly,train_target)
score_train=mean_squared_error(train_target,clf.predict(train_data_poly))
score_test=mean_squared_error(test_target,clf.predict(test_data_poly))
print("SGDRegressor train MSE:",score_train)
print("SGDRegressor test MSE:",score_test)

4 Normal fitting

from sklearn.preprocessing import PolynomialFeatures
poly=PolynomialFeatures(3)
train_data_poly=poly.fit_transform(train_data)
test_data_poly=poly.transform(test_data)   # transform only; do not refit on the test set
clf=SGDRegressor(max_iter=1000,tol=1e-3)
clf.fit(train_data_poly,train_target)
score_train=mean_squared_error(train_target,clf.predict(train_data_poly))
score_test=mean_squared_error(test_target,clf.predict(test_data_poly))
print("SGDRegressor train MSE:",score_train)
print("SGDRegressor test MSE:",score_test)
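The under/normal/over-fitting pattern traced by the three runs above can also be reproduced on synthetic data, where the true function is known. This sketch (my own illustration, not competition data) compares polynomial degrees 1, 3, and 15 on a noisy sine curve:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(200)   # sine curve plus noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for degree in (1, 3, 15):
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(X_tr), y_tr)
    mse_tr = mean_squared_error(y_tr, model.predict(poly.transform(X_tr)))
    mse_te = mean_squared_error(y_te, model.predict(poly.transform(X_te)))
    results[degree] = (mse_tr, mse_te)
    print(degree, mse_tr, mse_te)
```

Degree 1 underfits (train and test error are both high), degree 3 tracks the sine closely, and very high degrees may start chasing the noise, with the train error dropping while the test error stops improving.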

3.2 Regularization

1 L2 regularization (ridge)

Add a regularization term to the normal-fitting code above:

from sklearn.preprocessing import PolynomialFeatures
poly=PolynomialFeatures(3)
train_data_poly=poly.fit_transform(train_data)
test_data_poly=poly.transform(test_data)   # transform only; do not refit on the test set
clf=SGDRegressor(max_iter=1000,tol=1e-3,penalty='l2',alpha=0.0001)   # sklearn expects lowercase 'l2'
clf.fit(train_data_poly,train_target)
score_train=mean_squared_error(train_target,clf.predict(train_data_poly))
score_test=mean_squared_error(test_target,clf.predict(test_data_poly))
print("SGDRegressor train MSE:",score_train)
print("SGDRegressor test MSE:",score_test)

2 L1 regularization (lasso)

Same as above, with an L1 penalty:

from sklearn.preprocessing import PolynomialFeatures
poly=PolynomialFeatures(3)
train_data_poly=poly.fit_transform(train_data)
test_data_poly=poly.transform(test_data)   # transform only; do not refit on the test set
clf=SGDRegressor(max_iter=1000,tol=1e-3,penalty='l1',alpha=0.0001)   # sklearn expects lowercase 'l1'
clf.fit(train_data_poly,train_target)
score_train=mean_squared_error(train_target,clf.predict(train_data_poly))
score_test=mean_squared_error(test_target,clf.predict(test_data_poly))
print("SGDRegressor train MSE:",score_train)
print("SGDRegressor test MSE:",score_test)

3 ElasticNet: a weighted combination of the L1 and L2 penalties

from sklearn.preprocessing import PolynomialFeatures
poly=PolynomialFeatures(3)
train_data_poly=poly.fit_transform(train_data)
test_data_poly=poly.transform(test_data)   # transform only; do not refit on the test set
clf=SGDRegressor(max_iter=1000,tol=1e-3,penalty='elasticnet',l1_ratio=0.9,alpha=.0001)
clf.fit(train_data_poly,train_target)
score_train=mean_squared_error(train_target,clf.predict(train_data_poly))
score_test=mean_squared_error(test_target,clf.predict(test_data_poly))
print("SGDRegressor train MSE:",score_train)
print("SGDRegressor test MSE:",score_test)
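The practical difference between the penalties shows up in the coefficients: L1 tends to zero out irrelevant features, while L2 only shrinks them. A small synthetic check (illustrative data in which only the first of 20 features carries signal):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = rng.randn(500, 20)
y = 2.0 * X[:, 0] + 0.05 * rng.randn(500)   # only feature 0 matters

l2 = SGDRegressor(penalty='l2', alpha=0.1, max_iter=2000, tol=1e-3, random_state=0).fit(X, y)
l1 = SGDRegressor(penalty='l1', alpha=0.1, max_iter=2000, tol=1e-3, random_state=0).fit(X, y)

# mean magnitude of the 19 irrelevant coefficients under each penalty
print(np.abs(l2.coef_[1:]).mean())
print(np.abs(l1.coef_[1:]).mean())   # typically much smaller; many are driven to zero
```

This sparsity is why L1 doubles as a feature-selection tool, while L2 is usually the safer default when all features are believed to matter a little.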

3.3 Cross-validation

1 Simple cross-validation (the 80/20 hold-out split already made above with train_test_split)

2 K-fold cross-validation

from sklearn.model_selection import KFold

kf=KFold(n_splits=5)
for k,(train_index,test_index) in enumerate(kf.split(train)):
    train_data,test_data,train_target,test_target=train.values[train_index],\
        train.values[test_index],target[train_index],target[test_index]
    clf=SGDRegressor(max_iter=1000,tol=1e-3)
    clf.fit(train_data,train_target)
    score_train=mean_squared_error(train_target,clf.predict(train_data))
    score_test=mean_squared_error(test_target,clf.predict(test_data))
    print("fold",k,"SGDRegressor train MSE:",score_train)
    print("fold",k,"SGDRegressor test MSE:",score_test,"\n")

3 Leave-one-out cross-validation

from sklearn.model_selection import LeaveOneOut

loo=LeaveOneOut()
for k,(train_index,test_index) in enumerate(loo.split(train)):
    train_data,test_data,train_target,test_target=train.values[train_index],\
        train.values[test_index],target[train_index],target[test_index]
    clf=SGDRegressor(max_iter=1000,tol=1e-3)
    clf.fit(train_data,train_target)
    score_train=mean_squared_error(train_target,clf.predict(train_data))
    score_test=mean_squared_error(test_target,clf.predict(test_data))
    print("sample",k,"SGDRegressor train MSE:",score_train)
    print("sample",k,"SGDRegressor test MSE:",score_test,"\n")
    if k>=9:   # stop after the first 10 of the n splits
        break

4 Leave-P-out cross-validation

from sklearn.model_selection import LeavePOut

lpo=LeavePOut(p=10)
for k,(train_index,test_index) in enumerate(lpo.split(train)):
    train_data,test_data,train_target,test_target=train.values[train_index],\
        train.values[test_index],target[train_index],target[test_index]
    clf=SGDRegressor(max_iter=1000,tol=1e-3)
    clf.fit(train_data,train_target)
    score_train=mean_squared_error(train_target,clf.predict(train_data))
    score_test=mean_squared_error(test_target,clf.predict(test_data))
    print("split",k,"(10 held out)","SGDRegressor train MSE:",score_train)
    print("split",k,"(10 held out)","SGDRegressor test MSE:",score_test,"\n")
    if k>=9:   # stop after the first 10 of the C(n,10) splits
        break
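The three loops above differ only in the splitter object, so the pattern can be factored into a single helper; a sketch on synthetic data (the evaluate_cv helper and toy arrays are my own, not from the original):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, LeaveOneOut

def evaluate_cv(splitter, X, y, max_splits=None):
    """Fit SGDRegressor on each split; return a list of (train_mse, test_mse)."""
    scores = []
    for k, (tr, te) in enumerate(splitter.split(X)):
        if max_splits is not None and k >= max_splits:
            break   # leave-one-out / leave-P-out generate huge numbers of splits
        clf = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)
        clf.fit(X[tr], y[tr])
        scores.append((mean_squared_error(y[tr], clf.predict(X[tr])),
                       mean_squared_error(y[te], clf.predict(X[te]))))
    return scores

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(100)

kfold_scores = evaluate_cv(KFold(n_splits=5), X, y)
loo_scores = evaluate_cv(LeaveOneOut(), X, y, max_splits=10)
print(len(kfold_scores), len(loo_scores))   # 5 10
```

Passing any splitter with the same `split` interface (KFold, LeaveOneOut, LeavePOut, ShuffleSplit) then reuses the same evaluation code.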

3.4 Hyperparameter search space and tuning

1 Exhaustive grid search

# train a random forest, tuned by exhaustive grid search
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

rf=RandomForestRegressor()
parameters={'n_estimators':[50,100,200],'max_depth':[1,2,3]}
clf=GridSearchCV(rf,parameters,cv=5)
clf.fit(train_data,train_target)
score_test=mean_squared_error(test_target,clf.predict(test_data))
print("RandomForestRegressor GridSearchCV test MSE:",score_test)
print(sorted(clf.cv_results_.keys()))   # fit times and validation metrics for every candidate

2 Randomized search

# train a random forest, tuned by randomized parameter search
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

rf=RandomForestRegressor()
parameters={'n_estimators':[50,100,200,300],'max_depth':[1,2,3,4,5]}
clf=RandomizedSearchCV(rf,parameters,cv=5)
clf.fit(train_data,train_target)
print('Best parameters found are:',clf.best_params_)
score_test=mean_squared_error(test_target,clf.predict(test_data))
print("RandomForestRegressor RandomizedSearchCV test MSE:",score_test)
print(sorted(clf.cv_results_.keys()))   # fit times and validation metrics for every candidate

3 LightGBM tuning

import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

clf=lgb.LGBMRegressor(num_leaves=31)
parameters={'learning_rate':[0.01,0.1,1],'n_estimators':[20,40]}
clf=GridSearchCV(clf,parameters,cv=5)   # the original omitted this wrap, leaving parameters unused
clf.fit(train_data,train_target)
score_test=mean_squared_error(test_target,clf.predict(test_data))
print("LGBMRegressor GridSearchCV test MSE:",score_test)

3.5 Learning curves and validation curves

1 Learning curves

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import learning_curve

train_data_file = "./zhengqi_train.txt"
test_data_file = "./zhengqi_test.txt"
train_data=pd.read_csv(train_data_file,sep='\t',encoding='utf-8')
test_data=pd.read_csv(test_data_file,sep='\t',encoding='utf-8')

plt.figure(figsize=(18,10),dpi=150)

def plot_learning_curve(estimator,title,x,y,ylim=None,cv=None,n_jobs=1,
                        train_sizes=np.linspace(.1,1.0,5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes,train_scores,test_scores=learning_curve(
        estimator,x,y,cv=cv,n_jobs=n_jobs,train_sizes=train_sizes)
    train_scores_mean=np.mean(train_scores,axis=1)
    train_scores_std=np.std(train_scores,axis=1)
    test_scores_mean=np.mean(test_scores,axis=1)
    test_scores_std=np.std(test_scores,axis=1)
    plt.grid()
    plt.fill_between(train_sizes,train_scores_mean-train_scores_std,
                     train_scores_mean+train_scores_std,alpha=0.1,color='r')
    plt.fill_between(train_sizes,test_scores_mean-test_scores_std,
                     test_scores_mean+test_scores_std,alpha=0.1,color='g')
    plt.plot(train_sizes,train_scores_mean,'o-',color='r',label="training score")
    plt.plot(train_sizes,test_scores_mean,'o-',color='g',label="cross-validation score")
    plt.legend(loc="best")
    return plt

x=train_data[test_data.columns].values
y=train_data['target'].values
title="SGDRegressor"
cv=ShuffleSplit(n_splits=100,test_size=0.2,random_state=0)
estimator=SGDRegressor()
plot_learning_curve(estimator,title,x,y,ylim=(0.7,1.01),cv=cv,n_jobs=-1).show()

2 Validation curves

Plot validation curves (score versus the regularization strength alpha) for an SGDRegressor trained on the data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import validation_curve

train_data_file = "./zhengqi_train.txt"
test_data_file = "./zhengqi_test.txt"
train_data=pd.read_csv(train_data_file,sep='\t',encoding='utf-8')
test_data=pd.read_csv(test_data_file,sep='\t',encoding='utf-8')
x=train_data[test_data.columns].values
y=train_data['target'].values

param_range=[0.1,0.01,0.001,0.0001,0.00001,0.000001]
train_scores,test_scores=validation_curve(
    SGDRegressor(max_iter=1000,tol=1e-3,penalty='l1'),x,y,
    param_name='alpha',param_range=param_range,cv=10,scoring='r2',n_jobs=1)
train_scores_mean=np.mean(train_scores,axis=1)
train_scores_std=np.std(train_scores,axis=1)
test_scores_mean=np.mean(test_scores,axis=1)
test_scores_std=np.std(test_scores,axis=1)

plt.title("Validation Curve with SGDRegressor")
plt.xlabel("alpha")
plt.ylabel("Score")
plt.ylim(0.0,1.1)
plt.semilogx(param_range,train_scores_mean,label="Training score",color='r')
plt.fill_between(param_range,train_scores_mean-train_scores_std,
                 train_scores_mean+train_scores_std,alpha=0.2,color='r')
plt.semilogx(param_range,test_scores_mean,label="Cross-validation score",color='g')
plt.fill_between(param_range,test_scores_mean-test_scores_std,
                 test_scores_mean+test_scores_std,alpha=0.2,color='g')   # the original reused the training scores here
plt.legend(loc="best")
plt.show()

Next up: "Tianchi Learning Competition: Industrial Steam Prediction 5 — Feature Optimization".
