Tianchi Learning Competition: Industrial Steam Volume Prediction 4 — Model Validation
The previous post, "Tianchi Learning Competition: Industrial Steam Volume Prediction 3 — Model Training", trained several machine-learning models. This post covers methods for validating and evaluating them.
Contents
- 1 Model evaluation methods
- 2 Model tuning
- 3 Model validation and tuning for the competition
- 3.1 Overfitting and underfitting
- 3.2 Regularization
- 3.3 Cross-validation
- 3.4 Hyperparameter space and tuning
- 3.5 Learning curves and validation curves
1 Model Evaluation Methods
1 Underfitting and overfitting
2 Generalization and regularization
3 Regression metrics and how to call them
(1) Mean absolute error (MAE)
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test,y_pred)
(2) Mean squared error (MSE)
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test,y_pred)
(3) Root mean squared error (RMSE)
import numpy as np
from sklearn.metrics import mean_squared_error
Pred_Error=mean_squared_error(y_test,y_pred)
np.sqrt(Pred_Error)
(4) R-squared (R²)
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)
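As a self-contained check of the four metrics above, here is a small sketch; the toy arrays are made up for illustration (they are the example arrays from scikit-learn's metric docs):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy ground truth and predictions, purely for illustration
y_test = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_test, y_pred)   # mean of |error|
mse = mean_squared_error(y_test, y_pred)    # mean of squared error
rmse = np.sqrt(mse)                         # square root of MSE
r2 = r2_score(y_test, y_pred)               # 1 - SS_res / SS_tot

print(mae, mse, rmse, r2)  # 0.5 0.375 0.612... 0.948...
```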
4 Cross-validation
(1) Simple (hold-out) cross-validation:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(iris.data,iris.target,test_size=.4,random_state=0)
(2) K-fold cross-validation:
from sklearn.model_selection import KFold
kf=KFold(n_splits=10)
(3) Leave-one-out cross-validation:
from sklearn.model_selection import LeaveOneOut
loo=LeaveOneOut()
(4) Leave-P-out cross-validation:
from sklearn.model_selection import LeavePOut
lpo=LeavePOut(p=5)
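To see what each splitter actually produces, this sketch (the six-sample array is made up) counts the (train, test) index pairs each one generates:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, LeavePOut

X = np.arange(6).reshape(6, 1)  # six toy samples

kf = KFold(n_splits=3)   # 3 folds -> 3 (train, test) index pairs
loo = LeaveOneOut()      # one split per sample -> 6 pairs
lpo = LeavePOut(p=2)     # every size-2 test set -> C(6, 2) = 15 pairs

print(len(list(kf.split(X))))   # 3
print(len(list(loo.split(X))))  # 6
print(len(list(lpo.split(X))))  # 15
```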
2 Model Tuning
1 Tuning
2 Grid search
Grid search trains a model for every possible parameter combination and keeps the best one.
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, random_state=0)
print("size of training set:{} size of testing set:{}".format(X_train.shape[0], X_test.shape[0]))

best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for c in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=c)
        svm.fit(X_train, Y_train)
        score = svm.score(X_test, Y_test)
        if score > best_score:
            best_score = score
            best_parameters = {'gamma': gamma, 'C': c}
print("Best score:{:.2f}".format(best_score))
print('Best parameters:{}'.format(best_parameters))
The output shows that gamma=0.001 combined with C=100 performs best:
size of training set:112 size of testing set:38
Best score:0.97
Best parameters:{'gamma': 0.001, 'C': 100}
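One caveat: the loop above picks gamma and C by their score on the test set itself, so that score is optimistic. A sketch of the more common pattern, cross-validating the search on the training set and touching the test set only once (GridSearchCV appears again in section 3.4):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target,
                                                    random_state=0)
param_grid = {'gamma': [0.001, 0.01, 0.1, 1, 10, 100],
              'C': [0.001, 0.01, 0.1, 1, 10, 100]}
# Parameters are chosen by 5-fold CV on the training set; the held-out
# test set is used only for the final estimate.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, Y_train)
print(search.best_params_)
print(search.score(X_test, Y_test))
```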
3 Learning curves
3 Model Validation and Tuning for the Competition
3.1 Overfitting and underfitting
1 Base code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn import preprocessing
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split  # data splitting
from sklearn.metrics import mean_squared_error  # evaluation metric
from sklearn.linear_model import LinearRegression  # linear model
from sklearn.neighbors import KNeighborsRegressor  # k-nearest-neighbors regressor
from sklearn.tree import DecisionTreeRegressor  # decision-tree regressor
from sklearn.ensemble import RandomForestRegressor  # random-forest regressor
from lightgbm import LGBMRegressor  # LightGBM regressor
from sklearn.svm import SVR  # support vector regressor
from sklearn.linear_model import SGDRegressor

# Load the data
train_data_file = "./zhengqi_train.txt"
test_data_file = "./zhengqi_test.txt"
train_data = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
test_data = pd.read_csv(test_data_file, sep='\t', encoding='utf-8')

# Min-max normalization
features_columns = [col for col in train_data.columns if col not in ['target']]
min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler = min_max_scaler.fit(train_data[features_columns])
train_data_scaler = min_max_scaler.transform(train_data[features_columns])
test_data_scaler = min_max_scaler.transform(test_data[features_columns])
train_data_scaler = pd.DataFrame(train_data_scaler)
train_data_scaler.columns = features_columns
test_data_scaler = pd.DataFrame(test_data_scaler)
test_data_scaler.columns = features_columns
train_data_scaler['target'] = train_data['target']

# PCA dimensionality reduction, keeping 90% of the variance (16 components)
pca = PCA(n_components=0.9)
new_train_pca_16 = pca.fit_transform(train_data_scaler.iloc[:, 0:-1])
new_test_pca_16 = pca.transform(test_data_scaler)
new_train_pca_16 = pd.DataFrame(new_train_pca_16)
new_test_pca_16 = pd.DataFrame(new_test_pca_16)
new_train_pca_16['target'] = train_data_scaler['target']
new_train_pca_16 = new_train_pca_16.fillna(0)

# Split the data: 80% training, 20% validation
train = new_train_pca_16[new_train_pca_16.columns]
target = new_train_pca_16['target']
train_data, test_data, train_target, test_target = train_test_split(
    train, target, test_size=0.2, random_state=0)
2 Underfitting
clf=SGDRegressor(max_iter=500,tol=1e-2)
clf.fit(train_data,train_target)
score_train=mean_squared_error(train_target,clf.predict(train_data))
score_test=mean_squared_error(test_target,clf.predict(test_data))
print("SGDRegressor train MSE:",score_train)
print("SGDRegressor test MSE:",score_test)
3 Overfitting
from sklearn.preprocessing import PolynomialFeatures
poly=PolynomialFeatures(5)
train_data_poly=poly.fit_transform(train_data)
test_data_poly=poly.transform(test_data)
clf=SGDRegressor(max_iter=1000,tol=1e-3)
clf.fit(train_data_poly,train_target)
score_train=mean_squared_error(train_target,clf.predict(train_data_poly))
score_test=mean_squared_error(test_target,clf.predict(test_data_poly))
print("SGDRegressor train MSE:",score_train)
print("SGDRegressor test MSE:",score_test)
4 Normal fit
from sklearn.preprocessing import PolynomialFeatures
poly=PolynomialFeatures(3)
train_data_poly=poly.fit_transform(train_data)
test_data_poly=poly.transform(test_data)
clf=SGDRegressor(max_iter=1000,tol=1e-3)
clf.fit(train_data_poly,train_target)
score_train=mean_squared_error(train_target,clf.predict(train_data_poly))
score_test=mean_squared_error(test_target,clf.predict(test_data_poly))
print("SGDRegressor train MSE:",score_train)
print("SGDRegressor test MSE:",score_test)
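The three settings above (degrees 5 and 3, plus a linear baseline) can be compared in one loop. This is an illustrative sketch on made-up synthetic data — a noisy sine — and it uses LinearRegression rather than SGDRegressor so the comparison is deterministic:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
x = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(scale=0.3, size=200)  # noisy sine
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.2, random_state=0)

results = {}
for degree in (1, 3, 5):
    poly = PolynomialFeatures(degree)
    clf = LinearRegression()
    clf.fit(poly.fit_transform(x_tr), y_tr)
    mse_tr = mean_squared_error(y_tr, clf.predict(poly.transform(x_tr)))
    mse_te = mean_squared_error(y_te, clf.predict(poly.transform(x_te)))
    results[degree] = (mse_tr, mse_te)
    print(degree, mse_tr, mse_te)
# degree 1 underfits (both errors stay high); degree 5 tracks the sine closely
```

Raising the degree always lowers the training MSE (the feature sets are nested), so the test MSE is the number to watch.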
3.2 Model Regularization
1 L2 regularization (ridge)
Add a regularization penalty to the well-fitted code above:
from sklearn.preprocessing import PolynomialFeatures
poly=PolynomialFeatures(3)
train_data_poly=poly.fit_transform(train_data)
test_data_poly=poly.transform(test_data)
clf=SGDRegressor(max_iter=1000,tol=1e-3,penalty='l2',alpha=0.0001)
clf.fit(train_data_poly,train_target)
score_train=mean_squared_error(train_target,clf.predict(train_data_poly))
score_test=mean_squared_error(test_target,clf.predict(test_data_poly))
print("SGDRegressor train MSE:",score_train)
print("SGDRegressor test MSE:",score_test)
2 L1 regularization (lasso)
Same as above, with the l1 penalty:
from sklearn.preprocessing import PolynomialFeatures
poly=PolynomialFeatures(3)
train_data_poly=poly.fit_transform(train_data)
test_data_poly=poly.transform(test_data)
clf=SGDRegressor(max_iter=1000,tol=1e-3,penalty='l1',alpha=0.0001)
clf.fit(train_data_poly,train_target)
score_train=mean_squared_error(train_target,clf.predict(train_data_poly))
score_test=mean_squared_error(test_target,clf.predict(test_data_poly))
print("SGDRegressor train MSE:",score_train)
print("SGDRegressor test MSE:",score_test)
3 ElasticNet: a weighted combination of the L1 and L2 penalties
from sklearn.preprocessing import PolynomialFeatures
poly=PolynomialFeatures(3)
train_data_poly=poly.fit_transform(train_data)
test_data_poly=poly.transform(test_data)
clf=SGDRegressor(max_iter=1000,tol=1e-3,penalty='elasticnet',l1_ratio=0.9,alpha=0.0001)
clf.fit(train_data_poly,train_target)
score_train=mean_squared_error(train_target,clf.predict(train_data_poly))
score_test=mean_squared_error(test_target,clf.predict(test_data_poly))
print("SGDRegressor train MSE:",score_train)
print("SGDRegressor test MSE:",score_test)
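The penalties differ most visibly in the coefficients they produce: l1 drives many weights exactly to zero, while l2 only shrinks them. SGDRegressor's three penalty options correspond to Lasso, Ridge, and ElasticNet; this sketch uses those estimators directly (on made-up data with only 3 informative features out of 20) so the zeros are exact rather than stochastic:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 20))
# Only the first three of twenty features carry signal
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=500)

zeros = {}
for name, model in [('l2 (Ridge)', Ridge(alpha=0.1)),
                    ('l1 (Lasso)', Lasso(alpha=0.1)),
                    ('elasticnet', ElasticNet(alpha=0.1, l1_ratio=0.9))]:
    model.fit(X, y)
    zeros[name] = int(np.sum(model.coef_ == 0))
    print(name, 'zero coefficients:', zeros[name])
# l1 zeroes out most of the 17 irrelevant features; l2 keeps every
# coefficient nonzero, merely shrunk toward zero
```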
3.3 Model Cross-Validation
1 Simple (hold-out) cross-validation
2 K-fold cross-validation
from sklearn.model_selection import KFold
kf=KFold(n_splits=5)
for k, (train_index, test_index) in enumerate(kf.split(train)):
    train_data, test_data, train_target, test_target = (
        train.values[train_index], train.values[test_index],
        target[train_index], target[test_index])
    clf = SGDRegressor(max_iter=1000, tol=1e-3)
    clf.fit(train_data, train_target)
    score_train = mean_squared_error(train_target, clf.predict(train_data))
    score_test = mean_squared_error(test_target, clf.predict(test_data))
    print(k, "fold", "SGDRegressor train MSE:", score_train)
    print(k, "fold", "SGDRegressor test MSE:", score_test, "\n")
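The manual K-fold loop can be condensed into a single cross_val_score call. This sketch uses synthetic regression data (make_regression stands in for the PCA-reduced competition features):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score, KFold

# Synthetic stand-in for the competition data
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

clf = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)
# scoring='neg_mean_squared_error' returns -MSE, so higher is better
scores = cross_val_score(clf, X, y, cv=KFold(n_splits=5),
                         scoring='neg_mean_squared_error')
print(-scores)         # per-fold test MSE
print(-scores.mean())  # average test MSE over the 5 folds
```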
3 Leave-one-out cross-validation
from sklearn.model_selection import LeaveOneOut
loo=LeaveOneOut()
for k, (train_index, test_index) in enumerate(loo.split(train)):
    train_data, test_data, train_target, test_target = (
        train.values[train_index], train.values[test_index],
        target[train_index], target[test_index])
    clf = SGDRegressor(max_iter=1000, tol=1e-3)
    clf.fit(train_data, train_target)
    score_train = mean_squared_error(train_target, clf.predict(train_data))
    score_test = mean_squared_error(test_target, clf.predict(test_data))
    print(k, "SGDRegressor train MSE:", score_train)
    print(k, "SGDRegressor test MSE:", score_test, "\n")
    if k >= 9:  # leave-one-out would otherwise run once per sample
        break
4 Leave-P-out cross-validation
from sklearn.model_selection import LeavePOut
lpo=LeavePOut(p=10)
for k, (train_index, test_index) in enumerate(lpo.split(train)):
    train_data, test_data, train_target, test_target = (
        train.values[train_index], train.values[test_index],
        target[train_index], target[test_index])
    clf = SGDRegressor(max_iter=1000, tol=1e-3)
    clf.fit(train_data, train_target)
    score_train = mean_squared_error(train_target, clf.predict(train_data))
    score_test = mean_squared_error(test_target, clf.predict(test_data))
    print(k, "SGDRegressor train MSE:", score_train)
    print(k, "SGDRegressor test MSE:", score_test, "\n")
    if k >= 9:  # there are far too many splits to run them all
        break
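Leave-P-out enumerates every size-p test subset, so the number of splits grows combinatorially as C(n, p), which is why the loops above break early. A quick check of that count on a made-up 12-sample array:

```python
import numpy as np
from math import comb
from sklearn.model_selection import LeavePOut

X = np.arange(12).reshape(12, 1)  # 12 toy samples
lpo = LeavePOut(p=2)
n_splits = lpo.get_n_splits(X)
print(n_splits, comb(12, 2))  # 66 66
# With p=10 on a few thousand training rows, C(n, 10) is astronomically
# large, hence the break after the first ten iterations above.
```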
3.4 Hyperparameter Space and Tuning
1 Exhaustive grid search
# Train a random-forest model and tune it with exhaustive grid search
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()
parameters = {'n_estimators': [50, 100, 200], 'max_depth': [1, 2, 3]}
clf = GridSearchCV(rf, parameters, cv=5)
clf.fit(train_data, train_target)
score_test = mean_squared_error(test_target, clf.predict(test_data))
print("RandomForestRegressor GridSearchCV test MSE:", score_test)
print(sorted(clf.cv_results_.keys()))  # keys include fit times and validation scores
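cv_results_ is easiest to read as a DataFrame, with one row per parameter combination. A sketch on synthetic data (make_regression stands in for the competition features):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the competition data
X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)

parameters = {'n_estimators': [50, 100], 'max_depth': [2, 3]}
clf = GridSearchCV(RandomForestRegressor(random_state=0), parameters, cv=3)
clf.fit(X, y)

# One row per parameter combination, with mean CV score and rank
results = pd.DataFrame(clf.cv_results_)
print(results[['param_n_estimators', 'param_max_depth',
               'mean_test_score', 'rank_test_score']])
print(clf.best_params_)
```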
2 Randomized parameter search
# Train a random-forest model and tune it with randomized parameter search
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()
parameters = {'n_estimators': [50, 100, 200, 300], 'max_depth': [1, 2, 3, 4, 5]}
clf = RandomizedSearchCV(rf, parameters, cv=5)
clf.fit(train_data, train_target)
print('Best parameters found are:', clf.best_params_)
score_test = mean_squared_error(test_target, clf.predict(test_data))
print("RandomForestRegressor RandomizedSearchCV test MSE:", score_test)
print(sorted(clf.cv_results_.keys()))  # keys include fit times and validation scores
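RandomizedSearchCV also accepts distributions rather than lists, which is where it beats grid search: candidates are sampled instead of enumerated. A sketch on synthetic data, with made-up parameter ranges:

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the competition data
X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)

# Sample 8 candidates from discrete distributions instead of a full grid
param_dist = {'n_estimators': randint(50, 200), 'max_depth': randint(1, 6)}
clf = RandomizedSearchCV(RandomForestRegressor(random_state=0), param_dist,
                         n_iter=8, cv=3, random_state=0)
clf.fit(X, y)
print(clf.best_params_)
```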
3 LightGBM tuning
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

estimator = lgb.LGBMRegressor(num_leaves=31)
parameters = {'learning_rate': [0.01, 0.1, 1], 'n_estimators': [20, 40]}
clf = GridSearchCV(estimator, parameters, cv=5)
clf.fit(train_data, train_target)
print('Best parameters found are:', clf.best_params_)
score_test = mean_squared_error(test_target, clf.predict(test_data))
print("LGBMRegressor GridSearchCV test MSE:", score_test)
3.5 Learning Curves and Validation Curves
1 Learning curves
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import learning_curve

train_data_file = "./zhengqi_train.txt"
test_data_file = "./zhengqi_test.txt"
train_data = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
test_data = pd.read_csv(test_data_file, sep='\t', encoding='utf-8')

plt.figure(figsize=(18, 10), dpi=150)

def plot_learning_curve(estimator, title, x, y, ylim=None, cv=None, n_jobs=1,
                        train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, x, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color='r')
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color='g')
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r', label="training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color='g', label="cross-validation score")
    plt.legend(loc="best")
    return plt

x = train_data[test_data.columns].values
y = train_data['target'].values
title = "LinearRegression"
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
estimator = SGDRegressor()
plot_learning_curve(estimator, title, x, y, ylim=(0.7, 1.01), cv=cv, n_jobs=-1).show()
2 Validation curves
Plot the validation curve for an SGDRegressor trained on the data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import validation_curve

train_data_file = "./zhengqi_train.txt"
test_data_file = "./zhengqi_test.txt"
train_data = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
test_data = pd.read_csv(test_data_file, sep='\t', encoding='utf-8')

x = train_data[test_data.columns].values
y = train_data['target'].values

param_range = [0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001]
train_scores, test_scores = validation_curve(
    SGDRegressor(max_iter=1000, tol=1e-3, penalty='l1'), x, y,
    param_name='alpha', param_range=param_range, cv=10, scoring='r2', n_jobs=1)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.title("Validation Curve with SGDRegressor")
plt.xlabel("alpha")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
plt.semilogx(param_range, train_scores_mean, label="Training score", color='r')
plt.fill_between(param_range, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.2, color='r')
plt.semilogx(param_range, test_scores_mean, label="Cross-validation score", color='g')
plt.fill_between(param_range, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.2, color='g')
plt.legend(loc="best")
plt.show()
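The best alpha can also be read off the curve programmatically rather than by eye. A sketch on synthetic data (make_regression stands in for the competition features):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import validation_curve

# Synthetic stand-in for the competition data
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

param_range = [0.1, 0.01, 0.001, 0.0001]
train_scores, test_scores = validation_curve(
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=0), X, y,
    param_name='alpha', param_range=param_range, cv=5, scoring='r2')
mean_test = np.mean(test_scores, axis=1)
best_alpha = param_range[int(np.argmax(mean_test))]
print(best_alpha)  # alpha with the highest mean cross-validation R^2
```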
Next post: "Tianchi Learning Competition: Industrial Steam Volume Prediction 5 — Feature Optimization"