Regression Analysis Walkthrough (Practice)
By:HEHE
This example is based on: regression analysis of concrete compressive strength
# Imports
import os
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns
%matplotlib inline
warnings.filterwarnings('ignore')
1. Basic data overview
# path
path_dir = os.path.dirname(os.path.dirname(os.getcwd()))
path_data = os.path.join(path_dir, 'concrete_data.xls')
# load_data
data = pd.read_excel(path_data)
# Inspect the first rows
data.head()
| | Cement (component 1)(kg in a m^3 mixture) | Blast Furnace Slag (component 2)(kg in a m^3 mixture) | Fly Ash (component 3)(kg in a m^3 mixture) | Water (component 4)(kg in a m^3 mixture) | Superplasticizer (component 5)(kg in a m^3 mixture) | Coarse Aggregate (component 6)(kg in a m^3 mixture) | Fine Aggregate (component 7)(kg in a m^3 mixture) | Age (day) | Concrete compressive strength(MPa, megapascals) |
---|---|---|---|---|---|---|---|---|---|
0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.986111 |
1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.887366 |
2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.269535 |
3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.052780 |
4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.296075 |
# Rename the columns
data.columns = ['cement_component', 'furnace_slag', 'flay_ash', 'water_component', 'superplasticizer',
                'coarse_aggregate', 'fine_aggregate', 'age', 'concrete_strength']
data.head()
| | cement_component | furnace_slag | flay_ash | water_component | superplasticizer | coarse_aggregate | fine_aggregate | age | concrete_strength |
---|---|---|---|---|---|---|---|---|---|
0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.986111 |
1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.887366 |
2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.269535 |
3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.052780 |
4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.296075 |
# Inspect column types and non-null counts
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
cement_component 1030 non-null float64
furnace_slag 1030 non-null float64
flay_ash 1030 non-null float64
water_component 1030 non-null float64
superplasticizer 1030 non-null float64
coarse_aggregate 1030 non-null float64
fine_aggregate 1030 non-null float64
age 1030 non-null int64
concrete_strength 1030 non-null float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB
# Summary statistics
data.describe()
| | cement_component | furnace_slag | flay_ash | water_component | superplasticizer | coarse_aggregate | fine_aggregate | age | concrete_strength |
---|---|---|---|---|---|---|---|---|---|
count | 1030.000000 | 1030.000000 | 1030.000000 | 1030.000000 | 1030.000000 | 1030.000000 | 1030.000000 | 1030.000000 | 1030.000000 |
mean | 281.165631 | 73.895485 | 54.187136 | 181.566359 | 6.203112 | 972.918592 | 773.578883 | 45.662136 | 35.817836 |
std | 104.507142 | 86.279104 | 63.996469 | 21.355567 | 5.973492 | 77.753818 | 80.175427 | 63.169912 | 16.705679 |
min | 102.000000 | 0.000000 | 0.000000 | 121.750000 | 0.000000 | 801.000000 | 594.000000 | 1.000000 | 2.331808 |
25% | 192.375000 | 0.000000 | 0.000000 | 164.900000 | 0.000000 | 932.000000 | 730.950000 | 7.000000 | 23.707115 |
50% | 272.900000 | 22.000000 | 0.000000 | 185.000000 | 6.350000 | 968.000000 | 779.510000 | 28.000000 | 34.442774 |
75% | 350.000000 | 142.950000 | 118.270000 | 192.000000 | 10.160000 | 1029.400000 | 824.000000 | 56.000000 | 46.136287 |
max | 540.000000 | 359.400000 | 200.100000 | 247.000000 | 32.200000 | 1145.000000 | 992.600000 | 365.000000 | 82.599225 |
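Summary of the data overview:
- The dataset has 1030 rows and 8 features; the target is concrete_strength
- The dataset has no missing values, and every column is numeric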
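The two points above can also be checked explicitly instead of being read off `data.info()`. The following is a minimal sketch, assuming a tiny made-up frame (`frame`, with one deliberately missing value) in place of the real `data`; only the column names are taken from the post.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; one value is deliberately missing.
frame = pd.DataFrame({
    'cement_component': [540.0, 332.5, np.nan],
    'age': [28, 270, 365],
})

# Number of missing values in each column.
missing_per_column = frame.isnull().sum()
# Names of columns whose dtype is not numeric.
non_numeric = frame.select_dtypes(exclude='number').columns.tolist()

print(missing_per_column)
print('non-numeric columns:', non_numeric)
```

On the real dataset both checks should come back empty if the summary above holds.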
2. EDA (exploratory data analysis)
2.1 concrete_strength
sns.distplot(data['concrete_strength'], bins = 20, color = 'red')
<matplotlib.axes._subplots.AxesSubplot at 0x213da2c2080>
concrete_strength: the distribution looks roughly normal, with a slight right skew
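The skew read off the histogram can be quantified. The sketch below is only an illustration: it assumes a synthetic right-skewed sample (`strength`, drawn from a gamma distribution) in place of `data['concrete_strength']`, so the printed numbers are not the dataset's.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for data['concrete_strength'] (assumption).
rng = np.random.default_rng(0)
strength = pd.Series(rng.gamma(shape=4.0, scale=9.0, size=1030),
                     name='concrete_strength')

# Sample skewness: positive means a longer right tail, negative a longer left tail.
skewness = strength.skew()
# For a right-skewed sample the mean usually sits above the median.
mean_minus_median = strength.mean() - strength.median()

print('skewness: %.3f' % skewness)
print('mean - median: %.3f' % mean_minus_median)
```

On the real column, `data['concrete_strength'].skew()` would give the corresponding figure; the describe() table above already shows a mean (35.82) slightly above the median (34.44), which is consistent with a mild right skew.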
2.2 features
plt.figure(figsize = (15,10.5))
plot_count = 1
for feature in list(data.columns)[:-1]:
    plt.subplot(3,3, plot_count)
    plt.scatter(data[feature], data['concrete_strength'])
    plt.xlabel(feature.replace('_',' ').title())
    plt.ylabel('Concrete strength')
    plot_count +=1
plt.show()
plt.figure(figsize=(9,9))
corrmat = data.corr()
sns.heatmap(corrmat, vmax= 0.8, square = True, )
<matplotlib.axes._subplots.AxesSubplot at 0x213ddc4e7b8>
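EDA summary:
- None of the correlations are strong
- cement_component, water_component, superplasticizer and age appear to correlate somewhat more with the target
- Since there are only a few features, it is feasible to try both these four features and the full feature set
- No outliers were found
- Whether to standardize the data is still undecided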
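The feature shortlist in the summary was read off the heatmap by eye. A ranking by absolute correlation with the target makes the same comparison explicit. This is a minimal sketch on made-up data: `frame` and its coefficients are assumptions, and only the column names come from the post.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for `data` (values are made up).
rng = np.random.default_rng(0)
n = 500
frame = pd.DataFrame({
    'cement_component': rng.normal(280, 100, n),
    'water_component': rng.normal(180, 20, n),
    'coarse_aggregate': rng.normal(970, 80, n),
})
frame['concrete_strength'] = (0.08 * frame['cement_component']
                              - 0.15 * frame['water_component']
                              + rng.normal(0, 8, n))

# Pearson correlation of every feature with the target.
target_corr = frame.corr()['concrete_strength'].drop('concrete_strength')
# Rank by absolute value so strong negative correlations are not overlooked.
ranking = target_corr.abs().sort_values(ascending=False)
print(ranking)
```

With the real data, `data.corr()['concrete_strength']` would replace the synthetic frame. Pearson correlation only captures linear association, so a low value does not rule out a useful nonlinear relationship.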
3. model
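Experiment: predict concrete strength using the features shortlisted above and, separately, all features, trying several regression algorithms in each case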
from sklearn.model_selection import train_test_split
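# Split into training and test sets, optionally restricted to the given features
def split_train_test(data, features=None, test_ratio=0.2):
    y = data['concrete_strength']
    if features is not None:
        x = data[features]
    else:
        x = data.drop(['concrete_strength'], axis=1)
    train_x, test_x, train_y, test_y = train_test_split(x, y, test_size = test_ratio)
    return train_x, test_x, train_y, test_y
# Training set and test set
# test_ratio = 0 keeps every row for cross-validation; newer scikit-learn releases may reject test_size = 0
train_x, test_x, train_y, test_y = split_train_test(data, test_ratio = 0)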
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import r2_score
def data_cross_val(x,y, clfs, clfs_name, cv= 5):
    for i,clf in enumerate(clfs):
        scores = cross_val_score(estimator=clf, X= x, y= y, cv=cv, scoring ='r2')
        print(clfs_name[i])
        print('the R2 score: %f' % np.mean(scores))
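The split above is called with `test_ratio = 0`, apparently so that every row is shuffled and handed to cross-validation. A hedged alternative is to let the cross-validation splitter do the shuffling. The sketch below assumes synthetic data from `make_regression` in place of the concrete dataset and uses a fixed `random_state` purely for repeatability.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the features and target (assumption).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Shuffle inside the splitter instead of calling train_test_split first.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=folds, scoring='r2')
print('the R2 score: %f' % np.mean(scores))
```

Passing a `KFold` object as `cv` works with the `data_cross_val` helper as well, since it forwards `cv` unchanged to `cross_val_score`.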
3.1 Regression on all features
clfs = [LinearRegression(), Ridge(), Lasso(), ElasticNet(), GradientBoostingRegressor(), SVR()]
clfs_name = ['LinearRegression', 'Ridge', 'Lasso', 'ElasticNet', 'GradientBoostingRegressor', 'SVR']
data_cross_val(train_x, train_y, clfs,clfs_name, cv = 5)
LinearRegression
the R2 score: 0.604974
Ridge
the R2 score: 0.604974
Lasso
the R2 score: 0.605090
ElasticNet
the R2 score: 0.605220
GradientBoostingRegressor
the R2 score: 0.908837
SVR
the R2 score: 0.023249
Conclusion: none of the single regressors matches gradient boosting. Bagging and stacking, or adding more features, are worth trying next.
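The bagging and stacking idea mentioned in the conclusion could look roughly like the sketch below. It is not something the post ran: the data are synthetic (`make_regression`), and the choice of base models, `n_estimators` and the `Ridge` meta-model are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (BaggingRegressor, GradientBoostingRegressor,
                              StackingRegressor)
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for train_x / train_y (assumption).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Bagging: average many linear models fitted on bootstrap samples.
bagging = BaggingRegressor(LinearRegression(), n_estimators=20, random_state=0)

# Stacking: a Ridge meta-model combines out-of-fold predictions of the base models.
stacking = StackingRegressor(
    estimators=[('lr', LinearRegression()),
                ('gbr', GradientBoostingRegressor(n_estimators=50, random_state=0))],
    final_estimator=Ridge(),
    cv=5,
)

scores = {}
for name, model in [('bagging', bagging), ('stacking', stacking)]:
    scores[name] = cross_val_score(model, X, y, cv=5, scoring='r2').mean()
    print(name, 'R2: %f' % scores[name])
```

Both ensembles are ordinary estimators, so they could be appended to the `clfs` list and scored with `data_cross_val` alongside the other models.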
3.2 Regression on a subset of correlated features
# Training set and test set
features = ['cement_component','water_component','superplasticizer','age']
train_x, test_x, train_y, test_y = split_train_test(data, features, test_ratio = 0)
clfs = [LinearRegression(), Ridge(), Lasso(), ElasticNet(), GradientBoostingRegressor(), SVR()]
clfs_name = ['LinearRegression', 'Ridge', 'Lasso', 'ElasticNet', 'GradientBoostingRegressor', 'SVR']
data_cross_val(train_x, train_y, clfs,clfs_name, cv = 5)
LinearRegression
the R2 score: 0.485046
Ridge
the R2 score: 0.485045
Lasso
the R2 score: 0.484828
ElasticNet
the R2 score: 0.484840
GradientBoostingRegressor
the R2 score: 0.830816
SVR
the R2 score: 0.043992
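Summary: for now, regressing on the subset of correlated features does worse than using all features, because the subset contains too few features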
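SVR scores close to zero in both experiments, and the EDA notes left standardization undecided. Unscaled inputs are a common cause of weak results with kernel SVR, so the comparison below may be worth running on the real data. It is a sketch under assumptions: the two synthetic columns only mimic features on very different scales, and the scores it prints say nothing about the concrete dataset.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features on very different scales (assumption: rough stand-ins
# for columns such as coarse_aggregate ~ 1000 and superplasticizer ~ 6).
rng = np.random.default_rng(0)
n = 400
X = np.column_stack([rng.normal(1000, 80, n), rng.normal(6, 3, n)])
y = 0.01 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(0, 2, n)

raw_svr = SVR()
scaled_svr = make_pipeline(StandardScaler(), SVR())

# The scaler is refitted inside each CV fold, so held-out statistics do not leak.
raw_r2 = cross_val_score(raw_svr, X, y, cv=5, scoring='r2').mean()
scaled_r2 = cross_val_score(scaled_svr, X, y, cv=5, scoring='r2').mean()
print('raw R2: %.3f, scaled R2: %.3f' % (raw_r2, scaled_r2))
```

Wrapping the scaler and the model in one pipeline keeps the preprocessing inside cross-validation. The same pipeline object can be dropped into `clfs` in place of `SVR()`.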
3.3 Single-feature linear regression
plt.figure(figsize=(15,7))
plot_count = 1
for feature in ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']:
    data_tr = data[['concrete_strength', feature]]
    x_train, x_test, y_train, y_test = split_train_test(data_tr, [feature])
    # Create linear regression object
    regr = LinearRegression()
    # Train the model using the training sets
    regr.fit(x_train, y_train)
    y_pred = regr.predict(x_test)
    # Plot outputs
    plt.subplot(2,3,plot_count)
    plt.scatter(x_test, y_test, color='black')
    plt.plot(x_test, y_pred, color='blue',linewidth=3)
    plt.xlabel(feature.replace('_',' ').title())
    plt.ylabel('Concrete strength')
    print(feature, r2_score(y_test, y_pred))
    plot_count+=1
plt.show()
cement_component 0.24550132796330282
flay_ash 0.012228585601186226
water_component 0.09828887425075417
superplasticizer 0.11471267678235075
coarse_aggregate 0.02046823335033021
features = ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
data_tr = data
# Keep only rows in which every column is non-zero
data_tr=data_tr[(data_tr.T != 0).all()]
x_train, x_test, y_train, y_test = split_train_test(data_tr, features)
# Create linear regression object
regr = LinearRegression()
# Train the model using the training sets
regr.fit(x_train, y_train)
y_pred = regr.predict(x_test)
plt.scatter(range(len(y_test)), y_test, color='black')
plt.plot(y_pred, color='blue', linewidth=3)
print('Features: %s'%str(features))
print('R2 score: %f'%r2_score(y_test, y_pred))
print('Intercept: %f'%regr.intercept_)
print('Coefficients: %s'%str(regr.coef_))
Features: ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
R2 score: 0.155569
Intercept: 84.481913
Coefficients: [ 0.04304209 -0.02577486 -0.1747249 0.15980663 -0.02633656]
alphas = np.arange(0.1,5,0.1)
model = Ridge()
cv = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
y_pred = cv.fit(x_train, y_train).predict(x_test)
plt.scatter(range(len(y_test)), y_test, color='black')
plt.plot(y_pred, color='blue', linewidth=3)
print('Features: %s'%str(features))
print('R2 score: %f'%r2_score(y_test, y_pred))
# NOTE: regr is still the LinearRegression fitted earlier, so the two lines below repeat its parameters;
# the tuned model's values are in cv.best_estimator_.intercept_ and cv.best_estimator_.coef_
print('Intercept: %f'%regr.intercept_)
print('Coefficients: %s'%str(regr.coef_))
Features: ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
R2 score: 0.155562
Intercept: 84.481913
Coefficients: [ 0.04304209 -0.02577486 -0.1747249 0.15980663 -0.02633656]
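model = Lasso()
cv = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
y_pred = cv.fit(x_train, y_train).predict(x_test)
plt.scatter(range(len(y_test)), y_test, color='black')
plt.plot(y_pred, color='blue', linewidth=3)
print('Features: %s'%str(features))
print('R2 score: %f'%r2_score(y_test, y_pred))
# NOTE: regr is still the LinearRegression fitted earlier, so the two lines below repeat its parameters;
# the tuned model's values are in cv.best_estimator_.intercept_ and cv.best_estimator_.coef_
print('Intercept: %f'%regr.intercept_)
print('Coefficients: %s'%str(regr.coef_))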
Features: ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
R2 score: 0.151682
Intercept: 84.481913
Coefficients: [ 0.04304209 -0.02577486 -0.1747249 0.15980663 -0.02633656]
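model = ElasticNet()
cv = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
y_pred = cv.fit(x_train, y_train).predict(x_test)
plt.scatter(range(len(y_test)), y_test, color='black')
plt.plot(y_pred, color='blue', linewidth=3)
print('Features: %s'%str(features))
print('R2 score: %f'%r2_score(y_test, y_pred))
# NOTE: regr is still the LinearRegression fitted earlier, so the two lines below repeat its parameters;
# the tuned model's values are in cv.best_estimator_.intercept_ and cv.best_estimator_.coef_
print('Intercept: %f'%regr.intercept_)
print('Coefficients: %s'%str(regr.coef_))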
Features: ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
R2 score: 0.151796
Intercept: 84.481913
Coefficients: [ 0.04304209 -0.02577486 -0.1747249 0.15980663 -0.02633656]
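The Ridge, Lasso and ElasticNet cells print `regr.intercept_` and `regr.coef_`, which still belong to the earlier `LinearRegression` fit. That is why the intercept and coefficients are identical across all four outputs. A minimal sketch of reading the tuned model's parameters from the search object follows, assuming synthetic data from `make_regression` and a variable named `search` (hypothetical, chosen to avoid reusing the name `cv`).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the five-feature split used above (assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

alphas = np.arange(0.1, 5, 0.1)
search = GridSearchCV(estimator=Ridge(), param_grid=dict(alpha=alphas), cv=5)
search.fit(x_train, y_train)

# With the default refit=True, best_estimator_ is refitted on all of x_train.
best = search.best_estimator_
y_pred = search.predict(x_test)
print('Best alpha: %f' % search.best_params_['alpha'])
print('R2 score: %f' % r2_score(y_test, y_pred))
print('Intercept: %f' % best.intercept_)
print('Coefficients: %s' % str(best.coef_))
```

The same pattern applies to `Lasso()` and `ElasticNet()` by swapping the estimator passed to `GridSearchCV`.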
plt.figure(figsize=(15,7))
plot_count = 1
for feature in ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']:
    data_tr = data[['concrete_strength', feature]]
    data_tr=data_tr[(data_tr.T != 0).all()]
    x_train, x_test, y_train, y_test = split_train_test(data_tr, [feature])
    # Create gradient boosting regressor
    regr = GradientBoostingRegressor()
    # Train the model using the training sets
    regr.fit(x_train, y_train)
    y_pred = regr.predict(x_test)
    # Plot outputs
    plt.subplot(2,3,plot_count)
    plt.scatter(x_test, y_test, color='black')
    plt.plot(x_test, y_pred, color='blue',linewidth=3)
    plt.xlabel(feature.replace('_',' ').title())
    plt.ylabel('Concrete strength')
    print(feature, r2_score(y_test, y_pred))
    plot_count+=1
plt.show()
cement_component 0.35248985320039705
flay_ash 0.17319875701989795
water_component 0.285023360910455
superplasticizer 0.19306275412216778
coarse_aggregate 0.17712532312647877
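model = GradientBoostingRegressor()
# NOTE: x_train and x_test were last assigned inside the loop above, so this fit uses coarse_aggregate alone,
# not the five columns in `features` (the R2 printed below equals the coarse_aggregate score)
y_pred = model.fit(x_train, y_train).predict(x_test)
plt.scatter(range(len(y_test)), y_test, color='black')
plt.plot(y_pred, color='blue',linewidth=3)
print('Features: %s'%str(features))
print('R2 score: %f'%r2_score(y_test, y_pred))
#print('Intercept: %f'%regr.intercept_)
#print('Coefficients: %s'%str(regr.coef_))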
Features: ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
R2 score: 0.177125
plt.figure(figsize=(15,7))
plot_count = 1
for feature in ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']:
    data_tr = data[['concrete_strength', feature]]
    data_tr=data_tr[(data_tr.T != 0).all()]
    x_train, x_test, y_train, y_test = split_train_test(data_tr, [feature])
    # Create support vector regressor with a linear kernel
    regr = SVR(kernel='linear')
    # Train the model using the training sets
    regr.fit(x_train, y_train)
    y_pred = regr.predict(x_test)
    # Plot outputs
    plt.subplot(2,3,plot_count)
    plt.scatter(x_test, y_test, color='black')
    plt.plot(x_test, y_pred, color='blue', linewidth=3)
    plt.xlabel(feature.replace('_',' ').title())
    plt.ylabel('Concrete strength')
    print(feature, r2_score(y_test, y_pred))
    plot_count+=1
plt.show()
cement_component 0.2054832593541437
flay_ash -0.044636249705873654
water_component 0.07749271320026574
superplasticizer 0.0671220299245393
coarse_aggregate 0.016036478490831563
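model = SVR(kernel='linear')
# NOTE: x_train and x_test were last assigned inside the loop above, so this fit uses coarse_aggregate alone,
# not the five columns in `features` (the R2 printed below equals the coarse_aggregate score)
y_pred = model.fit(x_train, y_train).predict(x_test)
plt.scatter(range(len(y_test)), y_test, color='black')
plt.plot(y_pred, color='blue', linewidth=3)
print('Features: %s'%str(features))
print('R2 score: %f'%r2_score(y_test, y_pred))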
Features: ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
R2 score: 0.016036
4. Predicting concrete_strength from cement_component and water_component
feature = 'cement_component'
cc_new_data = np.array([[213.5]])
data_tr = data[['concrete_strength', feature]]
data_tr=data_tr[(data_tr.T != 0).all()]
x_train, x_test, y_train, y_test = split_train_test(data_tr, [feature])
regr = GradientBoostingRegressor()
# Train the model using the training sets
regr.fit(x_train, y_train)
cs_pred = regr.predict(cc_new_data)
# predict returns a one-element array; index it before formatting
print('Predicted value of concrete strength: %f'%cs_pred[0])
Predicted value of concrete strength: 36.472380
feature = 'water_component'
wc_new_data = np.array([[200]])
data_tr = data[['concrete_strength', feature]]
data_tr=data_tr[(data_tr.T != 0).all()]
x_train, x_test, y_train, y_test = split_train_test(data_tr, [feature])
regr = GradientBoostingRegressor()
# Train the model using the training sets
regr.fit(x_train, y_train)
cs_pred = regr.predict(wc_new_data)
# predict returns a one-element array; index it before formatting
print('Predicted value of concrete strength: %f'%cs_pred[0])
Predicted value of concrete strength: 32.648425
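The two predictions above each use a single feature. A possible variant, not run in the post, fits one model on both columns and predicts for a single mix. The sketch below assumes made-up data with an invented linear relationship; only the column names and the query values 213.5 and 200 come from the post.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the two columns (values and relationship are made up).
rng = np.random.default_rng(0)
n = 500
frame = pd.DataFrame({
    'cement_component': rng.uniform(100, 540, n),
    'water_component': rng.uniform(120, 250, n),
})
frame['concrete_strength'] = (0.1 * frame['cement_component']
                              - 0.1 * frame['water_component']
                              + 25 + rng.normal(0, 5, n))

features = ['cement_component', 'water_component']
regr = GradientBoostingRegressor(random_state=0)
regr.fit(frame[features], frame['concrete_strength'])

# One new mix: 213.5 kg of cement and 200 kg of water per cubic metre.
new_mix = pd.DataFrame([[213.5, 200.0]], columns=features)
cs_pred = regr.predict(new_mix)
print('Predicted value of concrete strength: %f' % cs_pred[0])
```

Passing the new sample as a DataFrame with the same column names keeps the feature order explicit, which matters once more than one column is involved.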
Reposted from: https://www.cnblogs.com/llssx/p/10612940.html