By:HEHE

This example is based on: regression analysis of concrete compressive strength

# Imports
import os
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns
%matplotlib inline
warnings.filterwarnings('ignore')

1. Basic data overview

# Path to the data file (two directory levels above the working directory)
path_dir = os.path.dirname(os.path.dirname(os.getcwd()))
path_data = os.path.join(path_dir, 'concrete_data.xls')
# Load the data
data = pd.read_excel(path_data)
# Look at the first rows
data.head()
Cement (component 1)(kg in a m^3 mixture) Blast Furnace Slag (component 2)(kg in a m^3 mixture) Fly Ash (component 3)(kg in a m^3 mixture) Water (component 4)(kg in a m^3 mixture) Superplasticizer (component 5)(kg in a m^3 mixture) Coarse Aggregate (component 6)(kg in a m^3 mixture) Fine Aggregate (component 7)(kg in a m^3 mixture) Age (day) Concrete compressive strength(MPa, megapascals)
0 540.0 0.0 0.0 162.0 2.5 1040.0 676.0 28 79.986111
1 540.0 0.0 0.0 162.0 2.5 1055.0 676.0 28 61.887366
2 332.5 142.5 0.0 228.0 0.0 932.0 594.0 270 40.269535
3 332.5 142.5 0.0 228.0 0.0 932.0 594.0 365 41.052780
4 198.6 132.4 0.0 192.0 0.0 978.4 825.5 360 44.296075
# Rename the columns
data.columns = ['cement_component', 'furnace_slag', 'flay_ash', 'water_component', 'superplasticizer',
                'coarse_aggregate', 'fine_aggregate', 'age', 'concrete_strength']
data.head()
cement_component furnace_slag flay_ash water_component superplasticizer coarse_aggregate fine_aggregate age concrete_strength
0 540.0 0.0 0.0 162.0 2.5 1040.0 676.0 28 79.986111
1 540.0 0.0 0.0 162.0 2.5 1055.0 676.0 28 61.887366
2 332.5 142.5 0.0 228.0 0.0 932.0 594.0 270 40.269535
3 332.5 142.5 0.0 228.0 0.0 932.0 594.0 365 41.052780
4 198.6 132.4 0.0 192.0 0.0 978.4 825.5 360 44.296075
# Check column types and non-null counts
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
cement_component     1030 non-null float64
furnace_slag         1030 non-null float64
flay_ash             1030 non-null float64
water_component      1030 non-null float64
superplasticizer     1030 non-null float64
coarse_aggregate     1030 non-null float64
fine_aggregate       1030 non-null float64
age                  1030 non-null int64
concrete_strength    1030 non-null float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB
# Check summary statistics
data.describe()
cement_component furnace_slag flay_ash water_component superplasticizer coarse_aggregate fine_aggregate age concrete_strength
count 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000
mean 281.165631 73.895485 54.187136 181.566359 6.203112 972.918592 773.578883 45.662136 35.817836
std 104.507142 86.279104 63.996469 21.355567 5.973492 77.753818 80.175427 63.169912 16.705679
min 102.000000 0.000000 0.000000 121.750000 0.000000 801.000000 594.000000 1.000000 2.331808
25% 192.375000 0.000000 0.000000 164.900000 0.000000 932.000000 730.950000 7.000000 23.707115
50% 272.900000 22.000000 0.000000 185.000000 6.350000 968.000000 779.510000 28.000000 34.442774
75% 350.000000 142.950000 118.270000 192.000000 10.160000 1029.400000 824.000000 56.000000 46.136287
max 540.000000 359.400000 200.100000 247.000000 32.200000 1145.000000 992.600000 365.000000 82.599225

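Summary of the basic data overview:

  1. The dataset has 1030 rows and 8 features; the target is concrete_strength
  2. The dataset has no missing values, and every column is numeric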
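
Both points can also be checked programmatically instead of being read off the `info()` output. The following is a minimal sketch that uses a small made-up DataFrame (`toy`, with hypothetical values) in place of the real data; on the actual dataset the same calls would be applied to `data`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for the real data; the values are made up.
toy = pd.DataFrame({
    'cement_component': [540.0, 332.5, 198.6],
    'age': [28, 270, 360],
    'concrete_strength': [79.99, 40.27, 44.30],
})

n_rows, n_cols = toy.shape
# Total number of missing cells across all columns
n_missing = int(toy.isnull().sum().sum())
# True only if every column has a numeric dtype
all_numeric = all(is_numeric_dtype(dtype) for dtype in toy.dtypes)

print('%d rows, %d features plus the target' % (n_rows, n_cols - 1))
print('missing values: %d, all numeric: %s' % (n_missing, all_numeric))
```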

2. EDA (exploratory data analysis)

2.1 concrete_strength

sns.distplot(data['concrete_strength'], bins = 20, color = 'red')
<matplotlib.axes._subplots.AxesSubplot at 0x213da2c2080>

concrete_strength: the distribution looks roughly normal, with a slight right skew
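
The skew can be quantified rather than judged from the plot alone. The sketch below computes the sample skewness with pandas on a synthetic right-skewed sample (`sample` is hypothetical, not the concrete data). A positive value indicates a longer right tail.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (hypothetical, not the concrete data).
rng = np.random.default_rng(0)
sample = pd.Series(rng.gamma(shape=4.0, scale=9.0, size=1000))

# Sample skewness: > 0 means a longer right tail, < 0 a longer left tail
skewness = sample.skew()
print('skewness: %.3f' % skewness)
```

On the real data the equivalent call would be `data['concrete_strength'].skew()`.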

2.2 features

plt.figure(figsize=(15, 10.5))
plot_count = 1
for feature in list(data.columns)[:-1]:
    plt.subplot(3, 3, plot_count)
    plt.scatter(data[feature], data['concrete_strength'])
    plt.xlabel(feature.replace('_', ' ').title())
    plt.ylabel('Concrete strength')
    plot_count += 1
plt.show()

plt.figure(figsize=(9,9))
corrmat = data.corr()
sns.heatmap(corrmat, vmax= 0.8, square = True, )
<matplotlib.axes._subplots.AxesSubplot at 0x213ddc4e7b8>

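EDA summary:

  1. None of the correlations are strong
  2. cement_component, water_component, superplasticizer and age appear to show somewhat higher correlation
  3. Since there are only a few features, it is worth trying both these four features and the full feature set
  4. No outliers were found
  5. Whether to standardize the data has not been decided yet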
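
Feature selection by correlation can be made explicit by ranking the absolute correlation of each feature with the target. This is a minimal sketch on synthetic data; the column names `strong_feature`, `weak_feature` and `target` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): one feature drives the target, the other barely does.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'strong_feature': rng.normal(size=n),
    'weak_feature': rng.normal(size=n),
})
toy['target'] = 2.0 * toy['strong_feature'] + 0.1 * toy['weak_feature'] + rng.normal(size=n)

# Absolute Pearson correlation of each feature with the target, highest first
ranking = (toy.corr()['target']
           .drop('target')
           .abs()
           .sort_values(ascending=False))
print(ranking)
```

On the real data this would be `data.corr()['concrete_strength'].drop('concrete_strength')`. Pearson correlation only measures linear association, so a low value does not rule out a nonlinear relationship.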

3. model

Experiment: predict concrete strength using the features selected above as well as all features, and compare different regression algorithms

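from sklearn.model_selection import train_test_split
# Split the data into training and test sets, optionally restricted to the given features
def split_train_test(data, features=None, test_ratio=0.2):
    y = data['concrete_strength']
    if features is not None:
        x = data[features]
    else:
        x = data.drop(['concrete_strength'], axis=1)
    if test_ratio == 0:
        # Newer scikit-learn versions reject test_size=0, so shuffle the rows
        # and return everything as the training set (with an empty test set).
        x = x.sample(frac=1)
        y = y.loc[x.index]
        return x, x.iloc[:0], y, y.iloc[:0]
    train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=test_ratio)
    return train_x, test_x, train_y, test_y
# Training set and test set (test_ratio = 0 keeps every row for cross-validation)
train_x, test_x, train_y, test_y = split_train_test(data, test_ratio=0)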
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import r2_score
# Print the mean cross-validated R2 score of each regressor
def data_cross_val(x, y, clfs, clfs_name, cv=5):
    for i, clf in enumerate(clfs):
        scores = cross_val_score(estimator=clf, X=x, y=y, cv=cv, scoring='r2')
        print(clfs_name[i])
        print('the R2 score: %f' % np.mean(scores))

3.1 Regression on all features

clfs = [LinearRegression(), Ridge(), Lasso(), ElasticNet(), GradientBoostingRegressor(), SVR()]
clfs_name = ['LinearRegression', 'Ridge', 'Lasso', 'ElasticNet', 'GradientBoostingRegressor', 'SVR']
data_cross_val(train_x, train_y, clfs,clfs_name, cv = 5)
LinearRegression
the R2 score: 0.604974
Ridge
the R2 score: 0.604974
Lasso
the R2 score: 0.605090
ElasticNet
the R2 score: 0.605220
GradientBoostingRegressor
the R2 score: 0.908837
SVR
the R2 score: 0.023249

Conclusion: the single regressors still fall well short of gradient boosting. Bagging and stacking are worth trying next, as is adding more features.
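
As an illustration of the stacking idea, the sketch below combines a linear model and gradient boosting with scikit-learn's `StackingRegressor` and scores it with the same 5-fold R2 metric. It is only a sketch under assumptions: the data (`X`, `y`) is synthetic, and the choice of base estimators and of `Ridge` as the final estimator is arbitrary rather than something the original experiment used.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic nonlinear regression data (hypothetical, not the concrete data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 4))
y = (10 * X[:, 0] + 5 * np.sin(6 * X[:, 1]) + X[:, 2] * X[:, 3]
     + rng.normal(scale=0.5, size=300))

# The base estimators' out-of-fold predictions become the inputs of the final estimator.
stack = StackingRegressor(
    estimators=[('linear', LinearRegression()),
                ('gbr', GradientBoostingRegressor(random_state=0))],
    final_estimator=Ridge(),
    cv=5,
)
scores = cross_val_score(stack, X, y, cv=5, scoring='r2')
print('Stacking')
print('the R2 score: %f' % np.mean(scores))
```

Bagging could be tried the same way by wrapping a base regressor in `sklearn.ensemble.BaggingRegressor` and passing it to the same cross-validation call.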

3.2 Regression on the selected correlated features

# Training set and test set
features = ['cement_component','water_component','superplasticizer','age']
train_x, test_x, train_y, test_y = split_train_test(data, features, test_ratio = 0)
clfs = [LinearRegression(), Ridge(), Lasso(), ElasticNet(), GradientBoostingRegressor(), SVR()]
clfs_name = ['LinearRegression', 'Ridge', 'Lasso', 'ElasticNet', 'GradientBoostingRegressor', 'SVR']
data_cross_val(train_x, train_y, clfs,clfs_name, cv = 5)
LinearRegression
the R2 score: 0.485046
Ridge
the R2 score: 0.485045
Lasso
the R2 score: 0.484828
ElasticNet
the R2 score: 0.484840
GradientBoostingRegressor
the R2 score: 0.830816
SVR
the R2 score: 0.043992

Summary: for now, regression on the selected correlated features does worse than regression on all features, because too few features remain.
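
SVR scores close to zero in both experiments. One plausible factor, given that standardization was left undecided, is that the default RBF-kernel SVR is sensitive to feature scale. The sketch below compares SVR with and without a `StandardScaler` step on synthetic data whose two features live on very different scales. The data is hypothetical, and whether scaling helps on the concrete dataset would need to be verified by running it there.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Two features on very different scales (hypothetical data).
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(100, 500, 300), rng.uniform(0, 30, 300)])
y = 0.1 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=2.0, size=300)

# Same estimator, once on raw features and once behind a scaler.
# Putting the scaler inside the pipeline refits it on each training fold only.
raw_r2 = cross_val_score(SVR(), X, y, cv=5, scoring='r2').mean()
scaled_model = make_pipeline(StandardScaler(), SVR())
scaled_r2 = cross_val_score(scaled_model, X, y, cv=5, scoring='r2').mean()

print('SVR raw R2: %f' % raw_r2)
print('SVR scaled R2: %f' % scaled_r2)
```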

3.3 Single-feature linear regression

plt.figure(figsize=(15, 7))
plot_count = 1
for feature in ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']:
    data_tr = data[['concrete_strength', feature]]
    x_train, x_test, y_train, y_test = split_train_test(data_tr, [feature])
    # Create linear regression object
    regr = LinearRegression()
    # Train the model using the training sets
    regr.fit(x_train, y_train)
    y_pred = regr.predict(x_test)
    # Plot outputs
    plt.subplot(2, 3, plot_count)
    plt.scatter(x_test, y_test, color='black')
    plt.plot(x_test, y_pred, color='blue', linewidth=3)
    plt.xlabel(feature.replace('_', ' ').title())
    plt.ylabel('Concrete strength')
    print(feature, r2_score(y_test, y_pred))
    plot_count += 1
plt.show()
cement_component 0.24550132796330282
flay_ash 0.012228585601186226
water_component 0.09828887425075417
superplasticizer 0.11471267678235075
coarse_aggregate 0.02046823335033021

features = ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
data_tr = data
# Keep only the rows in which no column is zero
data_tr = data_tr[(data_tr.T != 0).all()]
x_train, x_test, y_train, y_test = split_train_test(data_tr, features)
# Create linear regression object
regr = LinearRegression()
# Train the model using the training sets
regr.fit(x_train, y_train)
y_pred = regr.predict(x_test)
plt.scatter(range(len(y_test)), y_test, color='black')
plt.plot(y_pred, color='blue', linewidth=3)
print('Features: %s' % str(features))
print('R2 score: %f' % r2_score(y_test, y_pred))
print('Intercept: %f' % regr.intercept_)
print('Coefficients: %s' % str(regr.coef_))
Features: ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
R2 score: 0.155569
Intercept: 84.481913
Coefficients: [ 0.04304209 -0.02577486 -0.1747249   0.15980663 -0.02633656]

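alphas = np.arange(0.1, 5, 0.1)
model = Ridge()
cv = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
y_pred = cv.fit(x_train, y_train).predict(x_test)
plt.scatter(range(len(y_test)), y_test, color='black')
plt.plot(y_pred, color='blue', linewidth=3)
print('Features: %s' % str(features))
print('R2 score: %f' % r2_score(y_test, y_pred))
# Note: regr is still the LinearRegression fitted earlier, so the two lines below
# repeat its intercept and coefficients. The tuned Ridge model's values are
# cv.best_estimator_.intercept_ and cv.best_estimator_.coef_.
print('Intercept: %f' % regr.intercept_)
print('Coefficients: %s' % str(regr.coef_))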
Features: ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
R2 score: 0.155562
Intercept: 84.481913
Coefficients: [ 0.04304209 -0.02577486 -0.1747249   0.15980663 -0.02633656]
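
The intercept and coefficients of a tuned model live on the fitted search object, not on an estimator fitted earlier. By default (`refit=True`), `GridSearchCV` refits the best parameter setting on the full training data and exposes the result as `best_estimator_`. The sketch below shows this on synthetic data; `X`, `y` and the true coefficients are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic linear data with known intercept 4.0 and coefficients (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=200)

search = GridSearchCV(estimator=Ridge(),
                      param_grid={'alpha': np.arange(0.1, 5, 0.1)},
                      cv=5)
search.fit(X, y)

# Ridge model refit on all of X, y with the best alpha found by the search
best = search.best_estimator_
print('Best alpha: %f' % search.best_params_['alpha'])
print('Intercept: %f' % best.intercept_)
print('Coefficients: %s' % str(best.coef_))
```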

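model = Lasso()
cv = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
y_pred = cv.fit(x_train, y_train).predict(x_test)
plt.scatter(range(len(y_test)), y_test, color='black')
plt.plot(y_pred, color='blue', linewidth=3)
print('Features: %s' % str(features))
print('R2 score: %f' % r2_score(y_test, y_pred))
# Note: regr is still the LinearRegression fitted earlier, so the two lines below
# repeat its intercept and coefficients. The tuned Lasso model's values are
# cv.best_estimator_.intercept_ and cv.best_estimator_.coef_.
print('Intercept: %f' % regr.intercept_)
print('Coefficients: %s' % str(regr.coef_))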
Features: ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
R2 score: 0.151682
Intercept: 84.481913
Coefficients: [ 0.04304209 -0.02577486 -0.1747249   0.15980663 -0.02633656]

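model = ElasticNet()
cv = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
y_pred = cv.fit(x_train, y_train).predict(x_test)
plt.scatter(range(len(y_test)), y_test, color='black')
plt.plot(y_pred, color='blue', linewidth=3)
print('Features: %s' % str(features))
print('R2 score: %f' % r2_score(y_test, y_pred))
# Note: regr is still the LinearRegression fitted earlier, so the two lines below
# repeat its intercept and coefficients. The tuned ElasticNet model's values are
# cv.best_estimator_.intercept_ and cv.best_estimator_.coef_.
print('Intercept: %f' % regr.intercept_)
print('Coefficients: %s' % str(regr.coef_))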
Features: ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
R2 score: 0.151796
Intercept: 84.481913
Coefficients: [ 0.04304209 -0.02577486 -0.1747249   0.15980663 -0.02633656]

plt.figure(figsize=(15, 7))
plot_count = 1
for feature in ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']:
    data_tr = data[['concrete_strength', feature]]
    # Keep only the rows in which neither column is zero
    data_tr = data_tr[(data_tr.T != 0).all()]
    x_train, x_test, y_train, y_test = split_train_test(data_tr, [feature])
    # Create gradient boosting regressor object
    regr = GradientBoostingRegressor()
    # Train the model using the training sets
    regr.fit(x_train, y_train)
    y_pred = regr.predict(x_test)
    # Plot outputs
    plt.subplot(2, 3, plot_count)
    plt.scatter(x_test, y_test, color='black')
    plt.plot(x_test, y_pred, color='blue', linewidth=3)
    plt.xlabel(feature.replace('_', ' ').title())
    plt.ylabel('Concrete strength')
    print(feature, r2_score(y_test, y_pred))
    plot_count += 1
plt.show()
cement_component 0.35248985320039705
flay_ash 0.17319875701989795
water_component 0.285023360910455
superplasticizer 0.19306275412216778
coarse_aggregate 0.17712532312647877

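# Note: x_train and x_test were overwritten by the loop above and now hold only
# the last feature (coarse_aggregate), so the score below matches that feature's
# score even though the full feature list is printed.
model = GradientBoostingRegressor()
y_pred = model.fit(x_train, y_train).predict(x_test)
plt.scatter(range(len(y_test)), y_test, color='black')
plt.plot(y_pred, color='blue', linewidth=3)
print('Features: %s' % str(features))
print('R2 score: %f' % r2_score(y_test, y_pred))
#print('Intercept: %f'%regr.intercept_)
#print('Coefficients: %s'%str(regr.coef_))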
Features: ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
R2 score: 0.177125

plt.figure(figsize=(15, 7))
plot_count = 1
for feature in ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']:
    data_tr = data[['concrete_strength', feature]]
    # Keep only the rows in which neither column is zero
    data_tr = data_tr[(data_tr.T != 0).all()]
    x_train, x_test, y_train, y_test = split_train_test(data_tr, [feature])
    # Create linear-kernel support vector regressor object
    regr = SVR(kernel='linear')
    # Train the model using the training sets
    regr.fit(x_train, y_train)
    y_pred = regr.predict(x_test)
    # Plot outputs
    plt.subplot(2, 3, plot_count)
    plt.scatter(x_test, y_test, color='black')
    plt.plot(x_test, y_pred, color='blue', linewidth=3)
    plt.xlabel(feature.replace('_', ' ').title())
    plt.ylabel('Concrete strength')
    print(feature, r2_score(y_test, y_pred))
    plot_count += 1
plt.show()
cement_component 0.2054832593541437
flay_ash -0.044636249705873654
water_component 0.07749271320026574
superplasticizer 0.0671220299245393
coarse_aggregate 0.016036478490831563

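# Note: x_train and x_test were overwritten by the loop above and now hold only
# the last feature (coarse_aggregate), so the score below matches that feature's
# score even though the full feature list is printed.
model = SVR(kernel='linear')
y_pred = model.fit(x_train, y_train).predict(x_test)
plt.scatter(range(len(y_test)), y_test, color='black')
plt.plot(y_pred, color='blue', linewidth=3)
print('Features: %s' % str(features))
print('R2 score: %f' % r2_score(y_test, y_pred))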
Features: ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
R2 score: 0.016036

4. Predicting concrete_strength from cement_component and from water_component

feature = 'cement_component'
cc_new_data = np.array([[213.5]])
data_tr = data[['concrete_strength', feature]]
# Keep only the rows in which neither column is zero
data_tr = data_tr[(data_tr.T != 0).all()]
x_train, x_test, y_train, y_test = split_train_test(data_tr, [feature])
regr = GradientBoostingRegressor()
# Train the model using the training sets
regr.fit(x_train, y_train)
cs_pred = regr.predict(cc_new_data)
# predict returns a length-1 array, so take its single element for formatting
print('Predicted value of concrete strength: %f' % cs_pred[0])
Predicted value of concrete strength: 36.472380
feature = 'water_component'
wc_new_data = np.array([[200]])
data_tr = data[['concrete_strength', feature]]
# Keep only the rows in which neither column is zero
data_tr = data_tr[(data_tr.T != 0).all()]
x_train, x_test, y_train, y_test = split_train_test(data_tr, [feature])
regr = GradientBoostingRegressor()
# Train the model using the training sets
regr.fit(x_train, y_train)
cs_pred = regr.predict(wc_new_data)
# predict returns a length-1 array, so take its single element for formatting
print('Predicted value of concrete strength: %f' % cs_pred[0])
Predicted value of concrete strength: 32.648425

Reposted from: https://www.cnblogs.com/llssx/p/10612940.html
