1.1 Import the required packages and data

%matplotlib inline
# In Jupyter, this magic command is needed to display plots inline
import numpy as np
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
from sklearn.linear_model import RANSACRegressor, LinearRegression, TheilSenRegressor
from sklearn.metrics import explained_variance_score, mean_absolute_error, mean_squared_error, median_absolute_error, r2_score
from sklearn.svm import SVR
from sklearn.linear_model import Ridge, Lasso, ElasticNet, BayesianRidge
from sklearn.ensemble import GradientBoostingRegressor
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split, GridSearchCV

data = pd.read_csv('../cement_data.csv')
# Check the number of records: 1030 rows in total
print(len(data))
# Show the first five rows
data.head()

The data looks as follows:

Rename the column labels:

data.columns = ['cement_component', 'furnace_slag', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate', 'fine_aggregate', 'age', 'concrete_strength']
data.head()

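1.2 Feature exploration

First, visualize the relationship between each variable and concrete_strength. The plots show that many of the independent variables contain a large number of zero values. Ignoring the zeros, cement_component and superplasticizer are positively correlated with concrete strength; flay_ash, water_component, coarse_aggregate and fine_aggregate are negatively correlated with it; age shows no obvious relationship with concrete strength, and its values fall at discrete levels.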

plt.figure(figsize=(15,10.5))
plot_count = 1
# One scatter plot per feature (every column except the target) against concrete_strength
for feature in list(data.columns)[:-1]:
    plt.subplot(3,3,plot_count)
    plt.scatter(data[feature], data['concrete_strength'])
    plt.xlabel(feature.replace('_',' ').title())
    plt.ylabel('Concrete strength')
    plot_count+=1
plt.show()
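
The observation that many variables contain zeros can be checked numerically as well as visually. The sketch below counts zero entries per column on a small made-up frame; the name `toy` and its values are hypothetical stand-ins, not rows taken from cement_data.csv.

```python
import pandas as pd

# Hypothetical miniature of the data set; the values are made up for illustration
toy = pd.DataFrame({
    'cement_component': [540.0, 332.5, 198.6, 266.0],
    'furnace_slag': [0.0, 142.5, 132.4, 114.0],
    'flay_ash': [0.0, 0.0, 0.0, 0.0],
    'concrete_strength': [79.99, 40.27, 44.30, 45.85],
})

# Number and share of zero entries in each column
zero_counts = (toy == 0).sum()
zero_share = (toy == 0).mean()
print(zero_counts)
print(zero_share)
```

Applying the same two expressions to `data` would show which features are dominated by zeros before deciding whether to filter them out.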

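Next, bin age into segments and examine the Pearson correlation coefficients between each feature and the dependent variable within each segment. In the age plot above, the values fall roughly below 100, between 100 and 300, and above 300, so age is binned as follows (the code additionally splits the lowest range at 30) and the result is stored in a new age_level column.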

data.loc[data['age'] <= 30, 'age_level'] = '<30'
data.loc[((data['age'] <= 100) & (data['age'] > 30)), 'age_level'] = '30<age<100'
data.loc[((data['age'] <= 300) & (data['age'] > 100)), 'age_level'] = '100<age<300'
data.loc[data['age'] > 300, 'age_level'] = 'age>300'
data.head(20)
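
The same binning can be written in a single call with `pd.cut`. This is a sketch on a made-up `ages` frame (a hypothetical name), assuming all ages are positive; note that `pd.cut` returns a categorical column rather than plain strings.

```python
import pandas as pd

# Made-up sample of ages, chosen to hit every bin edge
ages = pd.DataFrame({'age': [3, 28, 30, 56, 100, 180, 300, 365]})

bins = [0, 30, 100, 300, float('inf')]
labels = ['<30', '30<age<100', '100<age<300', 'age>300']

# With the default right=True the intervals are (0, 30], (30, 100], (100, 300], (300, inf),
# which matches the <= / > comparisons used with data.loc
ages['age_level'] = pd.cut(ages['age'], bins=bins, labels=labels)
print(ages)
```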

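Compare the Pearson coefficients computed without grouping by age against those computed per age group. The ungrouped table shows that cement_component, superplasticizer and furnace_slag are positively correlated with concrete_strength, while water_component, coarse_aggregate, fine_aggregate and flay_ash are negatively correlated with it.

The Pearson coefficient table grouped by age shows that, at 100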

# numeric_only=True skips the string column age_level (needed with pandas >= 2.0)
all_correlations = data.corr(method='pearson', numeric_only=True)
print(all_correlations)
print('---------------------------------------------------------------------------------------')
column = ['cement_component', 'furnace_slag', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate', 'fine_aggregate', 'age_level', 'concrete_strength']
# Compute the Pearson correlation coefficients within each age_level group
correlations = data[column].groupby('age_level').corr(method='pearson')
print(correlations)
                   cement_component  furnace_slag  flay_ash  water_component  superplasticizer  \
cement_component              1.000        -0.275    -0.397           -0.082             0.092
furnace_slag                 -0.275         1.000    -0.324            0.107             0.043
flay_ash                     -0.397        -0.324     1.000           -0.257             0.378
water_component              -0.082         0.107    -0.257            1.000            -0.658
superplasticizer              0.092         0.043     0.378           -0.658             1.000
coarse_aggregate             -0.109        -0.284    -0.010           -0.182            -0.266
fine_aggregate               -0.223        -0.282     0.079           -0.451             0.223
age                           0.082        -0.044    -0.154            0.278            -0.193
concrete_strength             0.498         0.135    -0.106           -0.290             0.366

                   coarse_aggregate  fine_aggregate    age  concrete_strength
cement_component             -0.109          -0.223  0.082              0.498
furnace_slag                 -0.284          -0.282 -0.044              0.135
flay_ash                     -0.010           0.079 -0.154             -0.106
water_component              -0.182          -0.451  0.278             -0.290
superplasticizer             -0.266           0.223 -0.193              0.366
coarse_aggregate              1.000          -0.178 -0.003             -0.165
fine_aggregate               -0.178           1.000 -0.156             -0.167
age                          -0.003          -0.156  1.000              0.329
concrete_strength            -0.165          -0.167  0.329              1.000
--------------------------------------------------------------------------------------------
                               cement_component  coarse_aggregate  concrete_strength  \
age_level
100<age<300 cement_component              1.000             0.544              0.558
            coarse_aggregate              0.544             1.000              0.481
            concrete_strength             0.558             0.481              1.000
            fine_aggregate               -0.509             0.033             -0.575
            flay_ash                        NaN               NaN                NaN
            furnace_slag                 -0.595            -0.390             -0.034
            superplasticizer                NaN               NaN                NaN
            water_component              -0.204            -0.782             -0.063
30<age<100  cement_component              1.000            -0.468              0.565
            coarse_aggregate             -0.468             1.000             -0.282
            concrete_strength             0.565            -0.282              1.000
            fine_aggregate               -0.213            -0.254             -0.179
            flay_ash                     -0.491             0.450             -0.216
            furnace_slag                  0.019            -0.306              0.443
            superplasticizer              0.337            -0.178              0.617
            water_component              -0.057            -0.105             -0.472
<30         cement_component              1.000            -0.057              0.534
            coarse_aggregate             -0.057             1.000             -0.227
            concrete_strength             0.534            -0.227              1.000
            fine_aggregate               -0.178            -0.199             -0.203
            flay_ash                     -0.369            -0.127             -0.092
            furnace_slag                 -0.323            -0.274              0.138
            superplasticizer              0.070            -0.316              0.383
            water_component              -0.157            -0.185             -0.325
age>300     cement_component              1.000            -0.378              0.095
            coarse_aggregate             -0.378             1.000             -0.319
            concrete_strength             0.095            -0.319              1.000
            fine_aggregate               -0.462             0.683             -0.560
            flay_ash                        NaN               NaN                NaN
            furnace_slag                 -0.569            -0.081              0.342
            superplasticizer                NaN               NaN                NaN
            water_component               0.378            -0.794              0.631

                               fine_aggregate  flay_ash  furnace_slag  superplasticizer  \
age_level
100<age<300 cement_component           -0.509       NaN        -0.595               NaN
            coarse_aggregate            0.033       NaN        -0.390               NaN
            concrete_strength          -0.575       NaN        -0.034               NaN
            fine_aggregate              1.000       NaN        -0.324               NaN
            flay_ash                      NaN       NaN           NaN               NaN
            furnace_slag               -0.324       NaN         1.000               NaN
            superplasticizer              NaN       NaN           NaN               NaN
            water_component            -0.558       NaN         0.498               NaN
30<age<100  cement_component           -0.213    -0.491         0.019             0.337
            coarse_aggregate           -0.254     0.450        -0.306            -0.178
            concrete_strength          -0.179    -0.216         0.443             0.617
            fine_aggregate              1.000     0.122        -0.309             0.285
            flay_ash                    0.122     1.000        -0.547             0.102
            furnace_slag               -0.309    -0.547         1.000             0.067
            superplasticizer            0.285     0.102         0.067             1.000
            water_component            -0.445    -0.323         0.065            -0.793
<30         cement_component           -0.178    -0.369        -0.323             0.070
            coarse_aggregate           -0.199    -0.127        -0.274            -0.316
            concrete_strength          -0.203    -0.092         0.138             0.383
            fine_aggregate              1.000     0.011        -0.288             0.151
            flay_ash                    0.011     1.000        -0.298             0.412
            furnace_slag               -0.288    -0.298         1.000             0.029
            superplasticizer            0.151     0.412         0.029             1.000
            water_component            -0.374    -0.163         0.123            -0.584
age>300     cement_component           -0.462       NaN        -0.569               NaN
            coarse_aggregate            0.683       NaN        -0.081               NaN
            concrete_strength          -0.560       NaN         0.342               NaN
            fine_aggregate              1.000       NaN        -0.419               NaN
            flay_ash                      NaN       NaN           NaN               NaN
            furnace_slag               -0.419       NaN         1.000               NaN
            superplasticizer              NaN       NaN           NaN               NaN
            water_component            -0.943       NaN         0.364               NaN

                               water_component
age_level
100<age<300 cement_component            -0.204
            coarse_aggregate            -0.782
            concrete_strength           -0.063
            fine_aggregate              -0.558
            flay_ash                       NaN
            furnace_slag                 0.498
            superplasticizer               NaN
            water_component              1.000
30<age<100  cement_component            -0.057
            coarse_aggregate            -0.105
            concrete_strength           -0.472
            fine_aggregate              -0.445
            flay_ash                    -0.323
            furnace_slag                 0.065
            superplasticizer            -0.793
            water_component              1.000
<30         cement_component            -0.157
            coarse_aggregate            -0.185
            concrete_strength           -0.325
            fine_aggregate              -0.374
            flay_ash                    -0.163
            furnace_slag                 0.123
            superplasticizer            -0.584
            water_component              1.000
age>300     cement_component             0.378
            coarse_aggregate            -0.794
            concrete_strength            0.631
            fine_aggregate              -0.943
            flay_ash                       NaN
            furnace_slag                 0.364
            superplasticizer               NaN
            water_component              1.000
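
In the grouped table, the NaN entries occur for flay_ash and superplasticizer in the two oldest age groups. Pearson correlation is undefined for a column with zero variance, so pandas reports NaN when a column is constant within a group. That the two columns are constant (presumably all zeros) in those groups is an assumption here; the output alone does not confirm it. The sketch below reproduces the effect on a made-up frame named `group` (hypothetical) and lists the constant columns.

```python
import pandas as pd

# Made-up group in which flay_ash is constant (all zeros)
group = pd.DataFrame({
    'flay_ash': [0.0, 0.0, 0.0, 0.0],
    'cement_component': [300.0, 350.0, 400.0, 450.0],
    'concrete_strength': [30.0, 38.0, 41.0, 52.0],
})

# Every correlation involving the constant column comes out as NaN
corr = group.corr(method='pearson')
print(corr)

# Columns with at most one distinct value have zero variance
constant_cols = group.columns[group.nunique() <= 1].tolist()
print(constant_cols)
```

Running the `nunique()` check per age_level group on `data` would show whether this explanation applies to the real table.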

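Next, plot the pairwise relationships among all variables.

import seaborn
# Keep the rows that contain at least one non-zero value
data_ = data[(data != 0).any(axis=1)]
# Restrict the plot to numeric columns; the string column age_level cannot be regressed
numeric_cols = list(data_.select_dtypes(include='number').columns)
seaborn.pairplot(data_, vars=numeric_cols, kind='reg')
plt.show()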

1.3 Regression analysis

Define a split_train_test() function to split the data

def split_train_test(data, feature, train_index=0.7):
    # train_index is the fraction of rows used for training
    train, test = train_test_split(data, test_size=1 - train_index)
    if isinstance(feature, list):
        # Several features: DataFrame.as_matrix() was removed in pandas 1.0, use to_numpy()
        x_train = train[feature].to_numpy()
        y_train = train['concrete_strength'].to_numpy()
        x_test = test[feature].to_numpy()
        y_test = test['concrete_strength'].to_numpy()
    else:
        # Single feature: scikit-learn expects a 2D x and a 1D target
        x_train = [[x] for x in list(train[feature])]
        y_train = list(train['concrete_strength'])
        x_test = [[x] for x in list(test[feature])]
        y_test = list(test['concrete_strength'])
    return x_train, y_train, x_test, y_test
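
Because `train_test_split` shuffles randomly, every call to split_train_test() yields a different split, so the scores reported below vary from run to run. A minimal sketch of a reproducible variant follows. The function name `split_fixed`, its `seed` parameter and the synthetic `toy` frame are assumptions introduced for illustration, not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (values are random, not from cement_data.csv)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'cement_component': rng.uniform(100, 500, 20),
    'concrete_strength': rng.uniform(10, 80, 20),
})

def split_fixed(data, feature, train_fraction=0.7, seed=42):
    """Hypothetical helper: single-feature split with a fixed random_state."""
    train, test = train_test_split(data, test_size=1 - train_fraction, random_state=seed)
    return (train[[feature]].to_numpy(), train['concrete_strength'].to_numpy(),
            test[[feature]].to_numpy(), test['concrete_strength'].to_numpy())

# Two calls with the same seed return identical splits
a = split_fixed(toy, 'cement_component')
b = split_fixed(toy, 'cement_component')
print(np.array_equal(a[0], b[0]))
```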

The univariate linear regression plots show that cement_component (0.227) and superplasticizer (0.0129) have a positive linear trend with concrete_strength, while flay_ash (0.0237), water_component (0.0727) and coarse_aggregate (0.0130) have a negative linear trend. The values in parentheses are the R2 scores on the test set.

plt.figure(figsize=(15,7))
plot_count = 1
for feature in ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']:
    data_tr = data[['concrete_strength', feature]]
    # Drop the rows where the feature or the target is 0
    data_tr = data_tr[(data_tr.T != 0).all()]
    x_train, y_train, x_test, y_test = split_train_test(data_tr, feature)
    # Create linear regression object
    regr = LinearRegression()
    # Train the model using the training sets
    regr.fit(x_train, y_train)
    y_pred = regr.predict(x_test)
    # Plot outputs
    plt.subplot(2,3,plot_count)
    plt.scatter(x_test, y_test, color='black')
    plt.plot(x_test, y_pred, color='blue', linewidth=3)
    plt.xlabel(feature.replace('_',' ').title())
    plt.ylabel('Concrete strength')
    print(feature, r2_score(y_test, y_pred))
    plot_count+=1
plt.show()

cement_component 0.22709501673033738
flay_ash 0.02372873998753655
water_component 0.07274737892115468
superplasticizer 0.01293229609021429
coarse_aggregate 0.012992870179391658
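
The R2 scores in this article are computed by `r2_score`, which evaluates 1 - SS_res / SS_tot. The sketch below reproduces that formula by hand on made-up arrays (`y_true` and `y_pred` are illustrative values, not model output). It also shows that always predicting the mean scores 0, so a negative R2 means the model does worse than that baseline.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions
y_true = np.array([30.0, 45.0, 25.0, 60.0, 40.0])
y_pred = np.array([32.0, 41.0, 30.0, 55.0, 42.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))

# Baseline: predict the mean of y_true for every sample
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))
print(r2_mean)
```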

1.4 Multivariate regression analysis

1.4.1 Linear regression

features = ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
data_tr = data
# Keep only the rows without any zero value
data_tr = data_tr[(data_tr != 0).all(axis=1)]
x_train, y_train, x_test, y_test = split_train_test(data_tr, features)
# Create linear regression object
regr = LinearRegression()
# Train the model using the training sets
regr.fit(x_train, y_train)
y_pred = regr.predict(x_test)
plt.scatter(range(len(y_test)), y_test, color='black')
plt.plot(y_pred, color='blue', linewidth=3)
print('Features: %s'%str(features))
print('R2 score: %f'%r2_score(y_test, y_pred))
print('Intercept: %f'%regr.intercept_)
print('Coefficients: %s'%str(regr.coef_))

Features: ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
R2 score: 0.114955
Intercept: 56.893169
Coefficients: [ 0.0502359 -0.03243765 -0.12711574 0.42090465 -0.0092923 ]

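1.4.2 Ridge regression

import numpy as np
from sklearn.model_selection import GridSearchCV
# Candidate regularization strengths: 0.1, 0.2, ..., 4.9
alphas = np.arange(0.1,5,0.1)
model = Ridge()
cv = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
y_pred = cv.fit(x_train, y_train).predict(x_test)
plt.scatter(range(len(y_test)), y_test, color='black')
plt.plot(y_pred, color='blue', linewidth=3)
print('Features: %s'%str(features))
print('R2 score: %f'%r2_score(y_test, y_pred))
# Read the parameters from the tuned Ridge model, not from regr (the LinearRegression of 1.4.1)
print('Intercept: %f'%cv.best_estimator_.intercept_)
print('Coefficients: %s'%str(cv.best_estimator_.coef_))

Output (the Intercept and Coefficients lines repeat the LinearRegression values from 1.4.1, because the run that produced them printed regr rather than the tuned Ridge model):

Features: ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
R2 score: 0.115025
Intercept: 56.893169
Coefficients: [ 0.0502359 -0.03243765 -0.12711574 0.42090465 -0.0092923 ]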
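
After a grid search, the selected hyperparameter and the fitted parameters are held by the search object itself, in `best_params_` and `best_estimator_`. The sketch below shows how to read them. It runs on synthetic data, and `X`, `y` and `search` are illustrative names, so the printed numbers say nothing about the concrete data set.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem with known coefficients [2.0, -1.0, 0.5]
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=80)

search = GridSearchCV(Ridge(), param_grid={'alpha': np.arange(0.1, 5, 0.1)}, cv=5)
search.fit(X, y)

# best_estimator_ is the Ridge model refit on all of X, y with the best alpha
best = search.best_estimator_
print(search.best_params_)
print(best.intercept_, best.coef_)
```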

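1.4.3 Lasso regression

model = Lasso()
cv = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
y_pred = cv.fit(x_train, y_train).predict(x_test)
plt.scatter(range(len(y_test)), y_test, color='black')
plt.plot(y_pred, color='blue', linewidth=3)
print('Features: %s'%str(features))
print('R2 score: %f'%r2_score(y_test, y_pred))
# Read the parameters from the tuned Lasso model, not from regr (the LinearRegression of 1.4.1)
print('Intercept: %f'%cv.best_estimator_.intercept_)
print('Coefficients: %s'%str(cv.best_estimator_.coef_))

Output (the Intercept and Coefficients lines repeat the LinearRegression values from 1.4.1, because the run that produced them printed regr rather than the tuned Lasso model):

Features: ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
R2 score: 0.129458
Intercept: 56.893169
Coefficients: [ 0.0502359 -0.03243765 -0.12711574 0.42090465 -0.0092923 ]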

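1.4.4 ElasticNet regression

model = ElasticNet()
cv = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
y_pred = cv.fit(x_train, y_train).predict(x_test)
plt.scatter(range(len(y_test)), y_test, color='black')
plt.plot(y_pred, color='blue', linewidth=3)
print('Features: %s'%str(features))
print('R2 score: %f'%r2_score(y_test, y_pred))
# Read the parameters from the tuned ElasticNet model, not from regr (the LinearRegression of 1.4.1)
print('Intercept: %f'%cv.best_estimator_.intercept_)
print('Coefficients: %s'%str(cv.best_estimator_.coef_))

Output (the Intercept and Coefficients lines repeat the LinearRegression values from 1.4.1, because the run that produced them printed regr rather than the tuned ElasticNet model):

Features: ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
R2 score: 0.126087
Intercept: 56.893169
Coefficients: [ 0.0502359 -0.03243765 -0.12711574 0.42090465 -0.0092923 ]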

1.4.5 GradientBoostingRegressor univariate regression

plt.figure(figsize=(15,7))
plot_count = 1
for feature in ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']:
    data_tr = data[['concrete_strength', feature]]
    # Drop the rows where the feature or the target is 0
    data_tr = data_tr[(data_tr.T != 0).all()]
    x_train, y_train, x_test, y_test = split_train_test(data_tr, feature)
    # Create gradient boosting regressor object
    regr = GradientBoostingRegressor()
    # Train the model using the training sets
    regr.fit(x_train, y_train)
    y_pred = regr.predict(x_test)
    # Plot outputs
    plt.subplot(2,3,plot_count)
    plt.scatter(x_test, y_test, color='black')
    plt.plot(x_test, y_pred, color='blue', linewidth=3)
    plt.xlabel(feature.replace('_',' ').title())
    plt.ylabel('Concrete strength')
    print(feature, r2_score(y_test, y_pred))
    plot_count+=1
plt.show()

cement_component 0.29991407621280963
flay_ash 0.07501932678751821
water_component 0.33285235447360906
superplasticizer 0.1270301345197723
coarse_aggregate 0.2679084164701997

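1.4.6 GradientBoostingRegressor multivariate regression

# Re-create the multivariate split: the loop in 1.4.5 overwrote x_train, y_train, x_test and y_test
# with single-feature data
data_tr = data[(data != 0).all(axis=1)]
x_train, y_train, x_test, y_test = split_train_test(data_tr, features)
model = GradientBoostingRegressor()
y_pred = model.fit(x_train, y_train).predict(x_test)
plt.scatter(range(len(y_test)), y_test, color='black')
plt.plot(y_pred, color='blue', linewidth=3)
print('Features: %s'%str(features))
print('R2 score: %f'%r2_score(y_test, y_pred))
# GradientBoostingRegressor has no intercept_ or coef_; feature_importances_ is its counterpart
print('Feature importances: %s'%str(model.feature_importances_))

Output (the Intercept and Coefficients lines were printed from regr, a previously fitted linear model, since GradientBoostingRegressor has no intercept_ or coef_ attributes):

Features: ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
R2 score: -0.089525
Intercept: 81.404002
Coefficients: [ 0.0523122 -0.00354028 -0.16425187 0.1049935 -0.03001721]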
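
Tree ensembles such as GradientBoostingRegressor do not expose an intercept or coefficients. What they offer instead is `feature_importances_`, which sums to 1 across the features. The sketch below illustrates this on synthetic data in which only the first feature drives the target; `X`, `y` and `gbr` are illustrative names.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: only feature 0 influences the target
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 10 * X[:, 0] + rng.normal(scale=0.1, size=200)

gbr = GradientBoostingRegressor(random_state=0).fit(X, y)

# Importances are normalized; the informative feature should dominate
print(gbr.feature_importances_)
# There is no coef_ attribute on this estimator
print(hasattr(gbr, 'coef_'))
```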

1.4.7 SVR regression

# Support vector regression with a linear kernel on the current train/test split
model = SVR(kernel='linear')
y_pred = model.fit(x_train, y_train).predict(x_test)
plt.scatter(range(len(y_test)), y_test, color='black')
plt.plot(y_pred, color='blue', linewidth=3)
print('Features: %s'%str(features))
print('R2 score: %f'%r2_score(y_test, y_pred))

Features: ['cement_component', 'flay_ash', 'water_component', 'superplasticizer', 'coarse_aggregate']
R2 score: 0.029033
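
SVR is sensitive to the scale of its inputs, and the features here span very different ranges, so standardizing them before fitting is a common step. The sketch below wraps `StandardScaler` and `SVR` in a pipeline on synthetic data. Whether scaling would improve the score on the concrete data set has not been tested here, and `X`, `y` and `scaled_svr` are illustrative names.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features on very different scales (roughly 100-500 and 0-30)
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(100, 500, 120), rng.uniform(0, 30, 120)])
y = 0.1 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=120)

# The scaler is fit on the training rows only and reused at prediction time
scaled_svr = make_pipeline(StandardScaler(), SVR(kernel='linear'))
scaled_svr.fit(X[:90], y[:90])

# score() returns the R2 on the held-out rows
score = scaled_svr.score(X[90:], y[90:])
print(score)
```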

1.5 Regression prediction

Predicting concrete strength from cement component gives concrete strength = 37.198606 when cement component = 213.5.

feature = 'cement_component'
cc_new_data = 213.5
data_tr = data[['concrete_strength', feature]]
data_tr = data_tr[(data_tr.T != 0).all()]
x_train, y_train, x_test, y_test = split_train_test(data_tr, feature)
regr = GradientBoostingRegressor()
# Train the model using the training sets
regr.fit(x_train, y_train)
# predict() expects a 2D array of shape (n_samples, n_features)
cs_pred = regr.predict([[cc_new_data]])
print('Predicted value of concrete strength: %f'%cs_pred[0])

Predicted value of concrete strength: 37.198606

Predicting concrete strength from water_component gives concrete strength = 33.020739 when water_component = 200.

feature = 'water_component'
wc_new_data = 200
data_tr = data[['concrete_strength', feature]]
data_tr = data_tr[(data_tr.T != 0).all()]
x_train, y_train, x_test, y_test = split_train_test(data_tr, feature)
regr = GradientBoostingRegressor()
# Train the model using the training sets
regr.fit(x_train, y_train)
# predict() expects a 2D array of shape (n_samples, n_features)
cs_pred = regr.predict([[wc_new_data]])
print('Predicted value of concrete strength: %f'%cs_pred[0])

Predicted value of concrete strength: 33.020739
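
scikit-learn estimators expect `predict` to receive a 2D array of shape (n_samples, n_features), even for a single value, and they return a 1D array. The sketch below shows the pattern on synthetic data; `x`, `y`, `model` and `new_value` are illustrative, and the prediction it prints is unrelated to the figures above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic single-feature data with a roughly linear trend
rng = np.random.default_rng(0)
x = rng.uniform(100, 500, 100)
y = 0.1 * x + rng.normal(scale=2.0, size=100)

model = GradientBoostingRegressor(random_state=0)
model.fit(x.reshape(-1, 1), y)   # reshape to (n_samples, 1)

new_value = 213.5
pred = model.predict(np.array([[new_value]]))   # one sample, one feature
print('Predicted value: %f' % pred[0])
```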
