


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('D:\\Py_dataset\\winequality-red.csv',sep = ';')
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.998 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.997 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.997 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.998 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.998 3.51 0.56 9.4 5


No   Attribute                          Data type   Description
1    fixed acidity                      Numeric     Fixed (non-volatile) acids
2    volatile acidity                   Numeric     Volatile acids
3    citric acid                        Numeric     Citric acid
4    residual sugar                     Numeric     Residual sugar
5    chlorides                          Numeric     Chlorides
6    free sulfur dioxide                Numeric     Free sulfur dioxide
7    total sulfur dioxide               Numeric     Total sulfur dioxide
8    density                            Numeric     Density
9    pH                                 Numeric     pH (acidity/alkalinity)
10   sulphates                          Numeric     Sulphates
11   alcohol                            Numeric     Alcohol
12   quality (score between 0 and 10)   Numeric     Wine quality (score between 0 and 10)


df.shape  # (1599, 12)
df.info()  # no missing values
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64
df.describe().T  # summary statistics, one row per column
count mean std min 25% 50% 75% max
fixed acidity 1599.0 8.320 1.741 4.600 7.100 7.900 9.200 15.900
volatile acidity 1599.0 0.528 0.179 0.120 0.390 0.520 0.640 1.580
citric acid 1599.0 0.271 0.195 0.000 0.090 0.260 0.420 1.000
residual sugar 1599.0 2.539 1.410 0.900 1.900 2.200 2.600 15.500
chlorides 1599.0 0.087 0.047 0.012 0.070 0.079 0.090 0.611
free sulfur dioxide 1599.0 15.875 10.460 1.000 7.000 14.000 21.000 72.000
total sulfur dioxide 1599.0 46.468 32.895 6.000 22.000 38.000 62.000 289.000
density 1599.0 0.997 0.002 0.990 0.996 0.997 0.998 1.004
pH 1599.0 3.311 0.154 2.740 3.210 3.310 3.400 4.010
sulphates 1599.0 0.658 0.170 0.330 0.550 0.620 0.730 2.000
alcohol 1599.0 10.423 1.066 8.400 9.500 10.200 11.100 14.900
quality 1599.0 5.636 0.808 3.000 5.000 6.000 6.000 8.000


# Set the color palette
color = sns.color_palette()
column = df.columns.tolist()
fig = plt.figure(figsize=(10, 8))
# One histogram per column, laid out on a 4 x 3 grid
for i in range(12):
    plt.subplot(4, 3, i + 1)
    df[column[i]].hist(bins=100, color=color[3])
    plt.xlabel(column[i], fontsize=12)
    plt.ylabel('Frequency', fontsize=12)
plt.tight_layout()


fig = plt.figure(figsize=(10, 8))
for i in range(12):
    plt.subplot(4, 3, i + 1)
    # Pass the column as y so each box is drawn vertically
    sns.boxplot(y=df[column[i]], width=0.5, color=color[4])
    plt.ylabel(column[i], fontsize=12)
plt.tight_layout()


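The acidity-related features in this dataset are 'fixed acidity', 'volatile acidity', 'citric acid', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', and 'pH'. The first six all influence pH. Because pH is measured on a logarithmic scale, we take the base-10 logarithm of those six features and plot their histograms.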

acidityfeat = ['fixed acidity', 'volatile acidity', 'citric acid', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide']
fig = plt.figure(figsize=(10, 6))
for i in range(6):
    plt.subplot(2, 3, i + 1)
    # Clip at 0.001 so that zero values do not produce log10(0)
    v = np.log10(np.clip(df[acidityfeat[i]].values, a_min=0.001, a_max=None))
    plt.hist(v, bins=50, color=color[0])
    plt.xlabel('log(' + acidityfeat[i] + ')', fontsize=12)
    plt.ylabel('Frequency')
plt.tight_layout()
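The `np.clip` call matters because `citric acid` has a minimum of 0.000 in the summary table, and the logarithm of zero is undefined. A minimal sketch of the effect, using a few made-up sample values (the array below is illustrative and not taken from the dataset):

```python
import numpy as np

# Hypothetical concentrations, including an exact zero like the citric acid minimum
samples = np.array([0.0, 0.04, 0.56, 1.0])

# Clip to a small positive floor so log10 never sees zero
clipped = np.clip(samples, a_min=0.001, a_max=None)
log_values = np.log10(clipped)

print(log_values)
```

Every zero is mapped to the floor value, so a spike at -3 in the citric acid panel reflects the clipping threshold rather than a real mode in the data.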

plt.figure(figsize=(6, 3))
bins = 10 ** (np.linspace(-2, 2))  # 50 logarithmically spaced bin edges from 0.01 to 100
plt.hist(df['fixed acidity'], bins=bins, edgecolor='k', label='fixed acidity')
plt.hist(df['volatile acidity'], bins=bins, edgecolor='k', label='volatile acidity')
plt.hist(df['citric acid'], bins=bins, alpha=0.8, edgecolor='k', label='citric acid')
plt.xscale('log')
plt.xlabel('Acid concentration (g/dm^3)')
plt.legend()  # show the labels passed to plt.hist
plt.title('Histogram of Acid Concentration')



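Residual sugar mainly determines how sweet a wine tastes: dry (<= 4 g/L), semi-dry (4-12 g/L), semi-sweet (12-45 g/L), and sweet (>= 45 g/L). The maximum residual sugar in this dataset is 15.5 g/L, so it contains no sweet wines.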

df['sweetness'] = pd.cut(df['residual sugar'],bins = [0,4,12,45],labels = ['dry','semi-dry','semi-sweet'])
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality sweetness
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.998 3.51 0.56 9.4 5 dry
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.997 3.20 0.68 9.8 5 dry
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.997 3.26 0.65 9.8 5 dry
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.998 3.16 0.58 9.8 6 dry
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.998 3.51 0.56 9.4 5 dry
plt.figure(figsize = (6,4))
df['sweetness'].value_counts().plot(kind = 'bar',color = color[0])
plt.xticks(rotation = 0)
print('Figure 5')
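How `pd.cut` treats values that fall exactly on a bin edge is easy to get wrong, so here is a small sketch with hypothetical residual sugar values (not rows from the dataset) placed on and around the edges:

```python
import pandas as pd

# Hypothetical residual sugar values (g/L) chosen to sit on and around the bin edges
sugar = pd.Series([1.9, 4.0, 4.1, 12.0, 15.5])

# Same bins and labels as above; intervals are right-closed by default:
# (0, 4], (4, 12], (12, 45]
sweetness = pd.cut(sugar, bins=[0, 4, 12, 45], labels=['dry', 'semi-dry', 'semi-sweet'])

print(sweetness.tolist())
```

With the default `right=True`, a wine at exactly 4 g/L is labelled dry and one at exactly 12 g/L semi-dry, which matches the "<= 4 g/L" definition of dry. Values of exactly 0 or above 45 would become NaN, but the dataset ranges from 0.9 to 15.5, so that case does not arise here.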

# Create a new feature: total acid
df['total acid'] = df['fixed acidity'] + df['volatile acidity'] + df['citric acid']
columns = df.columns.tolist()
columns
['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol','quality','total acid']
sns.set_context('notebook', font_scale=1.1)
# The eleven physicochemical features plus the new total acid feature
column = columns[0:11] + ['total acid']
plt.figure(figsize=(10, 8))
for i in range(12):
    plt.subplot(4, 3, i + 1)
    sns.boxplot(x='quality', y=column[i], data=df, color=color[1], width=0.6)
    plt.ylabel(column[i], fontsize=12)
plt.tight_layout()
print('Figure 7: Physicochemical Properties and Wine Quality by Boxplot')


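  • Wine quality is positively correlated with citric acid, sulphates, and alcohol.
  • Wine quality is negatively correlated with volatile acidity, density, and pH.
  • Residual sugar, chlorides, and sulfur dioxide have little effect on wine quality.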
plt.figure(figsize=(10, 8))
mcorr = df[column].corr()
# Use the built-in bool: the np.bool alias was removed in NumPy 1.24
mask = np.zeros_like(mcorr, dtype=bool)
# Hide the upper triangle, which mirrors the lower one
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)
g = sns.heatmap(mcorr, mask=mask, cmap=cmap, square=True, annot=True, fmt='0.2f')
print('Figure 8: Pairwise correlation plot')
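The heatmap covers the twelve physicochemical features only. To put numbers on the relationships with quality described in the bullet points, one option is to sort each feature's correlation with `quality`. This is a minimal sketch on a small synthetic frame: the helper name `rank_by_quality_correlation` and the generated data are assumptions for illustration, and on the real data the call would be `rank_by_quality_correlation(df)`.

```python
import numpy as np
import pandas as pd

def rank_by_quality_correlation(frame, target='quality'):
    """Return Pearson correlations of every numeric column with `target`, sorted descending."""
    numeric = frame.select_dtypes('number')  # skips categorical columns such as sweetness
    corr = numeric.corr()[target].drop(target)
    return corr.sort_values(ascending=False)

# Tiny synthetic frame standing in for the wine data
rng = np.random.default_rng(0)
quality = rng.integers(3, 9, size=200)
demo = pd.DataFrame({
    'quality': quality,
    'alcohol': 9 + 0.5 * quality + rng.normal(0, 0.5, size=200),
    'volatile acidity': 1.0 - 0.1 * quality + rng.normal(0, 0.1, size=200),
})

ranking = rank_by_quality_correlation(demo)
print(ranking)
```

Features at the top of the ranking move with quality, and those at the bottom move against it. Pearson correlation only captures linear association, so it complements the boxplots rather than replacing them.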



sns.set_context('notebook', font_scale=1.4)
plt.figure(figsize=(6, 4))
sns.regplot(x='density', y='alcohol', data=df, scatter_kws={'s': 10}, color=color[1])
plt.xlabel('density', fontsize=12)
plt.ylabel('alcohol', fontsize=12)
plt.xlim(0.989, 1.005)
plt.ylim(7, 16)
print('Figure 9: Density vs Alcohol')


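pH and fixed acidity have a correlation of -0.68. Because fixed acidity accounts for the overwhelming majority of the total acid content, the derived total acid feature adds little information beyond fixed acidity itself.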
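As a rough check of how dominant fixed acidity is, the mean values from the summary statistics can be plugged into a short calculation. This is a sketch based on column means only, not on per-wine ratios:

```python
# Mean concentrations (g/dm^3) copied from the summary statistics table
mean_fixed = 8.320
mean_volatile = 0.528
mean_citric = 0.271

# The mean of a sum equals the sum of the means, so this is the mean total acid
mean_total = mean_fixed + mean_volatile + mean_citric
fixed_share = mean_fixed / mean_total

print(f"fixed acidity share of total acid: {fixed_share:.1%}")
```

On average, fixed acidity makes up roughly 91% of the total, which is why total acid behaves almost identically to fixed acidity in the plots.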

['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol','total acid']
acidity_related = ['fixed acidity', 'volatile acidity', 'total sulfur dioxide', 'chlorides', 'total acid']
plt.figure(figsize=(10, 6))
for i in range(5):
    plt.subplot(2, 3, i + 1)
    sns.regplot(x='pH', y=acidity_related[i], data=df, scatter_kws={'s': 10}, color=color[1])
    plt.xlabel('pH', fontsize=12)
    plt.ylabel(acidity_related[i], fontsize=12)
plt.tight_layout()
print('Figure 10: Correlation between different acids and pH')



plt.style.use('ggplot')
# lmplot creates its own figure; `height` replaces the `size` argument of older seaborn releases
sns.lmplot(x='alcohol', y='volatile acidity', hue='quality', data=df, fit_reg=False, scatter_kws={'s': 10}, height=5)
print('Figure 11-1: Scatter plot of alcohol, volatile acidity and quality')

# One panel per quality score, three panels per row
sns.lmplot(x='alcohol', y='volatile acidity', col='quality', hue='quality', data=df, fit_reg=False, height=3, aspect=0.9, col_wrap=3, scatter_kws={'s': 20})
print("Figure 11-2: Scatter Plots of Alcohol, Volatile Acid and Quality")



sns.set_context("notebook", font_scale=1.4)
plt.figure(figsize=(6, 5))
cm = plt.get_cmap('RdBu')  # plt.cm.get_cmap was removed in recent matplotlib releases
sc = plt.scatter(df['fixed acidity'], df['citric acid'], c=df['pH'], vmin=2.6, vmax=4, s=15, cmap=cm)
bar = plt.colorbar(sc)
bar.set_label('pH', rotation = 0)
plt.xlabel('fixed acidity')
plt.ylabel('citric acid')
print('Figure 12: pH with Fixed Acidity and Citric Acid')




  • Linear regression
  • Ensemble methods
  • Boosting methods
  • Model evaluation
  • Choosing model parameters


1.1 Split features and labels

1.2 Split training and test sets

fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality sweetness total acid
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.998 3.51 0.56 9.4 5 dry 8.10
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.997 3.20 0.68 9.8 5 dry 8.68
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.997 3.26 0.65 9.8 5 dry 8.60
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.998 3.16 0.58 9.8 6 dry 12.04
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.998 3.51 0.56 9.4 5 dry 8.10
# Data preprocessing
# Check the data for missing values
df.isnull().sum()
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
sweetness               0
total acid              0
dtype: int64
# Convert the categorical sweetness column into integer dummy (one-hot) columns
sweetness = pd.get_dummies(df['sweetness'], dtype=int)
df = pd.concat([df, sweetness], axis=1)
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality sweetness total acid dry semi-dry semi-sweet
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.998 3.51 0.56 9.4 5 dry 8.10 1 0 0
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.997 3.20 0.68 9.8 5 dry 8.68 1 0 0
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.997 3.26 0.65 9.8 5 dry 8.60 1 0 0
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.998 3.16 0.58 9.8 6 dry 12.04 1 0 0
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.998 3.51 0.56 9.4 5 dry 8.10 1 0 0
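df = df.drop('sweetness', axis=1)
labels = df['quality']
features = df.drop('quality', axis=1)
# Split the original dataset into training and test sets
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.3, random_state=0)
print('Training features shape:', train_features.shape)
print('Training labels shape:', train_labels.shape)
print('Test features shape:', test_features.shape)
print('Test labels shape:', test_labels.shape)
Training features shape: (1119, 15)
Training labels shape: (1119,)
Test features shape: (480, 15)
Test labels shape: (480,)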
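The 1119/480 split follows from `test_size=0.3`: scikit-learn rounds the test share up, so 0.3 × 1599 = 479.7 becomes 480 test rows. A small sketch with placeholder arrays of the same shape (all zeros, assumed purely for illustration) reproduces the sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows and feature columns as the wine data
X = np.zeros((1599, 15))
y = np.zeros(1599)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs evaluate every model on the same 480 wines.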
from sklearn.linear_model import LinearRegression
LR = LinearRegression()
LR.fit(train_features, train_labels)
prediction = LR.predict(test_features)
prediction[:5]  # first five predicted quality scores

array([5.75571751, 4.82871294, 6.59036909, 5.36644662, 5.89993476])

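from sklearn.metrics import mean_squared_error
# Root mean squared error on the test set
RMSE = np.sqrt(mean_squared_error(test_labels, prediction))
print('Linear regression prediction error (RMSE):', RMSE)

Linear regression prediction error (RMSE): 0.6332278109768246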
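RMSE is expressed in the same units as the quality score, so an error of about 0.63 means predictions are typically off by a bit more than half a point. A useful reference is the error of always predicting the mean score, which equals the standard deviation of the labels (listed as 0.808 for quality across the whole dataset, so the test-set value should be roughly similar). The sketch below computes both by hand on hypothetical scores, not on the wine data:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical quality scores and model predictions
y_true = [5, 6, 7, 5]
y_pred = [5.5, 5.5, 6.0, 5.0]

model_error = rmse(y_true, y_pred)
# Baseline: always predict the mean of the observed scores
baseline_error = rmse(y_true, np.full(len(y_true), np.mean(y_true)))

print(model_error, baseline_error)
```

A model is only useful to the extent that its RMSE falls below this constant-prediction baseline.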

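# Standardize the training and test features and compare the result
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_features_std = scaler.fit_transform(train_features)
# Reuse the training-set mean and variance; the original run fit a second scaler on the test set
test_features_std = scaler.transform(test_features)
LR = LinearRegression()
LR.fit(train_features_std, train_labels)
prediction = LR.predict(test_features_std)
# Check the prediction error
RMSE = np.sqrt(mean_squared_error(prediction, test_labels))
print('Linear regression prediction error (RMSE):', RMSE)

Linear regression prediction error (RMSE), original run with the test set scaled by its own statistics: 0.6351421172394885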
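One way to keep the scaling consistent between training and prediction is to wrap the scaler and the model in a scikit-learn pipeline. The sketch below uses synthetic data (the feature scales and coefficients are made up) and also illustrates a known property of ordinary least squares: when the scaler is fitted on the training data and reused at prediction time, standardization leaves the predictions unchanged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine features and labels, with very different feature scales
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
y = X @ np.array([0.5, -0.02, 0.003]) + rng.normal(scale=0.1, size=200)
X_train, X_test = X[:150], X[150:]
y_train = y[:150]

# The pipeline fits the scaler on the training data only and reuses it in predict()
scaled_model = make_pipeline(StandardScaler(), LinearRegression())
scaled_model.fit(X_train, y_train)

# Plain linear regression on the raw features for comparison
raw_model = LinearRegression().fit(X_train, y_train)

pred_scaled = scaled_model.predict(X_test)
pred_raw = raw_model.predict(X_test)
print(np.allclose(pred_scaled, pred_raw))
```

Standardization therefore matters mostly for regularized or distance-based models. For plain linear regression, a change in RMSE after scaling usually points to the training and test sets having been scaled differently.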



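from sklearn.ensemble import RandomForestRegressor
RF = RandomForestRegressor()
# The model must be fitted before it can predict
RF.fit(train_features, train_labels)
prediction = RF.predict(test_features)
RMSE = np.sqrt(mean_squared_error(prediction, test_labels))
print('Random forest prediction error (RMSE):', RMSE)

Random forest prediction error (RMSE): 0.6142407237123461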

<bound method BaseEstimator.get_params of RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,max_features='auto', max_leaf_nodes=None,min_impurity_decrease=0.0, min_impurity_split=None,min_samples_leaf=1, min_samples_split=2,min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,oob_score=False, random_state=None, verbose=0, warm_start=False)>
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [100, 200, 300, 400, 500], 'max_depth': [3, 4, 5, 6], 'min_samples_split': [2, 3, 4]}
RF = RandomForestRegressor()
# Exhaustive search over all 60 parameter combinations with 3-fold cross-validation
grid = GridSearchCV(RF, param_grid=param_grid, scoring='neg_mean_squared_error', cv=3, n_jobs=-1)
grid.fit(train_features, train_labels)
GridSearchCV(cv=3, error_score='raise-deprecating',estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,max_features='auto', max_leaf_nodes=None,min_impurity_decrease=0.0, min_impurity_split=None,min_samples_leaf=1, min_samples_split=2,min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,oob_score=False, random_state=None, verbose=0, warm_start=False),fit_params=None, iid='warn', n_jobs=-1,param_grid={'n_estimators': [100, 200, 300, 400, 500], 'max_depth': [3, 4, 5, 6], 'min_samples_split': [2, 3, 4]},pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',scoring='neg_mean_squared_error', verbose=0)
grid.best_params_
{'max_depth': 6, 'min_samples_split': 2, 'n_estimators': 300}
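Besides `best_params_`, a fitted search object exposes the cross-validated score and an already refitted model. Because the scoring is `neg_mean_squared_error`, `best_score_` is a negative MSE and has to be negated before taking the square root. A minimal sketch on synthetic data with a deliberately tiny grid (both are assumptions, chosen to keep the run short):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the wine data
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=120)

small_grid = {'n_estimators': [10, 20], 'max_depth': [2, 4]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=small_grid,
    scoring='neg_mean_squared_error',
    cv=3,
)
search.fit(X, y)

# best_score_ is the mean negative MSE across folds; negate it to get an RMSE
cv_rmse = np.sqrt(-search.best_score_)
# With the default refit=True, best_estimator_ is already refitted on all of X, y
best_model = search.best_estimator_

print(search.best_params_, cv_rmse)
```

Using `best_estimator_` directly avoids copying the winning parameters into a new estimator by hand.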
# Refit a random forest with the best parameters found by the grid search
RF = RandomForestRegressor(n_estimators=300, min_samples_split=2, max_depth=6)
RF.fit(train_features, train_labels)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=6,max_features='auto', max_leaf_nodes=None,min_impurity_decrease=0.0, min_impurity_split=None,min_samples_leaf=1, min_samples_split=2,min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=None,oob_score=False, random_state=None, verbose=0, warm_start=False)
prediction = RF.predict(test_features)
RF_RMSE = np.sqrt(mean_squared_error(prediction, test_labels))
print('Random forest prediction error (RMSE):', RF_RMSE)

Random forest prediction error (RMSE): 0.6153424077044428


from sklearn.ensemble import GradientBoostingRegressor
GBDT = GradientBoostingRegressor()
# The model must be fitted before it can predict
GBDT.fit(train_features, train_labels)
gbdt_prediction = GBDT.predict(test_features)
gbdt_RMSE = np.sqrt(mean_squared_error(gbdt_prediction, test_labels))
print('GBDT prediction error (RMSE):', gbdt_RMSE)

GBDT prediction error (RMSE): 0.6232190669430115

<bound method BaseEstimator.get_params of GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,learning_rate=0.1, loss='ls', max_depth=3, max_features=None,max_leaf_nodes=None, min_impurity_decrease=0.0,min_impurity_split=None, min_samples_leaf=1,min_samples_split=2, min_weight_fraction_leaf=0.0,n_estimators=100, n_iter_no_change=None, presort='auto',random_state=None, subsample=1.0, tol=0.0001,validation_fraction=0.1, verbose=0, warm_start=False)>

Randomized parameter search: RandomizedSearchCV

from sklearn.model_selection import RandomizedSearchCV
GBDT = GradientBoostingRegressor()
# Candidate values for each hyperparameter
learning_rate = [0.01, 0.1, 1, 10]
max_depth = [3, 4, 5, 6]
min_samples_leaf = [1, 2, 4]
min_samples_split = [2, 5, 10]
n_estimators = list(range(100, 600, 100))
random_params_group = {'learning_rate': learning_rate, 'max_depth': max_depth, 'min_samples_leaf': min_samples_leaf, 'min_samples_split': min_samples_split, 'n_estimators': n_estimators}
# Evaluate 100 randomly sampled combinations with 3-fold cross-validation
random_model = RandomizedSearchCV(GBDT, param_distributions=random_params_group, n_iter=100, scoring='neg_mean_squared_error', verbose=2, n_jobs=-1, cv=3, random_state=0)
random_model.fit(train_features, train_labels)
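To see what `n_iter = 100` buys, it helps to compare the size of the full grid with the number of sampled combinations. The sketch below rebuilds the same candidate lists and counts both using scikit-learn's `ParameterGrid` and `ParameterSampler`. It only enumerates parameter settings and fits no models.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same candidate values as in the search above
params = {
    'learning_rate': [0.01, 0.1, 1, 10],
    'max_depth': [3, 4, 5, 6],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': list(range(100, 600, 100)),
}

# Full grid: 4 * 4 * 3 * 3 * 5 combinations
n_total = len(ParameterGrid(params))

# Randomized search draws n_iter settings; with list-valued parameters
# the draws are made without replacement
sampled = list(ParameterSampler(params, n_iter=100, random_state=0))

print(n_total, len(sampled))
```

The randomized search therefore evaluates 100 of 720 possible settings, about 14% of the grid, which is the trade-off it makes against an exhaustive `GridSearchCV`.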

