Using Data Science to Understand What Makes Wine Taste Good

Data Science. It’s been touted as the sexiest job of the 21st century. Everyone, from companies to individuals, is trying to understand it and adopt it. And if you’re a programmer, you are most definitely experiencing FOMO (fear of missing out)! The term has only grown more popular over time.

The average salary of a data scientist is over $120,000 in the United States according to Indeed. They also currently have the highest paid jobs, with a median of $60k. Glassdoor also named it the “Best job of 2016” and ranked it the number one job.

But what’s this science that they keep talking about? Keep reading!

Table of Contents

The need for Data Science
Who will get the most out of this tutorial
Getting Started
Data Analysis
Exploring relationships between features and Data Visualization Techniques
Outlier Detection

What’s the need for Data Science?

To put it simply, Data Science helps you be data-driven. Data-driven decisions help companies understand their customers better and build great businesses.

We live in the age of information explosion. Companies collect different kinds of data depending on the type of business they run. For a retail store, for example, the data could be about the kinds of products that its customers buy over time and how much they spend. For Netflix, it could be about which shows most of its users watch or like, along with their demographics.

Business decisions often rely on a lot of intuition and domain knowledge. But as data grows larger and larger, it becomes difficult for us to make sense of it. We simply aren’t equipped with the mental faculties to pore over large data-sets filled with tons of information.

The purpose of Data Science is to tell you a story and help you visualize it.

Using it you can:

Get a lot of insights from the data that could otherwise go undetected
Make faster decisions, because, well, computers are faster than humans after all!
Eliminate a lot of the bias that goes into decision making. Throughout history, humans have always been quite prone to letting their feelings and prejudices cloud their judgement…

But unlike humans, computers don’t need to sit in a business meeting and get into a pissing contest about why a certain decision is better than the others.

Now that we have understood what it’s all about, it’s time to learn it!

Who will get the most out of this tutorial:

People with some basic knowledge of programming who want to understand data science and its applications.
People who find math and statistics a little overwhelming in the beginning.
If you’re even remotely interested in wines, then read it, just for the heck of it!

Let’s get Started!

In this tutorial, you’ll learn how to analyze a wine data-set, observe its features, and extract different insights from it. After finishing this tutorial, you’ll:

Understand how Data Science can be used to analyze and get insights from data.
Become knowledgeable about wine. ;-)

Even if you don’t drink, that’s all right: you’ll still become a budding sommelier, or oenophile (yes, that’s an actual term!).

In the next blog post, you’ll see applied data science in the form of machine learning:

What is ML, and what kinds of problems can be solved using it?
How to train a classifier using ML to tell good wines from bad wines.
Different performance metrics

Know before you read:

I’m assuming that you already have some knowledge of programming. Some knowledge of Python is necessary, so if you know it, you’ll find this tutorial relatively simple. If you don’t, I highly recommend that you check out this free course on Introduction to Python.

Why Python? Because it is fast emerging as the preferred language for data science. It is fairly easy to pick up and learn, and the Python ecosystem has a lot of tools and libraries for building virtually anything, ranging from web servers and packages for machine learning, statistics and deep learning to IoT. Python also has one of the most active communities on the internet, on sites like Stack Overflow.

Some basic knowledge of libraries like numpy and pandas will also be helpful, although not compulsory.

What you’ll need for this tutorial:

Preferably, a Linux-based distro (Ubuntu or Linux Mint) with Python installed.
Install Anaconda. It’s a Python distribution that ships with conda, an open-source package management and environment management system. For training and testing our machine learning models, you will be using a very popular open-source library called scikit-learn.
Download the project files from this repository onto your machine. Then open up a terminal, cd to your project folder, and run pip install -r requirements.txt to install the dependencies.
Alternatively, you could upload the project files to FloydHub and run your code there, without the hassle of setting anything up. I recommend it if you don’t have a Linux-based system.

You’ll be using an IPython (Interactive Python) notebook file to run your code. After downloading the project files, open up your terminal, cd to your project folder, and run jupyter-notebook. This will open a new window in your default browser on port 8888. If you’re using FloydHub, the same notebook file can be run from there as well.

You’ll find two IPython notebook files. Select the one named game-of-wines.ipynb from the list. The other notebook file contains the full source code for this tutorial.

How to use this notebook

The notebook already has some template code and explanations inside cells for you to get started. In a few places, you’ll find that code has already been written for you, just to make it easier. You’ll also find comments and links wherever necessary.

To execute code in a cell, click on it with your mouse, then select the run option on the toolbar of the notebook.

Alright, Cheers! Let’s sip… whoops, study our wine data.

I was searching on the internet the other day for some interesting open-source data. Kaggle has a very active community where you can easily search for different kinds of data-sets and solve challenges. Another awesome place to look for data-sets is the University of California Irvine’s Machine Learning Repository.

The UCI Machine Learning Repository has two sets of wine data. One data-set contains information on red wines, and the other on white wines. Your project folder already contains both of them. These wines were produced in Vinho Verde, a region in the north of Portugal.

First, you’ll import some libraries needed for our data analysis. Click on the cell block, then select the run command to load all of them.

Next, we’ll load our data-set into our notebook and display the first 5 rows. Type this code in the cell block of your notebook and then run it:

# Load the Red Wines dataset
data = pd.read_csv("data/winequality-red.csv", sep=';')

# Display the first five records
display(data.head(n=5))

This prints the first five rows of the data-set as a table.

As you can see, there are 12 columns for each wine in the data-set: 11 measured features plus quality. The last column, quality, is a rating of how good a specific wine was judged to be, on a scale of 0 to 10.
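The sep=';' argument in the loading step matters because the wine files separate their fields with semicolons rather than commas. Here is a minimal sketch of the difference, assuming a made-up two-row sample (the values are illustrative and not copied from the real file):

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same semicolon-separated layout
sample = "fixed acidity;pH;quality\n7.4;3.51;5\n7.8;3.20;5\n"

# The default separator is a comma, so every row is read as one wide column
wrong = pd.read_csv(io.StringIO(sample))

# Passing sep=';' splits the fields into proper columns
right = pd.read_csv(io.StringIO(sample), sep=";")

print(wrong.shape)
print(right.shape)
```

If the table you see in your notebook has a single column with semicolons inside it, a missing sep argument is the likely cause.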

Let’s see if any of these columns have missing information. Type this in the cell block:

data.isnull().any()

The output shows us that no column has missing values.
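To get a feel for what that check reports, here is a small sketch on a toy frame that does contain a gap. The column values are hypothetical; only the isnull, any and sum calls are the point:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the 'pH' column has one missing value
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.0],
    "pH": [3.51, np.nan, 3.26],
})

# True for every column that contains at least one missing value
has_missing = toy.isnull().any()

# Number of missing values per column
missing_counts = toy.isnull().sum()

print(has_missing)
print(missing_counts)
```

Swapping any() for sum() is handy when a column does have gaps and you want to know how many.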

We can get some additional information on our data-set by running:

data.info()

Let’s try performing some preliminary analysis on our wines. For our purposes, let’s consider all wines with ratings of 7 and above to be of very good quality, wines rated 5 and 6 to be of average quality, and wines rated less than 5 to be of insipid quality:

n_wines = data.shape[0]

# Number of wines with quality rating above 6
quality_above_6 = data.loc[(data['quality'] > 6)]
n_above_6 = quality_above_6.shape[0]

# Number of wines with quality rating below 5
quality_below_5 = data.loc[(data['quality'] < 5)]
n_below_5 = quality_below_5.shape[0]

# Number of wines with quality rating between 5 to 6
quality_between_5 = data.loc[(data['quality'] >= 5) & (data['quality'] <= 6)]
n_between_5 = quality_between_5.shape[0]

# Percentage of wines with quality rating above 6
greater_percent = n_above_6*100/n_wines

# Print the results
print("Total number of wine data: {}".format(n_wines))
print("Wines with rating 7 and above: {}".format(n_above_6))
print("Wines with rating less than 5: {}".format(n_below_5))
print("Wines with rating 5 and 6: {}".format(n_between_5))
print("Percentage of wines with quality 7 and above: {:.2f}%".format(greater_percent))

# Some more additional data analysis
display(np.round(data.describe()))
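The same three groups can also be produced in one step with pandas’ cut function. This is only an alternative sketch, not the notebook’s method; the ratings series and the group labels below are made up for illustration:

```python
import pandas as pd

# Hypothetical ratings, standing in for data['quality']
ratings = pd.Series([3, 4, 5, 5, 6, 6, 7, 8])

# Bin edges are right-inclusive: (-1, 4], (4, 6], (6, 10]
groups = pd.cut(
    ratings,
    bins=[-1, 4, 6, 10],
    labels=["insipid", "average", "very good"],
)

# Count how many wines land in each group
counts = groups.value_counts()
print(counts)
```

The explicit filters in the notebook are easier to read when you are starting out; binning becomes more convenient once the number of groups grows.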

You can also view the distributions in quality on a graph:

# Visualize skewed continuous features of original data
vs.distribution(data, "quality")

As you can see, most wines fall into the average quality group. There are fewer wines that are of very high quality and great tasting, and very few wines that aren’t so good.

We can also use the pandas describe method to get useful statistics, such as the mean, median and standard deviation of the features in our data.

Some useful statistics that you should know about:

Mean (Average): Perhaps the most familiar one of all. Just add up all the sample values for a given feature, then divide the sum by the number of samples.
Median: First you arrange all the sample values in numerical order, in a list. The middle number in this list is the median. If the list has an even number of values, the median is the average of the two middle numbers.
Mode: The value that occurs the most in a list of samples.
Range: The difference between the highest and the lowest values in a list.
Standard Deviation: It is used to measure the dispersion of values in a set. First calculate the mean, then subtract the mean from each number in the list and square the result. Then calculate the mean of those squared differences, and finally take the square root of it.
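The definitions above can be tried out with Python’s built-in statistics module. This is a small sketch on a made-up list of values, not on the wine data:

```python
import statistics

# Hypothetical sample values for a single feature
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)
median = statistics.median(values)  # even count: average of the two middle values
mode = statistics.mode(values)
value_range = max(values) - min(values)

# Population standard deviation: square root of the mean squared difference
std_dev = statistics.pstdev(values)

print(mean, median, mode, value_range, std_dev)
```

Note that pstdev follows the recipe described above (dividing by the number of samples). The pandas describe method reports the sample standard deviation, which divides by one less than the number of samples, so its figure will be slightly larger on the same data.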

Now, the next step is to study the features in our data-set in more detail.

The quality of a wine depends on a bunch of chemical properties that affect its taste, aroma and flavor. So yes, even though wine making is considered an art, it’s actually pretty scientific if you think about it.

In Data Science, having domain knowledge can be the key differentiating factor between mediocre and great insights.

Time to get wiser with our wine vocab!

Wines contain varying proportions of sugars, alcohol, organic acids, salts from mineral and organic acids, phenolic compounds, pigments, nitrogenous substances, pectins, gums, mucilage, volatile aromatic compounds (esters, aldehydes and ketones), vitamins, salts and sulfur dioxide.

In wine tasting, the term acidity refers to the fresh, tart and sour attributes of the wine. Three primary acids are found in wine grapes: tartaric, malic and citric acids. Acidity is judged by how well it balances out the sweet and bitter components of the wine, such as tannins.

Fixed Acidity

Titratable acidity, sometimes referred to as fixed acidity, is a measurement of the total concentration of titratable acids and free hydrogen ions present in your wine. Litmus paper can be used to identify whether a given solution is acidic or basic. The most common titratable acids are tartaric, malic, citric and carbonic acid. These acids, along with many more in smaller quantities, either occur naturally in the grapes or are created through the fermentation process.

Volatile Acidity

Volatile acidity is mostly caused by bacteria in the wine creating acetic acid, the acid that gives vinegar its characteristic flavor and aroma, and its byproduct, ethyl acetate. Volatile acidity could be an indicator of spoilage or of errors in the manufacturing process, caused by things like damaged grapes or wine exposed to air. These conditions let acetic acid bacteria enter and thrive, giving rise to unpleasant tastes and smells. Wine experts can often tell this just by smelling it!

Citric Acid

Citric acid is generally found in very small quantities in wine grapes. It acts as a preservative and is added to wines to increase acidity, complement a specific flavor or prevent ferric hazes. It can be added to finished wines to increase acidity and give a “fresh” flavor. Excess addition, however, can ruin the taste.

Residual Sugars

Residual sugar, or RS for short, refers to any natural grape sugars that are left over after fermentation ceases (whether on purpose or not). The juice of wine grapes starts out intensely sweet, and fermentation uses up that sugar as the yeasts feast upon it.

During winemaking, yeast typically converts all the sugar into alcohol, making a dry wine. However, sometimes not all the sugar is fermented by the yeast, leaving some sweetness behind.

To make a wine that tastes good, the key is to have a perfect balance between the sweetness and the sourness in the drink.

Chloride

The amount of chlorides present in a wine is usually an indicator of its “saltiness.” This is usually influenced by the territory where the wine grapes grew, cultivation methods, and also the grape type. Too much saltiness is considered undesirable. The right proportion can make the wine more savory.

Sulphur Dioxide levels

Sulfur dioxide exists in wine in free and bound forms, and the two together are referred to as total SO2. It’s the most common preservative used, usually added by wine makers to protect the wine from the negative effects of exposure to air and oxygen. Wines with added sulphur dioxide usually have “Contains Sulphites” on their labels.

It acts as a sanitizing agent: adding it usually kills unwanted bacteria or yeast that might enter the wine and spoil its taste and aroma. It was first used in winemaking by the Romans, when they discovered that burning sulfur candles inside empty wine vessels kept them fresh and free from vinegar smell. Pretty neat, huh?

Density

Also known as specific gravity, it can be used to measure the alcohol concentration in wines. During fermentation, the sugar in the juice is converted into ethanol, with carbon dioxide as a waste gas. Monitoring the density during the process allows for optimal control of this conversion step for the highest quality wines. Sweeter wines generally have higher densities.

pH

pH stands for power of hydrogen, and it is a measurement of the hydrogen ion concentration in a solution. Generally, solutions with a pH value less than 7 are considered acidic, with some of the strongest acids being close to 0. Solutions above 7 are considered alkaline or basic. The pH value of pure water is 7, as it is neither an acid nor a base.
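pH is defined as the negative base-10 logarithm of the hydrogen ion concentration, which is why a lower pH means a more acidic solution. A quick sketch of that relationship follows; the function name and the second concentration are made up for illustration:

```python
import math


def ph_from_concentration(hydrogen_ion_molarity):
    """Return the pH for a hydrogen ion concentration given in mol/L."""
    return -math.log10(hydrogen_ion_molarity)


# Neutral water has roughly 1e-7 mol/L of hydrogen ions
neutral_ph = ph_from_concentration(1e-7)

# A hypothetical, more acidic solution with 1000 times more hydrogen ions
acidic_ph = ph_from_concentration(1e-4)

print(round(neutral_ph, 2), round(acidic_ph, 2))
```

Because the scale is logarithmic, each drop of one pH unit corresponds to a tenfold increase in hydrogen ion concentration.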

Sulphates

Sulfates are salts of sulfuric acid. They aren’t involved in wine production, but some beer makers use calcium sulfate, also known as brewers’ gypsum, to correct mineral deficiencies in water during the brewing process. It also adds a bit of a “sharp” taste.

Alcohol

Ah yes, alcohol, the key to a successful party! Alcoholic beverages have existed since at least the Neolithic period (10,000 BC). Drinking it in small amounts gives you warm fuzzy feelings inside, and makes you more sociable. Of course, higher doses can also make you pass out.

Booze diligently, my data scientist buddy!

Exploring relationships between features and Visualizations

Now that you have some domain knowledge about wine, it’s time to explore more. Our data-set contains a bunch of features as we saw above, such as alcohol levels, amount of residual sugar and pH value. Some of these features might be dependent on other features, some might not. Some of them might affect our quality ratings too.

In data science or machine learning, it’s quite important to study the features that make up our data and see if there are any correlations between them.

For example, does pH level affect fixed acidity levels? Do volatile acidity levels have anything to do with the quality? Do people find wines with more alcohol content to be tastier or of better quality?

Lucky for us, Python has awesome libraries that do the heavy lifting of providing different kinds of visualizations. Let’s try plotting a scatterplot matrix with our data and observe what it shows us.

pd.plotting.scatter_matrix(data, alpha = 0.3, figsize = (40,40), diagonal = 'kde');

From the above scatterplot we can get some interesting details. For some pairs of features, the relationship appears to be fairly linear. For some others, the distribution appears to be negatively skewed. So this confirms our initial suspicions: there are indeed some interesting co-dependencies between some of the features.

We can plot a heatmap of correlations between features, which will help us get more insights.

correlation = data.corr()
# display(correlation)
plt.figure(figsize=(14, 12))
heatmap = sns.heatmap(correlation, annot=True, linewidths=0, vmin=-1, cmap="RdBu_r")

As you can see, the squares with positive values show direct correlations between features. The higher the values, the stronger these relationships are, and the redder the squares. That means, if one feature increases, the other one also tends to increase, and vice versa.

The squares that have negative values show an inverse correlation. The more negative these values get, the stronger the inverse relationship, and the bluer the squares. This means that if the value of one feature is higher, the value of the other one tends to be lower.

Finally, squares close to zero indicate almost no linear relationship between those pairs of features.
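Each square in the heatmap is a Pearson correlation coefficient, which is what pandas’ corr method computes by default. To make the number less mysterious, here is a sketch that calculates it by hand for three made-up features (the arrays and the pearson helper are hypothetical):

```python
import numpy as np

# Hypothetical measurements for three made-up features
feature_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
feature_b = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # rises with feature_a
feature_c = np.array([9.0, 7.2, 5.1, 2.9, 1.0])  # falls as feature_a rises


def pearson(x, y):
    """Pearson correlation: co-variation scaled by the spread of each input."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )


r_ab = pearson(feature_a, feature_b)
r_ac = pearson(feature_a, feature_c)

print(round(r_ab, 3), round(r_ac, 3))
```

A value near +1 would show up as a deep red square and a value near -1 as a deep blue one. NumPy’s corrcoef function gives the same numbers if you’d rather not write the formula yourself.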

Pretty interesting huh? Let’s explore these co-relationships in more detail.

pH vs. Fixed Acidity

# Visualize the correlation between pH and fixed acidity

# Create a new dataframe containing only the pH and fixed acidity columns
fixedAcidity_pH = data[['pH', 'fixed acidity']]

# Initialize a joint-grid with the dataframe, using the seaborn library
# (newer seaborn versions name this argument height; older ones used size)
gridA = sns.JointGrid(x="fixed acidity", y="pH", data=fixedAcidity_pH, height=6)

# Draw a regression plot in the grid
gridA = gridA.plot_joint(sns.regplot, scatter_kws={"s": 10})

# Draw a distribution plot in the same grid
# (sns.histplot with kde=True replaces the deprecated sns.distplot)
gridA = gridA.plot_marginals(sns.histplot, kde=True)

This scatter-plot shows how the values of pH change with changing fixed acidity levels. We can see that, as fixed acidity levels increase, the pH levels drop. Makes sense, doesn’t it? A lower pH level is, after all, an indicator of high acidity.

Fixed Acidity vs. Citric Acid

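# Create a new dataframe containing only the citric acid and fixed acidity columns
fixedAcidity_citricAcid = data[['citric acid', 'fixed acidity']]

# height replaces the older size argument in newer seaborn versions
g = sns.JointGrid(x="fixed acidity", y="citric acid", data=fixedAcidity_citricAcid, height=6)
g = g.plot_joint(sns.regplot, scatter_kws={"s": 10})

# sns.histplot with kde=True replaces the deprecated sns.distplot
g = g.plot_marginals(sns.histplot, kde=True)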

As the amount of citric acid increases, so do the fixed acidity levels.

Volatile Acidity vs Quality

# Keep only the two columns we want to compare
volatileAcidity_quality = data[['volatile acidity', 'quality']]

fig, axs = plt.subplots(ncols=1, figsize=(10, 6))
sns.barplot(x='quality', y='volatile acidity', data=volatileAcidity_quality, ax=axs)
plt.title('quality VS volatile acidity')
plt.tight_layout()
plt.show()
plt.gcf().clear()

A higher quality is usually associated with low volatile acidity levels. This makes sense and is consistent with our domain knowledge, because volatile acidity is an indicator of spoilage and could give rise to unpleasant aromas.

Alcohol vs. Quality

# Keep only the two columns we want to compare
quality_alcohol = data[['alcohol', 'quality']]

fig, axs = plt.subplots(ncols=1, figsize=(10, 6))
sns.barplot(x='quality', y='alcohol', data=quality_alcohol, ax=axs)
plt.title('quality VS alcohol')
plt.tight_layout()
plt.show()
plt.gcf().clear()

Hmm. Seems like most people generally like wines that contain a higher percentage of alcohol, the ones that make them feel woozy!

Try experimenting with more features on your own in the notebook, and see if they reveal anything. If they are related in some way, what do you think might be the reason? Exploring will reveal more hidden insights.

It helps to remember that correlation does not imply causation. Sometimes when you plot graphs for two features, they might show you a pattern that is just a coincidence.

So why then do we do it? It serves as a helpful check on whether our data-set’s integrity is intact!

For example, we know for sure that pH levels must decrease if the acidity levels increase. But if our graphs show the opposite, then that’s an indicator that something is amiss and our data-set isn’t reliable. And that could make our predictions wrong! Here’s where having domain knowledge again proves to be helpful.
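That kind of integrity check can be written down as a tiny test. The sketch below is one possible way to do it, assuming a made-up toy frame and a hypothetical helper named has_expected_sign; neither comes from the project files:

```python
import pandas as pd

# Hypothetical readings; the real check would use the full wine data
toy = pd.DataFrame({
    "fixed acidity": [6.0, 7.4, 8.9, 10.2, 11.5],
    "pH": [3.60, 3.51, 3.35, 3.20, 3.05],
})


def has_expected_sign(frame, col_a, col_b, expected_sign):
    """Return True when two columns correlate with the expected sign (+1 or -1)."""
    r = frame[col_a].corr(frame[col_b])
    return bool(r * expected_sign > 0)


# Domain knowledge says more acid means lower pH, so we expect a negative sign
acidity_check = has_expected_sign(toy, "fixed acidity", "pH", -1)
print(acidity_check)
```

If a check like this came back False on real data, that would be a hint to inspect how the data was collected or loaded before trusting any model built on it.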

Outlier Detection

Let’s say that in the kingdom of the Lannisters there are about 10,000 adults. Most of them are of average height (above 5 feet), but about 100 of them are dwarfs. These people are outliers, because their heights are extreme values that fall outside the expected range. In other words, an outlier is a data point that is significantly distant from most other data points.

Why do they matter? Because they can sometimes cause a lot of problems in data analysis. Let’s say that you’re trying to calculate the average temperature of 10 randomly selected objects in your room, and nine of them are between 20 and 25 degrees Celsius. But you have left your oven on, and it is at 175 °C. The median temperature will be between 20 and 25 °C, but the mean temperature will be between 35.5 and 40 °C. In this case, the median better reflects the temperature of a randomly sampled object, because it falls within the range you’d expect in your room.
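The oven example is easy to reproduce. In this sketch the nine room-temperature readings are made up, chosen only so that they fall between 20 and 25 °C:

```python
import statistics

# Nine hypothetical room-temperature objects (in °C) plus the oven at 175 °C
temperatures = [20, 21, 21, 22, 22, 23, 24, 24, 25, 175]

mean_temp = statistics.mean(temperatures)
median_temp = statistics.median(temperatures)

print(mean_temp, median_temp)
```

A single extreme value drags the mean well above anything actually found in the room, while the median barely moves.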

Hence, finding outliers is critical, because very small or very large values can negatively affect our data analysis, and consequently our predictions. So sometimes, it becomes necessary to remove them.

Tukey’s Method for Detecting Outliers

I read a really nice post on this method that you could read later. But in short, here is how the technique works:

First, you divide the sorted data into four intervals, so that each section contains about 25% of the total data points. The values at which these intervals are split are called quartiles.
Then you subtract the 1st quartile (Q1) from the 3rd quartile (Q3) to get the interquartile range (IQR). That range covers the middle 50% of the data.
Any data point that lies more than 1.5 times the IQR below Q1 or above Q3 is considered an outlier.

Run the following in the next code block to print out outliers for all the features in your data-set.

# For each feature find the data points with extreme high or low values
for feature in data.keys():

    # Calculate Q1 (25th percentile of the data) for the given feature
    Q1 = np.percentile(data[feature], q=25)

    # Calculate Q3 (75th percentile of the data) for the given feature
    Q3 = np.percentile(data[feature], q=75)

    # Use the interquartile range to calculate an outlier step (1.5 times the interquartile range)
    interquartile_range = Q3 - Q1
    step = 1.5 * interquartile_range

    # Display the outliers
    print("Data points considered outliers for the feature '{}':".format(feature))
    display(data[~((data[feature] >= Q1 - step) & (data[feature] <= Q3 + step))])

# OPTIONAL: Select the indices for data points you wish to remove
outliers = []

# Remove the outliers, if any were specified
good_data = data.drop(data.index[outliers]).reset_index(drop = True)
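If you’d like to see the fences on something small first, here is a sketch that applies the same steps to a made-up list of ten values. The array and the variable names are hypothetical:

```python
import numpy as np

# Hypothetical feature values with one obviously extreme reading
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = np.percentile(values, 25)
q3 = np.percentile(values, 75)
step = 1.5 * (q3 - q1)

# Tukey's fences: anything outside [q1 - step, q3 + step] is flagged
lower_fence = q1 - step
upper_fence = q3 + step
flagged = values[(values < lower_fence) | (values > upper_fence)]

print(lower_fence, upper_fence, flagged)
```

Only the value 100 falls outside the fences here. On the wine data, expect each feature to flag a different set of rows, so it is worth looking at which rows are flagged by several features before deciding what to drop.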

And…we’re done!

I hope you had fun with this tutorial. Now you know more about data science than before!

Of course, there is more to it than meets the eye. If you’d like to know more, I highly recommend that you check these out:

Statistics and Probability Courses
Introduction to Python for Data Science
The Best Data Science Courses on the Internet
How to become a Data Scientist

But for now, take a break and then head over to the next tutorial, where you’ll dive into some core machine learning stuff. You’ll learn to build a machine learning model that takes a wine’s attributes and predicts its quality rating!

Using Machine Learning to Predict the Quality of Wines

Also published in my tech blog. Liked what you read? Head over there and subscribe! I won’t waste your time.

Feel free to leave a comment below if you have any questions, or if you’d like a tutorial on a cool topic.

Translated from: https://www.freecodecamp.org/news/using-data-science-to-understand-what-makes-wine-taste-good-669b496c67ee/