DATA SCIENCE

Whether you are an aspiring data scientist or a veteran one, this article is for you! We will build a simple regression model in Python. To spice things up a bit, we will not use the ubiquitous Boston Housing dataset; instead, we will use a simple bioinformatics dataset. Specifically, we will use the Delaney solubility dataset, which captures an important physicochemical property in computational drug discovery.

The aspiring data scientist will find the step-by-step tutorial particularly accessible, while the veteran data scientist may find in it a new, challenging dataset on which to try out a state-of-the-art machine learning algorithm or workflow.

1. What Are We Building Today?

A regression model! And we are going to use Python to do that. While we're at it, we are going to use a bioinformatics dataset (technically, it is a cheminformatics dataset) for the model building.

Specifically, we are going to predict the LogS value, which is the logarithm of the aqueous solubility of small molecules. The aqueous solubility value is a relative measure of the ability of a molecule to dissolve in water. It is an important physicochemical property of effective drugs.

What better way to get acquainted with what we are building today than a cartoon illustration!

Cartoon illustration of the schematic workflow of machine learning model building for the cheminformatics dataset, where the target response variable is predicted as a function of the input molecular features. Technically, this procedure is known as quantitative structure-activity relationship (QSAR) modeling. (Drawn by Chanin Nantasenamat)

2. Delaney Solubility Dataset

2.1. Data Understanding

As the name implies, the Delaney solubility dataset comprises the aqueous solubility values, along with the corresponding chemical structures, for a set of 1,144 molecules. For those outside the field of biology, there are some terms that we will spend some time clarifying.

Molecules, sometimes referred to as small molecules or compounds, are chemical entities made up of atoms. Let's use an analogy here and think of atoms as Lego blocks, where 1 atom is 1 Lego block. When we use several Lego blocks to build something, whether it be a house, a car or some abstract entity, the constructed entity is comparable to a molecule. Thus, we can refer to the specific arrangement and connectivity of the atoms that form a molecule as its chemical structure.

Analogy of the construction of molecules to Lego blocks. This yellow house is from Lego 10703 Creative Builder Box. (Drawn by Chanin Nantasenamat)

So how do the entities that you are building differ? Well, they differ by the spatial connectivity of the blocks (i.e. how the individual blocks are connected). In chemical terms, molecules differ by their chemical structures. Thus, if you alter the connectivity of the blocks, you effectively alter the entity that you are building. For molecules, if atom types (e.g. carbon, oxygen, nitrogen, sulfur, phosphorus, fluorine, chlorine, etc.) or groups of atoms (e.g. hydroxy, methoxy, carboxy, ether, etc.) are altered, then the molecule is altered as well and becomes a new chemical entity (i.e. a new molecule is produced).

Cartoon illustration of a molecular model. Red, blue, dark gray and white represent oxygen, nitrogen, carbon and hydrogen atoms, while the light gray links connecting the atoms are the bonds. Each atom is comparable to a Lego block. The constructed molecule shown above is comparable to a constructed Lego entity (such as the yellow house shown above in this article). (Drawn by Chanin Nantasenamat)

To become an effective drug, a molecule needs to be taken up and distributed in the human body, and this property is directly governed by its aqueous solubility. Solubility is therefore an important property that researchers take into consideration in the design and development of therapeutic drugs. A potent drug that is unable to reach its desired target owing to poor solubility would be a poor drug candidate.

2.2. Retrieving the Dataset

The aqueous solubility dataset compiled by Delaney in the research paper entitled "ESOL: Estimating Aqueous Solubility Directly from Molecular Structure" is available as a supplementary file. For your convenience, we have also downloaded the entire Delaney solubility dataset and made it available on the Data Professor GitHub.

Preview of the raw version of the Delaney solubility dataset. The full version is available on the Data Professor GitHub.

CODE PRACTICE

Let's get started, shall we?

Fire up Google Colab or your Jupyter Notebook and run the following code cells.

CODE EXPLANATION

Let's now go over what each code cell means.

The first code cell:

  • As the code literally says, we import the pandas library as pd.

The second code cell:

  • Assigns the URL where the Delaney solubility dataset resides to the delaney_url variable.

  • Reads in the Delaney solubility dataset via the pd.read_csv() function and assigns the resulting dataframe to the delaney_df variable.

  • Calls the delaney_df variable to display the dataframe, which contains the following 4 columns:

  1. Compound ID: Names of the compounds.

  2. measured log(solubility:mol/L): The experimental aqueous solubility values as reported in the original research article by Delaney.

  3. ESOL predicted log(solubility:mol/L): Predicted aqueous solubility values as reported in the original research article by Delaney.

  4. SMILES: A 1-dimensional encoding of the chemical structure information.
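The two cells described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the exact notebook code: so that it runs without network access, it reads a tiny inline CSV through io.StringIO instead of the GitHub URL, and the two rows (names and numbers) are made-up placeholders rather than values from the Delaney dataset. Only the four column names come from the description above. In a notebook, you would assign the real file location to delaney_url and pass that to pd.read_csv() instead.

```python
import io

import pandas as pd

# Placeholder rows for illustration only; NOT values from the Delaney dataset.
# The column names follow the four columns described above.
raw_csv = io.StringIO(
    "Compound ID,measured log(solubility:mol/L),"
    "ESOL predicted log(solubility:mol/L),SMILES\n"
    "ethanol (placeholder),0.0,0.0,CCO\n"
    "benzene (placeholder),-1.0,-1.0,c1ccccc1\n"
)

# With the real file, this call would be pd.read_csv(delaney_url).
delaney_df = pd.read_csv(raw_csv)

print(delaney_df.shape)
print(list(delaney_df.columns))
```

In a notebook, ending a cell with the bare name delaney_df displays the dataframe as a table, which is what the last bullet above refers to.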

2.3. Calculating the Molecular Descriptors

A point to note is that the above dataset, as provided with the original paper, is not yet usable out of the box. In particular, we have to use the SMILES notation to calculate the molecular descriptors via the rdkit Python library, as demonstrated step by step in a previous Medium article (How to Use Machine Learning for Drug Discovery).

It should be noted that the SMILES notation is a one-dimensional depiction of the chemical structure information of the molecules. Molecular descriptors are quantitative or qualitative descriptions of the unique physicochemical properties of molecules.

Let's think of molecular descriptors as a way to uniquely represent molecules in a numerical form that machine learning algorithms can learn from, make predictions with and use to provide useful knowledge on the structure-activity relationship. As previously noted, the specific arrangement and connectivity of atoms produce different chemical structures, which consequently dictate the activity that the molecules will produce. This notion is known as the structure-activity relationship.

The processed version of the dataset, containing the calculated molecular descriptors along with their corresponding response variable (logS), is shown below. This processed dataset is now ready to be used for machine learning model building, whereby the first 4 variables are used as the X variables and the logS variable is used as the Y variable.

Preview of the processed version of the Delaney solubility dataset. Essentially, the SMILES notation from the raw version was used as input to compute the 4 molecular descriptors, as described in detail in a previous Medium article and YouTube video. The full version is available on the Data Professor GitHub.

A quick description of the 4 molecular descriptors and the response variable is provided below:

  1. cLogP: Octanol-water partition coefficient

  2. MW: Molecular weight

  3. RB: Number of rotatable bonds

  4. AP: Aromatic proportion = number of aromatic atoms / total number of heavy atoms

  5. LogS: Log of the aqueous solubility
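For readers who want to see how the four descriptors listed above might be computed from a SMILES string, here is a minimal sketch using rdkit. It assumes rdkit is installed; compute_descriptors is a hypothetical helper name introduced here for illustration, and the earlier Medium article remains the reference for the full procedure. The aromatic proportion follows the formula given in the list: aromatic atoms divided by heavy atoms.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def compute_descriptors(smiles):
    """Return cLogP, MW, RB and AP for one SMILES string (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")

    # Aromatic proportion = aromatic atoms / heavy (non-hydrogen) atoms.
    n_aromatic = sum(atom.GetIsAromatic() for atom in mol.GetAtoms())
    n_heavy = Descriptors.HeavyAtomCount(mol)

    return {
        "MolLogP": Descriptors.MolLogP(mol),
        "MolWt": Descriptors.MolWt(mol),
        "NumRotatableBonds": Descriptors.NumRotatableBonds(mol),
        "AromaticProportion": n_aromatic / n_heavy,
    }


benzene = compute_descriptors("c1ccccc1")  # every heavy atom is aromatic
ethanol = compute_descriptors("CCO")       # no aromatic atoms
print(benzene["AromaticProportion"], ethanol["AromaticProportion"])
```

The dictionary keys were chosen to match the column names of the processed dataset, so a list of such dictionaries could be passed straight to pd.DataFrame().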

CODE PRACTICE

Let's continue by reading in the CSV file that contains the calculated molecular descriptors.

CODE EXPLANATION

Let's now go over what the code cells mean.

  • Assigns the URL where the Delaney solubility dataset (with calculated descriptors) resides to the delaney_url variable.

  • Reads in the Delaney solubility dataset (with calculated descriptors) via the pd.read_csv() function and assigns the resulting dataframe to the delaney_descriptors_df variable.

  • Calls the delaney_descriptors_df variable to display the dataframe, which contains the following 5 columns:

  1. MolLogP
  2. MolWt
  3. NumRotatableBonds
  4. AromaticProportion
  5. logS

The first 4 columns are molecular descriptors computed using the rdkit Python library. The fifth column is the response variable logS.

3. Data Preparation

3.1. Separating the data as X and Y variables

In building a machine learning model using the scikit-learn library, we need to separate the dataset into the input features (the X variables) and the target response variable (the Y variable).

CODE PRACTICE

Follow along and implement the following 2 code cells to separate the dataset contained in the delaney_descriptors_df dataframe into X and Y subsets.

CODE EXPLANATION

Let's take a look at the 2 code cells.

First code cell:

  • Here we use the drop() function to 'drop' the logS variable (which is the Y variable; we will deal with it in the next code cell). As a result, the 4 remaining variables are assigned to the X dataframe. Specifically, we apply the drop() function to the delaney_descriptors_df dataframe as in delaney_descriptors_df.drop('logS', axis=1), where the first input argument is the column that we want to drop and the second input argument, axis=1, specifies that the first input argument refers to a column.

Second code cell:

  • Here we select a single column (the 'logS' column) from the delaney_descriptors_df dataframe via delaney_descriptors_df.logS and assign it to the Y variable.
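The two cells described above can be sketched as follows. To keep the example self-contained, delaney_descriptors_df is built here from three made-up placeholder rows (an assumption for illustration only); the column names and the drop() and attribute-access calls are the ones named in the text.

```python
import pandas as pd

# Toy stand-in for delaney_descriptors_df; the numbers are placeholders,
# not values from the Delaney dataset.
delaney_descriptors_df = pd.DataFrame({
    "MolLogP": [2.5, 1.0, 3.2],
    "MolWt": [167.8, 92.1, 250.3],
    "NumRotatableBonds": [0, 1, 4],
    "AromaticProportion": [0.0, 0.5, 0.6],
    "logS": [-2.2, -0.8, -4.1],
})

# X: every column except logS (axis=1 means "drop a column", not a row).
X = delaney_descriptors_df.drop("logS", axis=1)

# Y: the logS column on its own, as a pandas Series.
Y = delaney_descriptors_df.logS

print(X.columns.tolist())
print(Y.name)
```

Note that drop() returns a new dataframe and leaves delaney_descriptors_df untouched, which is why logS is still available for the second cell.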

3.2. Data splitting

In evaluating model performance, the standard practice is to split the dataset into 2 (or more) partitions. Here we use an 80/20 split ratio, whereby the 80% subset is used as the train set and the 20% subset as the test set. As scikit-learn requires that the data be further separated into their X and Y components, the train_test_split() function can readily perform this task.

CODE PRACTICE

Let's implement the following 2 code cells.

CODE EXPLANATION

Let's take a look at what the code is doing.

First code cell:

  • Here we import the train_test_split() function from the scikit-learn library.

Second code cell:

  • We start by defining the names of the 4 variables that the train_test_split() function will generate: X_train, X_test, Y_train and Y_test. The first 2 correspond to the X dataframes for the train and test sets, while the last 2 correspond to the Y variables for the train and test sets.
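A minimal sketch of the 80/20 split described above. The data here are synthetic stand-ins generated with numpy (an assumption so the block runs on its own), and random_state=42 is an added choice that makes the shuffle reproducible; neither comes from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (100 rows) so that the 80/20 split is easy to check.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(100, 4)),
    columns=["MolLogP", "MolWt", "NumRotatableBonds", "AromaticProportion"],
)
Y = pd.Series(rng.normal(size=100), name="logS")

# test_size=0.2 keeps 20% of the rows for the test set.
# random_state is an added assumption that makes the shuffle reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Because X and Y are passed in together, each row of X_train stays paired with the matching entry of Y_train, and likewise for the test set.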

4. Linear Regression Model

Now comes the fun part: let's build a regression model.

4.1. Training a linear regression model

CODE PRACTICE

Here, we use the LinearRegression() class from scikit-learn to build a model using ordinary least squares linear regression.

CODE EXPLANATION

Let's see what the code is doing.

First code cell:

  • Here we import linear_model from the scikit-learn library.

Second code cell:

  • We create a linear_model.LinearRegression() object and assign it to the model variable.

  • A model is built using the command model.fit(X_train, Y_train), whereby the model.fit() function takes X_train and Y_train as input arguments to build, or train, a model. Specifically, X_train contains the input features while Y_train contains the response variable (logS).
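The training step can be sketched as below. The training data are synthetic: Y_train is generated from an arbitrary linear rule (true_coef and the 0.25 offset are illustrative assumptions, not properties of the Delaney data) plus a little noise, so that one can see the fitted model recover the rule.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic training data: logS is generated from a known linear rule plus
# a little noise, so the fit can be checked. Placeholder values only.
rng = np.random.default_rng(0)
columns = ["MolLogP", "MolWt", "NumRotatableBonds", "AromaticProportion"]
X_train = pd.DataFrame(rng.normal(size=(80, 4)), columns=columns)
true_coef = np.array([-0.7, -0.5, 0.1, -0.4])  # arbitrary illustrative values
Y_train = X_train.to_numpy() @ true_coef + 0.25 + rng.normal(scale=0.01, size=80)

model = linear_model.LinearRegression()  # ordinary least squares
model.fit(X_train, Y_train)              # learns coef_ and intercept_

print(model.coef_.shape)
```

After fit() returns, the learned parameters live on the model object itself, which is why later steps only need the model variable.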

4.2. Applying the trained model to predict logS for the training and test sets

As mentioned above, model.fit() trains the model, and the resulting trained model is stored in the model variable.

CODE PRACTICE

We will now apply the trained model to make predictions on the training set (X_train).

We will now apply the trained model to make predictions on the test set (X_test).

CODE EXPLANATION

Let's proceed to the explanation.

The following explanation covers only the training set (X_train), as exactly the same concept applies to the test set (X_test) after the following simple tweaks:

  • Replace X_train by X_test

  • Replace Y_train by Y_test

  • Replace Y_pred_train by Y_pred_test

Everything else is exactly the same.

First code cell:

  • Predictions of the logS values are made by calling model.predict() with X_train as the input argument, i.e. we run the command model.predict(X_train). The resulting predicted values are assigned to the Y_pred_train variable.

Second code cell:

Model performance metrics are now printed.

  • Regression coefficient values are obtained from model.coef_.

  • The y-intercept value is obtained from model.intercept_.

  • The mean squared error (MSE) is computed using the mean_squared_error() function with Y_train and Y_pred_train as input arguments, i.e. we run mean_squared_error(Y_train, Y_pred_train).

  • The coefficient of determination (also known as R²) is computed using the r2_score() function with Y_train and Y_pred_train as input arguments, i.e. we run r2_score(Y_train, Y_pred_train).
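Here is a minimal sketch of the prediction and metrics cells for the training set. As before, the data are synthetic placeholders (an assumption so the block is runnable on its own), and the variable names mse and r2 are introduced here for convenience.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in data; the linear rule and noise level are arbitrary.
rng = np.random.default_rng(1)
columns = ["MolLogP", "MolWt", "NumRotatableBonds", "AromaticProportion"]
X_train = pd.DataFrame(rng.normal(size=(80, 4)), columns=columns)
Y_train = (
    X_train.to_numpy() @ np.array([-0.7, -0.5, 0.1, -0.4])
    + rng.normal(scale=0.5, size=80)
)

model = linear_model.LinearRegression().fit(X_train, Y_train)

# First cell: predict logS for the rows the model was trained on.
Y_pred_train = model.predict(X_train)

# Second cell: print the model parameters and performance metrics.
mse = mean_squared_error(Y_train, Y_pred_train)
r2 = r2_score(Y_train, Y_pred_train)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Mean squared error (MSE): %.2f" % mse)
print("Coefficient of determination (R^2): %.2f" % r2)
```

For the test set, the same two cells apply with X_test, Y_test and Y_pred_test substituted, as listed in the tweaks above.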

4.3. Printing out the Regression Equation

The equation of a linear regression model is actually the model itself: you can plug in the input feature values and the equation returns the target response value (LogS).

CODE PRACTICE

Let's now print out the regression model equation.

CODE EXPLANATION

First code cell:

  • All the components of the regression model equation are derived from the model variable. The y-intercept and the regression coefficients for LogP, MW, RB and AP are provided in model.intercept_, model.coef_[0], model.coef_[1], model.coef_[2] and model.coef_[3], respectively.

Second code cell:

  • Here we put together the components and print out the equation via the print() function.
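One way to assemble the equation from the fitted model is sketched below. The number of decimal places and the use of the "%+" format (which always prints the sign, so the terms chain together readably) are choices made for this sketch, and the training data are noiseless synthetic placeholders rather than the Delaney descriptors.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Noiseless synthetic data generated from an arbitrary linear rule.
rng = np.random.default_rng(2)
columns = ["MolLogP", "MolWt", "NumRotatableBonds", "AromaticProportion"]
X_train = pd.DataFrame(rng.normal(size=(80, 4)), columns=columns)
Y_train = X_train.to_numpy() @ np.array([-0.7, -0.5, 0.1, -0.4]) + 0.25

model = linear_model.LinearRegression().fit(X_train, Y_train)

# First cell: pull the pieces out of the fitted model as formatted strings.
yintercept = "%.2f" % model.intercept_
LogP = "%+.2f LogP" % model.coef_[0]
MW = "%+.4f MW" % model.coef_[1]
RB = "%+.4f RB" % model.coef_[2]
AP = "%+.2f AP" % model.coef_[3]

# Second cell: assemble the pieces and print the equation.
equation = "LogS = " + " ".join([yintercept, LogP, MW, RB, AP])
print(equation)
```

The order of model.coef_ follows the column order of X_train, so the labels only line up if the columns are in the order MolLogP, MolWt, NumRotatableBonds, AromaticProportion.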

5. Scatter Plot of experimental vs. predicted LogS

We will now visualize the relative distribution of the experimental versus predicted LogS by means of a scatter plot. Such a plot allows us to quickly gauge the model performance.

CODE PRACTICE

In the forthcoming examples, I will show you how to lay out the 2 sub-plots in two different ways: (1) a vertical plot and (2) a horizontal plot.

CODE EXPLANATION

Let's now take a look at the underlying code for implementing the vertical and horizontal plots. Here, I provide 2 options for you to choose from: laying out this multi-plot figure vertically or horizontally.

Import libraries

Both start by importing the necessary libraries, namely matplotlib and numpy. Most of the code uses matplotlib for creating the plot, while the numpy library is used here to add a trend line.

Define figure size

Next, we specify the figure dimensions (the width and height of the figure) via plt.figure(figsize=(5,11)) for the vertical plot and plt.figure(figsize=(11,5)) for the horizontal plot. Specifically, (5,11) tells matplotlib that the figure for the vertical plot should be 5 inches wide and 11 inches tall, while the inverse is used for the horizontal plot.

Define placeholders for the sub-plots

We tell matplotlib that we want 2 rows and 1 column, so that the layout is that of a vertical plot. This is specified by plt.subplot(2, 1, 1), where the input arguments 2, 1, 1 refer to 2 rows, 1 column and the index of the particular sub-plot that we are about to create. In other words, think of the plt.subplot() function as a way of structuring the plot by creating placeholders for the various sub-plots that the figure contains. The second sub-plot of the vertical plot is specified by the value of 2 in the third input argument of the plt.subplot() function, as in plt.subplot(2, 1, 2).

By applying the same concept, the structure of the horizontal plot is created to have 1 row and 2 columns via plt.subplot(1, 2, 1) and plt.subplot(1, 2, 2), which house the 2 sub-plots.

Creating the scatter plot

Now that the general structure of the figure is in place, let's add the data visualizations. The data points are added using the plt.scatter() function, as in plt.scatter(x=Y_train, y=Y_pred_train, c="#7CAE00", alpha=0.3), where x refers to the data column to use for the x axis, y refers to the data column to use for the y axis, c refers to the color to use for the scattered data points and alpha refers to the alpha transparency level (how translucent the scattered data points should be; the lower the number, the more transparent they become).

Adding the trend line

Next, we use the np.polyfit() and np.poly1d() functions from numpy together with the plt.plot() function from matplotlib to create the trend line.

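# Add trendline
# https://stackoverflow.com/questions/26447191/how-to-add-trendline-in-python-matplotlib-dot-scatter-graphs
# Fit a straight line (degree-1 polynomial) to predicted vs. experimental values
z = np.polyfit(Y_train, Y_pred_train, 1)
p = np.poly1d(z)
# Draw the fitted line across the training values
plt.plot(Y_train, p(Y_train), "#F8766D")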

Adding the x and y axes labels

To add labels for the x and y axes, we use the plt.xlabel() and plt.ylabel() functions. Note that for the vertical plot, we omit the x axis label for the top sub-plot (Why? Because it is redundant with the x-axis label for the bottom sub-plot).

Saving the figure

Finally, we save the constructed figure to file using the plt.savefig() function from matplotlib, specifying the file name as the input argument. Lastly, we finish off with plt.show().

plt.savefig('plot_vertical_logS.png')
plt.savefig('plot_vertical_logS.pdf')
plt.show()
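Putting the pieces above together, the vertical layout might look like the sketch below. Several things here are assumptions made for illustration: the experimental and predicted values are synthetic, the second color ("#619CFF") and the axis label texts are placeholders, the two sub-plots are drawn in a loop to avoid repeating code, and the Agg backend is selected so the sketch runs without a display (drop that line and add plt.show() at the end when working in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the experimental and predicted LogS values.
rng = np.random.default_rng(3)
Y_train = rng.normal(loc=-3.0, scale=2.0, size=80)
Y_pred_train = Y_train + rng.normal(scale=0.5, size=80)
Y_test = rng.normal(loc=-3.0, scale=2.0, size=20)
Y_pred_test = Y_test + rng.normal(scale=0.5, size=20)

plt.figure(figsize=(5, 11))  # 5 inches wide, 11 inches tall

panels = [
    (1, Y_train, Y_pred_train, "#7CAE00", "Predicted LogS (train)"),
    (2, Y_test, Y_pred_test, "#619CFF", "Predicted LogS (test)"),
]
for index, y_true, y_pred, color, ylabel in panels:
    plt.subplot(2, 1, index)  # 2 rows, 1 column, sub-plot number `index`
    plt.scatter(x=y_true, y=y_pred, c=color, alpha=0.3)

    # A degree-1 polynomial fit gives a straight trend line.
    z = np.polyfit(y_true, y_pred, 1)
    p = np.poly1d(z)
    plt.plot(y_true, p(y_true), color="#F8766D")

    plt.ylabel(ylabel)

# Only the bottom sub-plot (the last one drawn) gets an x-axis label.
plt.xlabel("Experimental LogS")
plt.savefig("plot_vertical_logS.png")
```

For the horizontal layout, the same sketch applies with figsize=(11, 5) and plt.subplot(1, 2, index).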

VISUAL EXPLANATION

The above section provides a text-based explanation. In this section, we do the same with a visual explanation that uses color highlights to distinguish the different components of the plot.

Visual explanation on creating a scatter plot. Here we color-highlight the specific lines of code and their corresponding plot components. (Drawn by Chanin Nantasenamat)

Need Your Feedback

As an educator, I love to hear how I can improve my content. Please let me know in the comments whether:

  1. the visual illustration is helpful for understanding how the code works,
  2. the visual illustration is redundant and not necessary, OR whether
  3. the visual illustration complements the text-based explanation to help understand how the code works.

About Me

I work full-time as an Associate Professor of Bioinformatics and Head of Data Mining and Biomedical Informatics at a research university in Thailand. In my after-work hours, I'm a YouTuber (AKA the Data Professor) making online videos about data science. In all the tutorial videos that I make, I also share Jupyter notebooks on GitHub (Data Professor GitHub page).

Connect with Me on Social Network

✅ YouTube: http://youtube.com/dataprofessor/
✅ Website: http://dataprofessor.org/ (Under construction)
✅ LinkedIn: https://www.linkedin.com/company/dataprofessor/
✅ Twitter: https://twitter.com/thedataprof
✅ FaceBook: http://facebook.com/dataprofessor/
✅ GitHub: https://github.com/dataprofessor/
✅ Instagram: https://www.instagram.com/data.professor/

Translated from: https://towardsdatascience.com/how-to-build-a-regression-model-in-python-9a10685c7f09

