Linear Regression in Python: Sklearn vs Excel

Inside AI

Around 13 years ago, David Cournapeau started Scikit-learn as a Google Summer of Code project. Over time, Scikit-learn became one of the best-known machine learning libraries in Python. It offers a range of classification, regression and clustering algorithms, and its key strength, in my opinion, is seamless integration with NumPy, Pandas and SciPy.

In this article, I will compare the prediction accuracy of multiple linear regression in Scikit-learn with that of Excel. Scikit-learn offers many parameters (known as the hyper-parameters of an estimator) to fine-tune the training of a model and increase the accuracy of its predictions. Excel offers little to tune in its regression algorithm. For a fair comparison, I will train the sklearn regression model with default parameters.
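As a small illustration of what "default parameters" means here, the sketch below lists the constructor parameters of a freshly created LinearRegression estimator. The exact set of parameters depends on the installed Scikit-learn version, so the printed list is not reproduced here.

```python
from sklearn.linear_model import LinearRegression

# Create the estimator without arguments, i.e. with its default settings.
model = LinearRegression()

# get_params() returns the constructor parameters as a dictionary.
default_params = model.get_params()
for name, value in sorted(default_params.items()):
    print(f"{name} = {value}")
```

One of these defaults, fit_intercept=True, is what makes the model estimate an intercept term, which Excel's regression output also reports.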

Objective

This comparison aims to assess the prediction accuracy of linear regression in Excel and in Scikit-learn. I will also touch briefly on how to perform linear regression in Excel.

Sample Data File

For the comparison, we will use 100,000 historical readings of precipitation, minimum temperature, maximum temperature and wind speed, measured several times a day over 8 years.

We will use precipitation, minimum temperature and maximum temperature to predict wind speed. Hence, wind speed is the dependent variable, and the other three are the independent variables.

We will first build a linear regression model in Excel and use it to predict the wind speed. Then we will do the same exercise with Scikit-learn, and finally we will compare the predicted results.

To perform the linear regression in Excel, we will open the sample data file and click the “Data” tab in the Excel ribbon. In the “Data” tab, select the “Data Analysis” option.

Tip: If you do not see the “Data Analysis” option, click File > Options > Add-ins, select “Analysis ToolPak” and click the “Go” button.

Clicking the “Data Analysis” option opens a pop-up window listing the analysis tools available in Excel. We will select “Regression” and then click “OK”.

Another pop-up window then asks for the dependent and independent value ranges. The cell range of wind speed (the dependent variable) goes in the “Input Y Range” field. In “Input X Range” we provide the cell range of the independent variables, i.e. precipitation, minimum temperature and maximum temperature.

We need to select the “Labels” checkbox because the first row of our sample data contains the variable names.

On clicking the “OK” button after specifying the data, Excel builds a linear regression model. You can think of it as the training step (the fit method) in Scikit-learn.

Excel does the calculations and presents the results in a readable format. In our example, Excel fitted the linear regression model with an R Square of 0.953. With 100,000 records in the training dataset, Excel performed the linear regression in less than 7 seconds. Along with other statistical information, it also shows the intercept and the coefficients of the independent variables.
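To make the intercept, coefficients and R Square in the Excel output more concrete, here is a minimal NumPy sketch of an ordinary least-squares fit. It assumes Excel's Regression tool performs a least-squares fit, and it uses synthetic data with made-up "true" coefficients, not the article's weather file.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the weather data (NOT the article's dataset).
n_rows = 1000
X = rng.uniform(0.0, 30.0, size=(n_rows, 3))   # three independent variables
true_coefs = np.array([0.03, 0.4, 0.4])        # assumed values for illustration
y = 2.0 + X @ true_coefs + rng.normal(0.0, 1.0, size=n_rows)

# Ordinary least squares: prepend a column of ones to estimate the intercept.
design = np.column_stack([np.ones(n_rows), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, coefs = beta[0], beta[1:]

# R squared = 1 - (residual sum of squares / total sum of squares).
fitted = design @ beta
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print("intercept:", intercept)
print("coefficients:", coefs)
print("R squared:", r_squared)
```

The estimated coefficients land close to the assumed ones, and R squared reports the share of the variance in y that the fitted line explains.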

Based on the Excel linear regression output, we can put together the following mathematical relationship.

Wind Speed = 2.438 + (Precipitation * 0.026) + (MinTemp * 0.393) + (MaxTemp * 0.395)

We will use this formula to predict the wind speed for the test data set, which the Excel regression model has not seen before.

For example, for the first record in the test data set, Wind Speed = 2.438 + (0.51 * 0.026) + (17.78 * 0.393) + (25.56 * 0.395) = 19.55. (With the rounded coefficients shown here the sum comes to 19.535; the small difference presumably reflects rounding of the coefficients.)
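The same arithmetic can be written as a small Python function. This is only a sketch: the function and constant names are made up for illustration, and the coefficients are the rounded values printed in the article.

```python
# Coefficients as printed in the article (rounded to three decimals).
INTERCEPT = 2.438
COEF_PRECIPITATION = 0.026
COEF_MIN_TEMP = 0.393
COEF_MAX_TEMP = 0.395


def predict_wind_speed(precipitation, min_temp, max_temp):
    """Apply the Excel regression formula to one reading."""
    return (INTERCEPT
            + COEF_PRECIPITATION * precipitation
            + COEF_MIN_TEMP * min_temp
            + COEF_MAX_TEMP * max_temp)


# First record of the test data set, as quoted in the text.
first_row_prediction = predict_wind_speed(0.51, 17.78, 25.56)
print(round(first_row_prediction, 3))  # approximately 19.535
```

Applying the function to every row of the test sheet would reproduce the Excel predictions, up to the rounding of the coefficients.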

Further, we calculated the residuals of the predictions and plotted them to understand their trend. In nearly all cases the predicted wind speed is lower than the actual value, and the higher the wind speed, the larger the prediction error.

Windspeed Actual Vs Excel Linear Regression Residual Scatterplot
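The residual analysis described above can be sketched as follows. The numbers are hypothetical, chosen only to mimic the pattern described in the text, and the residual is assumed to be defined as actual minus predicted.

```python
import numpy as np

# Hypothetical actual and predicted wind speeds (illustrative values only).
actual = np.array([18.0, 22.5, 30.1, 41.7, 55.2])
predicted = np.array([18.4, 21.9, 28.0, 37.5, 47.9])

# Residual convention assumed here: actual minus predicted.
residuals = actual - predicted

# Share of readings where the model predicts less than the actual value.
share_under_predicted = np.mean(residuals > 0)

# Correlation between actual wind speed and residual; a value close to +1
# means the error grows as the wind speed increases.
trend = np.corrcoef(actual, residuals)[0, 1]

print("share under-predicted:", share_under_predicted)
print("trend:", trend)
```

Plotting actual against residuals (for example with matplotlib's scatter function) would give the kind of scatterplot referred to above.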

Let us now delve into linear regression in Scikit-learn.

Step 1- We will import the packages we are going to use for our analysis. The independent variables span different value ranges and do not follow a standard normal distribution, hence we use StandardScaler to standardize them.

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Step 2- Read the training and test data from the Excel files into the Pandas DataFrames Training_data and Test_data respectively.

Training_data = pd.read_excel("Weather.xlsx", sheet_name="Sheet1")
Test_data = pd.read_excel("Weather Test.xlsx", sheet_name="Sheet1")

I will not cover preliminary data quality checks (blank values, outliers, etc.) or the corresponding corrections in this article; I assume that the data series contain no such discrepancies.

Please refer to “How to identify the right independent variables for Machine Learning Supervised Algorithms?” for independent variable selection criteria and correlation analysis.

Step 3- In the code below, we declare all columns except “WindSpeed” as the independent variables and only “WindSpeed” as the dependent variable, for both the training and the test data. Please note that “SourceData_test_dependent” is not used to fit the linear regression; we only use it to compare against the predicted values.

SourceData_train_independent = Training_data.drop(["WindSpeed"], axis=1)  # drop the dependent variable from the training dataset
SourceData_train_dependent = Training_data["WindSpeed"].copy()  # Series with only the dependent variable for the training dataset
SourceData_test_independent = Test_data.drop(["WindSpeed"], axis=1)  # same split for the test dataset
SourceData_test_dependent = Test_data["WindSpeed"].copy()
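To see what this split produces, here is a self-contained sketch on a tiny made-up DataFrame. The column names other than “WindSpeed” and all the values are assumptions for illustration; the real sheet may be laid out differently.

```python
import pandas as pd

# Tiny hypothetical frame; only the "WindSpeed" column name comes from the
# article's code, the other names and all values are assumed.
training_data = pd.DataFrame({
    "Precipitation": [0.51, 0.00, 1.20],
    "MinTemp": [17.78, 15.00, 20.10],
    "MaxTemp": [25.56, 24.30, 29.40],
    "WindSpeed": [19.6, 18.1, 22.3],
})

# Independent variables: every column except the target.
features = training_data.drop(columns=["WindSpeed"])

# Dependent variable: the target column only, copied into its own Series.
target = training_data["WindSpeed"].copy()

print(features.columns.tolist())
print(target.name)
```

drop returns a new DataFrame, so the original frame still contains “WindSpeed” afterwards.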

Step 4- As the independent variable ranges are quite disparate, we scale them to avoid the unintended influence of any one variable. In the code below, the independent training and test variables are scaled and saved to X_train and X_test respectively. The scaler is fitted on the training data only and then applied to the test data. The dependent variable does not need scaling for either training or testing, so it is saved unscaled in y_train and y_test.

sc_X = StandardScaler()
X_train = sc_X.fit_transform(SourceData_train_independent.values)  # fit the scaler on the training data and scale it
y_train = SourceData_train_dependent  # scaling is not required for the dependent variable
X_test = sc_X.transform(SourceData_test_independent.values)  # scale the test data with the training statistics
y_test = SourceData_test_dependent
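The effect of StandardScaler can be checked on a toy example. The arrays below are invented for illustration; the point is that the scaler learns the mean and standard deviation from the training data and reuses them for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented toy data: two independent variables on very different scales.
X_train_raw = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]])
X_test_raw = np.array([[1.0, 40.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_raw)  # learns mean and std from the training data
X_test_scaled = scaler.transform(X_test_raw)        # reuses the training mean and std

print("training means:", scaler.mean_)
print("scaled training column means:", X_train_scaled.mean(axis=0))
print("scaled training column stds:", X_train_scaled.std(axis=0))
print("scaled test row:", X_test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, while the test row is expressed relative to the training statistics and therefore need not have those properties.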

Step 5- Now we feed the independent and dependent training data, i.e. X_train and y_train respectively, to train the linear regression model. We fit the model with default parameters for the reasons mentioned at the start of the article.

reg = LinearRegression().fit(X_train, y_train)
print("The Linear regression score on training data is ", round(reg.score(X_train, y_train), 2))
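For a regressor, the score method returns the coefficient of determination (R squared), which is why it can be compared with the R Square in the Excel output. The sketch below checks this on synthetic data, not on the article's weather file.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)

# Synthetic data for illustration only.
X = rng.normal(size=(200, 3))
y = 1.5 + X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.5, size=200)

reg = LinearRegression().fit(X, y)

# score() on a regressor and r2_score() on its predictions agree.
score_from_method = reg.score(X, y)
score_from_metric = r2_score(y, reg.predict(X))

print(score_from_method, score_from_metric)
```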

The linear regression score on the training data is the same as the one we observed with Excel.
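Because the sklearn model is trained on standardized inputs, its coef_ and intercept_ are not directly comparable with the coefficients in the Excel formula. A possible way to convert them back to the original units is sketched below on synthetic data; the variable names are illustrative and this conversion is not part of the original article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Synthetic stand-in for the weather data (NOT the article's dataset).
X_raw = rng.uniform(0.0, 30.0, size=(500, 3))
y = 2.0 + X_raw @ np.array([0.03, 0.4, 0.4]) + rng.normal(size=500)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)
reg_scaled = LinearRegression().fit(X_scaled, y)

# Undo the standardization x_scaled = (x - mean) / scale:
# each coefficient is divided by its column's scale, and the intercept
# absorbs the shift by the column means.
coefs_original = reg_scaled.coef_ / scaler.scale_
intercept_original = reg_scaled.intercept_ - np.sum(coefs_original * scaler.mean_)

# Reference fit on the unscaled data for comparison.
reg_raw = LinearRegression().fit(X_raw, y)

print(coefs_original, reg_raw.coef_)
print(intercept_original, reg_raw.intercept_)
```

The converted values can then be placed next to the intercept and coefficients reported by Excel.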

Step 6- Finally, we predict the wind speed from the independent variables of the test data set.

predict=reg.predict(X_test)

Based on the predicted wind speed values and the residual scatter plot, we can see that the sklearn predictions are closer to the actual values.

Windspeed Actual Vs Sklearn Linear Regression Residual Scatterplot

Comparing the sklearn and Excel residuals side by side, we can see that both models deviate more from the actual values as the wind speed increases, but sklearn did better than Excel.
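A side-by-side comparison can also be summarized with error metrics rather than read off a plot. The sketch below uses entirely made-up prediction arrays, chosen only to demonstrate the calculation; they are not the article's results, and the helper name summarize is hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values for illustration only; NOT the article's results.
actual = np.array([18.0, 22.5, 30.1, 41.7, 55.2])
excel_predicted = np.array([18.4, 21.9, 28.0, 37.5, 47.9])
sklearn_predicted = np.array([18.2, 22.1, 29.0, 39.8, 51.6])


def summarize(y_true, y_pred):
    """Return (mean absolute error, root mean squared error)."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return mae, rmse


excel_mae, excel_rmse = summarize(actual, excel_predicted)
sklearn_mae, sklearn_rmse = summarize(actual, sklearn_predicted)

print("Excel   MAE, RMSE:", excel_mae, excel_rmse)
print("sklearn MAE, RMSE:", sklearn_mae, sklearn_rmse)
```

Lower values indicate predictions closer to the actual readings; RMSE penalizes large errors, such as those at high wind speeds, more heavily than MAE.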

On a different note, Excel did predict wind speeds in a value range similar to sklearn's. If an approximate linear regression model is good enough for your business case, Excel is a very good option for predicting values quickly.

Actual Windspeed vs Residual Scatter Plot

The takeaway of this exercise is not that Excel can perform linear regression prediction at the same accuracy level as sklearn. We can improve the sklearn prediction accuracy massively by fine-tuning the parameters, and sklearn is better equipped to handle complex models. For quick and approximate prediction use cases, Excel is a very good alternative with acceptable accuracy.

Translated from: https://towardsdatascience.com/linear-regression-in-python-sklearn-vs-excel-6790187dc9ca
