Linear Regression in Machine Learning

There are two types of supervised machine learning tasks: regression and classification.

• Classification — Classification is the process of categorizing a given set of data into classes; it can be performed on both structured and unstructured data. The process starts with predicting the class of given data points. The classes are often referred to as targets, labels, or categories.

Examples — spam detection, identifying defaulters among loan applicants.

• Regression — Regression predicts a continuous value based on historical data; that is, it predicts future values from past observations. Examples — number of COVID-19 patients in July, car sales in 2021.

The function behind this algorithm is the linear equation from algebra: we try to find a linear relationship between two or more variables. If we draw this relationship between two variables in a two-dimensional space, we get a straight line.

It predicts the continuous variable Y based on the given independent variable X. If we plot the independent variable (x) on the x-axis and the dependent variable (y) on the y-axis, the algorithm fits a straight line through the data points.

So, our equation looks like the one below.

Y = θ0 + θ1·X

where Y is the predicted value,

X is the input (a feature or column), and

θ0 and θ1 are the model’s parameters: the first is the intercept (bias) and the second is the slope (weight).

You can see the equation looks like y = mx + c, which we studied in our school curriculum, where m is the slope and c is the intercept.
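
For a quick illustration, here is a minimal sketch (assuming NumPy and scikit-learn are installed; the data is synthetic and the "true" parameters are made up for this example) that fits this equation and reads off θ0 and θ1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))                    # one input feature x
y = 4 + 3 * X[:, 0] + rng.normal(0, 1, 100)     # true θ0 = 4, θ1 = 3, plus noise

model = LinearRegression()
model.fit(X, y)

print("θ0 (intercept / bias):", model.intercept_)  # should be close to 4
print("θ1 (slope / weight):", model.coef_[0])      # should be close to 3
```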

More generally, a linear model makes a prediction by simply computing a weighted sum of the input features, plus a constant called the bias term (also called the intercept term). We can write the equation for n independent variables (features/columns):

ŷ = θ0·x0 + θ1·x1 + θ2·x2 + ⋯ + θn·xn

ŷ is the predicted value.

n is the number of features.

xi is the ith feature value (x0 is always equal to 1).

θj is the jth model parameter (including the bias term θ0 and the feature weights θ1, θ2, ⋯, θn).

We can write the same equation in vector form:

ŷ = hθ(x) = θ · x

• θ is the model’s parameter vector, containing the bias term θ0 and the feature weights θ1 to θn.

• x is the instance’s feature vector, containing x0 to xn, with x0 always equal to 1.

• θ · x is the dot product of the vectors θ and x, which is equal to θ0x0 + θ1x1 + θ2x2 + ⋯ + θnxn.

• hθ is the hypothesis function, using the model parameters θ.

In Machine Learning, vectors are often represented as column vectors, which are 2D arrays with a single column. If θ and x are column vectors, then the prediction is ŷ = θᵀx, where θᵀ is the transpose of θ (rows swapped with columns) and θᵀx is the matrix multiplication of θᵀ and x. It is of course the same prediction, except it is now represented as a single-cell matrix rather than a scalar value.

Transpose — swapping the rows and columns of a matrix.

A is a matrix and Aᵀ is the transpose of matrix A.
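
As a small illustration of the vector form, the following NumPy sketch (the parameter and feature values below are made up for illustration) computes θ · x as a dot product, then repeats the same prediction with column vectors and a transpose:

```python
import numpy as np

theta = np.array([4.0, 3.0, -2.0])   # [θ0, θ1, θ2]: bias term plus two weights
x = np.array([1.0, 5.0, 2.0])        # [x0, x1, x2], with x0 always equal to 1

# Dot product θ·x = θ0x0 + θ1x1 + θ2x2 = 4 + 15 - 4 = 15.0 (a scalar)
print(theta @ x)

# As column vectors (2D arrays with a single column), the same prediction
# becomes the matrix product θᵀx and comes back as a 1x1 matrix:
theta_col = theta.reshape(-1, 1)
x_col = x.reshape(-1, 1)
print(theta_col.T @ x_col)           # [[15.]]
```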

Preparing Data for Linear Regression

For an algorithm, data is the most crucial part. There is a saying, “Garbage in, garbage out”: if you feed in garbage (irrelevant or noisy data), the output of your model will be garbage (inaccurate predictions). Hence, we need the data to be in its best form before feeding it to the model and testing it.

Below are a few techniques to clean, scale, and modify data for linear regression, based on the assumptions the algorithm makes, such as low noise and a linear relationship between the independent and dependent variables.

1-> Linear Assumption: Linear regression assumes that the relationship between your independent variable X (input) and dependent variable Y (output) is linear, or close to linear, for good performance. If the relationship is not linear, a transformation can be applied to the data.

2-> Remove Collinearity: If your independent variables are correlated with each other, we should keep only the one most correlated with Y and remove the rest. Example: if your data has both DOB and Age, we can remove one of them.

3-> Gaussian Distributions: Linear regression will make more reliable predictions if your input and output variables have a Gaussian distribution (bell-shaped curve). You may get some benefit from applying transforms to your variables to make their distribution more Gaussian-looking.

4-> Remove Noise: Linear regression assumes that your input and output variables are not noisy. We need to clean the data before feeding it to the model. This is most important for the output variable: remove outliers in the dependent variable Y if possible.

5-> Rescale Inputs: Linear regression will often make more reliable predictions if you rescale input variables using standardization or normalization (see the sketch below).

Medium story on StandardScaler vs MinMaxScaler (Normalization) — https://medium.com/@amitupadhyay6/standardscaler-and-normalization-with-code-and-graph-ba220025c054
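
As a minimal sketch of point 5 (assuming scikit-learn is installed; the tiny data matrix is made up for illustration), this is how standardization and normalization rescale the same inputs:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])   # two features on very different scales

X_std = StandardScaler().fit_transform(X)   # standardization: mean 0, std 1 per column
X_mm = MinMaxScaler().fit_transform(X)      # normalization: rescaled to [0, 1] per column

print(X_std)
print(X_mm)
```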

Select a Performance Measure

Our model is ready to predict values, but before putting it in production, we need to check its performance. For this purpose, we first need a measure of how well (or poorly) the model fits the training data. The most common performance measure of a regression model is the Root Mean Square Error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with a higher weight on large errors. Therefore, to train a linear regression model, you need to find the value of θ that minimizes the RMSE. In practice, it is simpler to minimize the Mean Squared Error (MSE) than the RMSE, and it leads to the same result.

Root Mean Square Error:

The RMSE is the square root of the variance of the residuals. It indicates the absolute fit of the model to the data: how close the observed data points are to the model’s predicted values. Whereas R-squared is a relative measure of fit, RMSE is an absolute measure of fit.

RMSE = √((1/m) Σᵢ (ŷᵢ − yᵢ)²)

Mean Squared Error:

The mean squared error tells you how close a regression line is to a set of points. It does this by taking the distances from the points to the regression line (these distances are the “errors”) and squaring them. It’s called the mean squared error because you’re finding the average of a set of squared errors.

MSE = (1/m) Σᵢ (ŷᵢ − yᵢ)²

Mean Absolute Error:

Even though the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer another function. For example, suppose there are many outliers in your data. In that case, you may consider using the Mean Absolute Error.

Mean Absolute Error (MAE) is another loss function used for regression models. MAE is the average of the absolute differences between our target and predicted values, so it measures the average magnitude of errors in a set of predictions without considering their direction.

MAE = (1/m) Σᵢ |ŷᵢ − yᵢ|

Figure: error in prediction by linear regression.
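
A minimal sketch (assuming scikit-learn and NumPy are installed; the true and predicted values are made up) that computes all three measures:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)    # average of squared errors
rmse = np.sqrt(mse)                         # RMSE is the square root of MSE
mae = mean_absolute_error(y_true, y_pred)   # average of absolute errors

print(f"MSE:  {mse:.3f}")    # 0.375
print(f"RMSE: {rmse:.3f}")   # 0.612
print(f"MAE:  {mae:.3f}")    # 0.500
```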

• Computing the root of a sum of squares (RMSE) corresponds to the Euclidean norm: it is also called the ℓ2 norm, noted ∥·∥2 (or just ∥·∥).

• Computing the sum of absolutes (MAE) corresponds to the ℓ1 norm, noted ∥·∥1. It is also called the Manhattan norm because it measures the distance between two points when you can only travel along axis-aligned grid lines, like city blocks.

Euclidean and Manhattan Distance

The most common interpretation of R-squared is how well the regression model fits the observed data. For example, an R-squared of 60% indicates that the model explains 60% of the variance in the data. Generally, a higher R-squared indicates a better fit.

R-squared (R²) expresses the degree to which your input variables explain the variation of your output (predicted) variable. So if R-squared is 0.8, it means 80% of the variation in the output variable is explained by the input variables. In simple terms, the higher the R-squared, the more variation is explained by your input variables, and hence the better your model.

R² = 1 − SSres / SStot

where SSres is the residual sum of squared errors of our regression model, and SStot is the total sum of squared errors.

Adjusted R-squared:

However, the problem with R-squared is that it will either stay the same or increase as more variables are added, even if they have no relationship with the output variable. This is where Adjusted R-squared helps: it penalizes you for adding variables that do not improve your existing model.

Hence, if you are building a linear regression on multiple variables, it is always suggested that you use Adjusted R-squared to judge the goodness of the model. If you only have one input variable, R-squared and Adjusted R-squared are exactly the same.

Typically, the more non-significant variables you add to the model, the wider the gap between R-squared and Adjusted R-squared becomes.

Adjusted R² = 1 − (1 − R²)·(n − 1)/(n − p − 1), where n is the number of samples and p is the number of features.

As we know, whenever we add a new feature (input variable), the R² value stays the same or increases. A higher R² usually signals better model performance, but R² also stays the same or increases when the added variable is insignificant, which should not count as an improvement. Hence we use adjusted R². Let’s see both cases.

1-> Adding a relevant feature to the data set — If you add a relevant feature, R² increases, so (1 − R²) becomes a small value. Multiplying the factor (n − 1)/(n − p − 1) by this small value keeps the product {(1 − R²)·(n − 1)/(n − p − 1)} small (for example, 10 × 2 = 20 but 10 × 0.5 = 5), hence 1 − {(1 − R²)·(n − 1)/(n − p − 1)} becomes a bigger number, as we are subtracting a small number from 1: the adjusted R² goes up.

2-> Adding an irrelevant feature to the data set — If you add an irrelevant feature, R² stays the same or increases only slightly, so (1 − R²) shrinks only a little. But the denominator (n − p − 1) decreases as p increases, and dividing the numerator {(1 − R²)·(n − 1)} by a smaller value makes the whole term bigger (e.g., 10/2 = 5 but 10/0.5 = 20). Subtracting this bigger number from 1 reduces the adjusted R², which is exactly what should happen when an irrelevant feature is added to the data set.
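
A minimal sketch of both quantities (assuming scikit-learn is installed; the synthetic data deliberately contains two irrelevant features), computing R² with r2_score and adjusted R² directly from the formula above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.random((n, p))                       # 3 features...
y = 2 * X[:, 0] + rng.normal(0, 0.1, n)      # ...but only the first one matters

model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print("R²:         ", r2)
print("Adjusted R²:", adj_r2)   # slightly lower: the irrelevant features are penalized
```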

  • YouTube Link: Linear Regression → https://www.youtube.com/watch?v=w8S0uTLTaGA

  • Performance measure → https://www.youtube.com/watch?v=eNphW-kjT2I&list=PLbbnl6egUbNhJqmLfwX2eN7XmuqRS8rv8&index=15

  • GitHub Code → https://github.com/amitupadhyay6/My-Python/blob/master/Linear%20Regression%20on%20Boston.ipynb

Translated from: https://medium.com/analytics-vidhya/linear-regression-in-machine-learning-eeee4dbc8bae
