Project GitHub: bitcarmanlee easy-algorithm-interview-and-practice
Stars and comments are welcome; let's learn and improve together.

Linear regression and linear classification are used very frequently in real-world work, and mllib naturally provides complete implementations of both families of algorithms. Here we analyze these two families alongside the relevant source code, starting with linear regression.

Without further ado, let's look at the source. Reading the source of an excellent project is a great pleasure in itself. To keep the length manageable, some comments and import statements are omitted.

1. GeneralizedLinearAlgorithm class source code

    /**
     * :: DeveloperApi ::
     * GeneralizedLinearAlgorithm implements methods to train a Generalized Linear Model (GLM).
     * This class should be extended with an Optimizer to create a new GLM.
     */
    @Since("0.8.0")
    @DeveloperApi
    abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
      extends Logging with Serializable {

      protected val validators: Seq[RDD[LabeledPoint] => Boolean] = List()

      /**
       * The optimizer to solve the problem.
       */
      @Since("0.8.0")
      def optimizer: Optimizer

      /** Whether to add intercept (default: false). */
      protected var addIntercept: Boolean = false

      protected var validateData: Boolean = true

      /**
       * In `GeneralizedLinearModel`, only single linear predictor is allowed for both weights
       * and intercept. However, for multinomial logistic regression, with K possible outcomes,
       * we are training K-1 independent binary logistic regression models which requires K-1 sets
       * of linear predictor.
       *
       * As a result, the workaround here is if more than two sets of linear predictors are needed,
       * we construct bigger `weights` vector which can hold both weights and intercepts.
       * If the intercepts are added, the dimension of `weights` will be
       * (numOfLinearPredictor) * (numFeatures + 1) . If the intercepts are not added,
       * the dimension of `weights` will be (numOfLinearPredictor) * numFeatures.
       *
       * Thus, the intercepts will be encapsulated into weights, and we leave the value of intercept
       * in GeneralizedLinearModel as zero.
       */
      protected var numOfLinearPredictor: Int = 1

      /**
       * Whether to perform feature scaling before model training to reduce the condition numbers
       * which can significantly help the optimizer converging faster. The scaling correction will be
       * translated back to resulting model weights, so it's transparent to users.
       * Note: This technique is used in both libsvm and glmnet packages. Default false.
       */
      private var useFeatureScaling = false

      /**
       * The dimension of training features.
       */
      @Since("1.4.0")
      def getNumFeatures: Int = this.numFeatures

      /**
       * The dimension of training features.
       */
      protected var numFeatures: Int = -1

      /**
       * Set if the algorithm should use feature scaling to improve the convergence during optimization.
       */
      private[mllib] def setFeatureScaling(useFeatureScaling: Boolean): this.type = {
        this.useFeatureScaling = useFeatureScaling
        this
      }

      /**
       * Create a model given the weights and intercept
       */
      protected def createModel(weights: Vector, intercept: Double): M

      /**
       * Get if the algorithm uses addIntercept
       */
      @Since("1.4.0")
      def isAddIntercept: Boolean = this.addIntercept

      /**
       * Set if the algorithm should add an intercept. Default false.
       * We set the default to false because adding the intercept will cause memory allocation.
       */
      @Since("0.8.0")
      def setIntercept(addIntercept: Boolean): this.type = {
        this.addIntercept = addIntercept
        this
      }

      /**
       * Set if the algorithm should validate data before training. Default true.
       */
      @Since("0.8.0")
      def setValidateData(validateData: Boolean): this.type = {
        this.validateData = validateData
        this
      }

      /**
       * Run the algorithm with the configured parameters on an input
       * RDD of LabeledPoint entries.
       */
      @Since("0.8.0")
      def run(input: RDD[LabeledPoint]): M = {
        if (numFeatures < 0) {
          numFeatures = input.map(_.features.size).first()
        }

        /**
         * When `numOfLinearPredictor > 1`, the intercepts are encapsulated into weights,
         * so the `weights` will include the intercepts. When `numOfLinearPredictor == 1`,
         * the intercept will be stored as separated value in `GeneralizedLinearModel`.
         * This will result in different behaviors since when `numOfLinearPredictor == 1`,
         * users have no way to set the initial intercept, while in the other case, users
         * can set the intercepts as part of weights.
         *
         * TODO: See if we can deprecate `intercept` in `GeneralizedLinearModel`, and always
         * have the intercept as part of weights to have consistent design.
         */
        val initialWeights = {
          if (numOfLinearPredictor == 1) {
            Vectors.zeros(numFeatures)
          } else if (addIntercept) {
            Vectors.zeros((numFeatures + 1) * numOfLinearPredictor)
          } else {
            Vectors.zeros(numFeatures * numOfLinearPredictor)
          }
        }
        run(input, initialWeights)
      }

      /**
       * Run the algorithm with the configured parameters on an input RDD
       * of LabeledPoint entries starting from the initial weights provided.
       */
      @Since("1.0.0")
      def run(input: RDD[LabeledPoint], initialWeights: Vector): M = {
        if (numFeatures < 0) {
          numFeatures = input.map(_.features.size).first()
        }

        if (input.getStorageLevel == StorageLevel.NONE) {
          logWarning("The input data is not directly cached, which may hurt performance if its"
            + " parent RDDs are also uncached.")
        }

        // Check the data properties before running the optimizer
        if (validateData && !validators.forall(func => func(input))) {
          throw new SparkException("Input validation failed.")
        }

        /**
         * Scaling columns to unit variance as a heuristic to reduce the condition number:
         *
         * During the optimization process, the convergence (rate) depends on the condition number of
         * the training dataset. Scaling the variables often reduces this condition number
         * heuristically, thus improving the convergence rate. Without reducing the condition number,
         * some training datasets mixing the columns with different scales may not be able to converge.
         *
         * GLMNET and LIBSVM packages perform the scaling to reduce the condition number, and return
         * the weights in the original scale.
         * See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
         *
         * Here, if useFeatureScaling is enabled, we will standardize the training features by dividing
         * the variance of each column (without subtracting the mean), and train the model in the
         * scaled space. Then we transform the coefficients from the scaled space to the original scale
         * as GLMNET and LIBSVM do.
         *
         * Currently, it's only enabled in LogisticRegressionWithLBFGS
         */
        val scaler = if (useFeatureScaling) {
          new StandardScaler(withStd = true, withMean = false).fit(input.map(_.features))
        } else {
          null
        }

        // Prepend an extra variable consisting of all 1.0's for the intercept.
        // TODO: Apply feature scaling to the weight vector instead of input data.
        val data =
          if (addIntercept) {
            if (useFeatureScaling) {
              input.map(lp => (lp.label, appendBias(scaler.transform(lp.features)))).cache()
            } else {
              input.map(lp => (lp.label, appendBias(lp.features))).cache()
            }
          } else {
            if (useFeatureScaling) {
              input.map(lp => (lp.label, scaler.transform(lp.features))).cache()
            } else {
              input.map(lp => (lp.label, lp.features))
            }
          }

        /**
         * TODO: For better convergence, in logistic regression, the intercepts should be computed
         * from the prior probability distribution of the outcomes; for linear regression,
         * the intercept should be set as the average of response.
         */
        val initialWeightsWithIntercept = if (addIntercept && numOfLinearPredictor == 1) {
          appendBias(initialWeights)
        } else {
          /** If `numOfLinearPredictor > 1`, initialWeights already contains intercepts. */
          initialWeights
        }

        val weightsWithIntercept = optimizer.optimize(data, initialWeightsWithIntercept)

        val intercept = if (addIntercept && numOfLinearPredictor == 1) {
          weightsWithIntercept(weightsWithIntercept.size - 1)
        } else {
          0.0
        }

        var weights = if (addIntercept && numOfLinearPredictor == 1) {
          Vectors.dense(weightsWithIntercept.toArray.slice(0, weightsWithIntercept.size - 1))
        } else {
          weightsWithIntercept
        }

        /**
         * The weights and intercept are trained in the scaled space; we're converting them back to
         * the original scale.
         *
         * Math shows that if we only perform standardization without subtracting means, the intercept
         * will not be changed. w_i = w_i' / v_i where w_i' is the coefficient in the scaled space, w_i
         * is the coefficient in the original space, and v_i is the variance of the column i.
         */
        if (useFeatureScaling) {
          if (numOfLinearPredictor == 1) {
            weights = scaler.transform(weights)
          } else {
            /**
             * For `numOfLinearPredictor > 1`, we have to transform the weights back to the original
             * scale for each set of linear predictor. Note that the intercepts have to be explicitly
             * excluded when `addIntercept == true` since the intercepts are part of weights now.
             */
            var i = 0
            val n = weights.size / numOfLinearPredictor
            val weightsArray = weights.toArray
            while (i < numOfLinearPredictor) {
              val start = i * n
              val end = (i + 1) * n - { if (addIntercept) 1 else 0 }
              val partialWeightsArray = scaler.transform(
                Vectors.dense(weightsArray.slice(start, end))).toArray
              System.arraycopy(partialWeightsArray, 0, weightsArray, start, partialWeightsArray.size)
              i += 1
            }
            weights = Vectors.dense(weightsArray)
          }
        }

        // Warn at the end of the run as well, for increased visibility.
        if (input.getStorageLevel == StorageLevel.NONE) {
          logWarning("The input data was not directly cached, which may hurt performance if its"
            + " parent RDDs are also uncached.")
        }

        // Unpersist cached data
        if (data.getStorageLevel != StorageLevel.NONE) {
          data.unpersist(false)
        }

        createModel(weights, intercept)
      }
    }

2. Classes that extend GeneralizedLinearAlgorithm

Looking at the class hierarchy in an IDE, we can find the classes that extend GeneralizedLinearAlgorithm; in the mllib package these include LinearRegressionWithSGD, RidgeRegressionWithSGD, LassoWithSGD, LogisticRegressionWithSGD, LogisticRegressionWithLBFGS, and SVMWithSGD.


3. GeneralizedLinearAlgorithm source code analysis

The comment right at the top of the source is important: it tells us directly what this class does and how it is meant to be used:
GeneralizedLinearAlgorithm implements methods to train a Generalized Linear Model (GLM).
In other words, GeneralizedLinearAlgorithm is used to train a generalized linear model (GLM).
This class should be extended with an Optimizer to create a new GLM.
This line matters: the class must be extended by a subclass that supplies an Optimizer, and the result is a new GLM. That is also easy to see from the source itself, because this is an abstract class, and abstract classes exist precisely to be extended and implemented!
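
To make that contract concrete, here is a minimal sketch of such a subclass (my own illustration, not code from the Spark repository; the name MyLinearRegression is made up). It supplies the two abstract members, optimizer and createModel. LBFGS is used here because its constructor is public, while GradientDescent's is package-private:

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.optimization.{LBFGS, LeastSquaresGradient, SquaredL2Updater}
    import org.apache.spark.mllib.regression.{GeneralizedLinearAlgorithm, LinearRegressionModel}

    // A hand-rolled GLM: plug in an Optimizer, tell the base class how to wrap
    // the learned weights into a model, and GeneralizedLinearAlgorithm.run()
    // handles validation, intercept handling, feature scaling and training.
    class MyLinearRegression extends GeneralizedLinearAlgorithm[LinearRegressionModel]
      with Serializable {

      // Satisfies the abstract `optimizer` member: least-squares loss, L2 updater.
      override val optimizer = new LBFGS(new LeastSquaresGradient(), new SquaredL2Updater())
        .setNumIterations(100)
        .setRegParam(0.01)

      // Satisfies the abstract `createModel` member.
      override protected def createModel(weights: Vector, intercept: Double): LinearRegressionModel =
        new LinearRegressionModel(weights, intercept)
    }

Training is then just new MyLinearRegression().setIntercept(true).run(trainingData) for some RDD[LabeledPoint].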

The class starts by defining some attributes (picking the important ones):
validators: note that the type is Seq[RDD[LabeledPoint] => Boolean], i.e. each validator maps the input RDD to a pass/fail Boolean
addIntercept: whether to add an intercept
useFeatureScaling: whether to perform feature scaling
For what feature scaling is for, see this comment:

Whether to perform feature scaling before model training to reduce the condition numbers which can significantly help the optimizer converging faster. The scaling correction will be translated back to resulting model weights, so it's transparent to users. Note: This technique is used in both libsvm and glmnet packages. Default false.

As this comment explains, feature scaling reduces the condition number, which makes the optimization converge faster and more stably. If you are not familiar with condition numbers, see http://blog.csdn.net/bitcarmanlee/article/details/51945271 .

The most important part of the class is the run method.

    if (numFeatures < 0) {
      numFeatures = input.map(_.features.size).first()
    }

This obtains the dimensionality of the features (the number of feature columns).

    if (input.getStorageLevel == StorageLevel.NONE) {
      logWarning("The input data is not directly cached, which may hurt performance if its"
        + " parent RDDs are also uncached.")
    }

    // Check the data properties before running the optimizer
    if (validateData && !validators.forall(func => func(input))) {
      throw new SparkException("Input validation failed.")
    }

These lines check the input samples: a warning is logged if the input RDD is not cached, and the configured validators are run over the data before optimization starts.

    val scaler = if (useFeatureScaling) {
      new StandardScaler(withStd = true, withMean = false).fit(input.map(_.features))
    } else {
      null
    }

This part uses useFeatureScaling to standardize the samples: when it is enabled, a StandardScaler is fitted that divides each column by its standard deviation without subtracting the mean.
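
As a quick illustration (my own toy example, not from the Spark repository), here is what that scaler configuration does to features on very different scales:

    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.{SparkConf, SparkContext}

    object ScalerDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ScalerDemo").setMaster("local[2]"))
        // Two features on wildly different scales; the second column would
        // dominate the condition number if left unscaled.
        val features = sc.parallelize(Seq(
          Vectors.dense(1.0, 1000.0),
          Vectors.dense(2.0, 3000.0),
          Vectors.dense(3.0, 5000.0)))
        // Same configuration as in run(): divide by the column standard
        // deviation, do not subtract the mean.
        val scaler = new StandardScaler(withStd = true, withMean = false).fit(features)
        scaler.transform(features).collect().foreach(println)
        sc.stop()
      }
    }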

    val data =
      if (addIntercept) {
        if (useFeatureScaling) {
          input.map(lp => (lp.label, appendBias(scaler.transform(lp.features)))).cache()
        } else {
          input.map(lp => (lp.label, appendBias(lp.features))).cache()
        }
      } else {
        if (useFeatureScaling) {
          input.map(lp => (lp.label, scaler.transform(lp.features))).cache()
        } else {
          input.map(lp => (lp.label, lp.features))
        }
      }

This part clearly adds the bias term: when addIntercept is set, appendBias appends a constant 1.0 feature to every sample, so the intercept can be learned as just one more weight.
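
For reference, appendBias comes from org.apache.spark.mllib.util.MLUtils; a quick check (e.g. in spark-shell):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.util.MLUtils.appendBias

    // appendBias appends a constant 1.0 as the last element:
    println(appendBias(Vectors.dense(0.5, -1.2))) // [0.5,-1.2,1.0]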

    val initialWeightsWithIntercept = if (addIntercept && numOfLinearPredictor == 1) {
      appendBias(initialWeights)
    } else {
      /** If `numOfLinearPredictor > 1`, initialWeights already contains intercepts. */
      initialWeights
    }

As the variable name suggests, this initializes the weight vector, appending the intercept slot when needed.

    val weightsWithIntercept = optimizer.optimize(data, initialWeightsWithIntercept)

Note: the actual solving happens in this one line! The optimize method of the optimizer computes the weight of every feature.
To stress the important point once more: the real training is this single line of code!
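
To see that contract in isolation, here is a standalone toy run of Optimizer.optimize (my own sketch; the optimizer choice and the numbers are purely illustrative) on (label, features) pairs shaped exactly like the data that run() prepares:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.optimization.{LBFGS, LeastSquaresGradient, SimpleUpdater}
    import org.apache.spark.{SparkConf, SparkContext}

    object OptimizeDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("OptimizeDemo").setMaster("local[2]"))
        // Tiny dataset satisfying y = 2 * x.
        val data = sc.parallelize(Seq(
          (2.0, Vectors.dense(1.0)),
          (4.0, Vectors.dense(2.0)),
          (6.0, Vectors.dense(3.0))))
        val optimizer = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater())
          .setNumIterations(50)
        // The one line that actually trains, exactly as in run().
        val weights = optimizer.optimize(data, Vectors.zeros(1))
        println(weights) // expect a value close to [2.0]
        sc.stop()
      }
    }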

The code that follows post-processes the solved weights: it extracts the intercept weight and the per-feature weights, and if useFeatureScaling was enabled, the weights are transformed back to the original scale (w_i = w_i' / v_i, as the in-code comment shows).

Now look at the last line of the method:

    createModel(weights, intercept)

It calls createModel and returns the model. This is exactly what we said at the beginning: GeneralizedLinearAlgorithm (GLA) exists to produce a new GLM (GeneralizedLinearModel)!

More concretely, how exactly do we obtain a GLM?
Simple: just look at the createModel method.

  protected def createModel(weights: Vector, intercept: Double): M

It really is that simple: pass in the per-feature weights and the intercept, call createModel, and you get a GLM!
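
Putting it together with a built-in subclass (toy data and illustrative parameters of my own choosing): LinearRegressionWithSGD.train drives run(), which finishes with createModel, so a ready-to-use model comes back:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
    import org.apache.spark.{SparkConf, SparkContext}

    object TrainDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("TrainDemo").setMaster("local[2]"))
        // Training data satisfying y = 2 * x.
        val training = sc.parallelize(Seq(
          LabeledPoint(2.0, Vectors.dense(1.0)),
          LabeledPoint(4.0, Vectors.dense(2.0)),
          LabeledPoint(6.0, Vectors.dense(3.0)))).cache()
        // numIterations = 100, stepSize = 0.1; run() ends by calling createModel.
        val model = LinearRegressionWithSGD.train(training, 100, 0.1)
        println(s"weights = ${model.weights}, intercept = ${model.intercept}")
        sc.stop()
      }
    }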

4. GeneralizedLinearModel

    /**
     * :: DeveloperApi ::
     * GeneralizedLinearModel (GLM) represents a model trained using
     * GeneralizedLinearAlgorithm. GLMs consist of a weight vector and
     * an intercept.
     *
     * @param weights Weights computed for every feature.
     * @param intercept Intercept computed for this model.
     */
    abstract class GeneralizedLinearModel @Since("1.0.0") (
        @Since("1.0.0") val weights: Vector,
        @Since("0.8.0") val intercept: Double)
      extends Serializable {

      /**
       * Predict the result given a data point and the weights learned.
       *
       * @param dataMatrix Row vector containing the features for this data point
       * @param weightMatrix Column vector containing the weights of the model
       * @param intercept Intercept of the model.
       */
      protected def predictPoint(dataMatrix: Vector, weightMatrix: Vector, intercept: Double): Double

      /**
       * Predict values for the given data set using the model trained.
       *
       * @param testData RDD representing data points to be predicted
       * @return RDD[Double] where each entry contains the corresponding prediction
       */
      @Since("1.0.0")
      def predict(testData: RDD[Vector]): RDD[Double] = {
        // A small optimization to avoid serializing the entire model. Only the weightsMatrix
        // and intercept is needed.
        val localWeights = weights
        val bcWeights = testData.context.broadcast(localWeights)
        val localIntercept = intercept
        testData.mapPartitions { iter =>
          val w = bcWeights.value
          iter.map(v => predictPoint(v, w, localIntercept))
        }
      }

      /**
       * Predict values for a single data point using the model trained.
       *
       * @param testData array representing a single data point
       * @return Double prediction from the trained model
       */
      @Since("1.0.0")
      def predict(testData: Vector): Double = {
        predictPoint(testData, weights, intercept)
      }

      /**
       * Print a summary of the model.
       */
      override def toString: String = {
        s"${this.getClass.getName}: intercept = ${intercept}, numFeatures = ${weights.size}"
      }
    }

GeneralizedLinearModel is also an abstract class, and a comparatively simple one.
To understand GeneralizedLinearModel, reading the comment at the very top is enough.

GeneralizedLinearModel (GLM) represents a model trained using GeneralizedLinearAlgorithm. GLMs consist of a weight vector and an intercept.

Two pieces of information:
1. A GLM is a model trained by a GeneralizedLinearAlgorithm (GLA).
2. A GLM consists of an intercept weight and the per-feature weights.
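
Since a GLM really is just (weights, intercept), you can even construct one by hand and watch the predictPoint logic do its dot-product-plus-intercept work (a toy check of my own, e.g. in spark-shell):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LinearRegressionModel

    // For linear regression, predictPoint computes weights . x + intercept.
    val model = new LinearRegressionModel(Vectors.dense(2.0), 1.0)
    println(model.predict(Vectors.dense(3.0))) // 2.0 * 3.0 + 1.0 = 7.0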

5. Afterword

To really understand how an algorithm works from end to end, you have to read the source carefully. Reading excellent source code is always rewarding. Keep at it!
