Study notes, for reference only; corrections welcome.

PS: This blog originally used a mixed Chinese-English format.


Nonlinear Regression Models

K-Nearest Neighbors

The KNN approach simply predicts a new sample using the K-closest samples from the training set.

KNN cannot be cleanly summarized by a model. Instead, its construction is based solely on the individual samples from the training data.

To predict a new sample for regression, KNN identifies that sample's K nearest neighbors in the predictor space. The predicted response for the new sample is then the mean of the K neighbors' responses. Other summary statistics, such as the median, can also be used in place of the mean.
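As a minimal sketch of that prediction rule (assuming NumPy; the helper name `knn_predict` is hypothetical, not from the original text):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the response for x_new as the mean of the responses of its
    K nearest training samples, using Euclidean distance."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]        # indices of the K closest samples
    return y_train[nearest].mean()         # swap in np.median for the median variant

# Toy usage
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array([10.0, 12.0, 30.0, 34.0])
print(knn_predict(X_train, y_train, np.array([1.2, 2.1]), k=2))  # -> 11.0
```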

The basic KNN method as described above depends on how the user defines the distance between samples. Euclidean distance is the most commonly used metric and is defined as follows:
$$\left(\sum_{j=1}^{P}(x_{aj}-x_{bj})^2\right)^{1/2}$$
where $x_a$ and $x_b$ are two individual samples. Minkowski distance is a generalization of Euclidean distance and is defined as:
$$\left(\sum_{j=1}^{P}|x_{aj}-x_{bj}|^q\right)^{1/q}$$

where q > 0. It is easy to see that when q = 2, Minkowski distance reduces to Euclidean distance, and when q = 1 it is equivalent to Manhattan (city-block) distance, a common metric for samples with binary predictors.
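A small sketch of these metrics (the function name is an illustrative assumption):

```python
import numpy as np

def minkowski(xa, xb, q=2.0):
    """Minkowski distance between two samples; q=2 gives Euclidean
    distance, q=1 gives Manhattan distance."""
    return (np.abs(xa - xb) ** q).sum() ** (1.0 / q)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(a, b, q=2))  # 5.0, Euclidean
print(minkowski(a, b, q=1))  # 7.0, Manhattan
```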

Because the KNN method fundamentally depends on distances between samples, the scale of the predictors can have a dramatic influence on those distances.

When predictors are on vastly different scales, the resulting distances are dominated by the predictors with the largest scales.

That is, predictors with the largest scales will contribute most to the distance between samples. To avoid this potential bias and to let each predictor contribute equally to the distance calculation, we recommend centering and scaling all predictors prior to performing KNN.
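One way to do this pre-processing step (a sketch; the key detail is that the training-set statistics are reused when transforming new samples):

```python
import numpy as np

def center_and_scale(X_train, X_new):
    """Standardize each predictor using the training mean and standard
    deviation, then apply the same transformation to new samples."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant predictors
    return (X_train - mu) / sigma, (X_new - mu) / sigma
```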

In addition to the issue of scaling, distances between samples become problematic when one or more predictor values for a sample are missing, since the distance between samples can then no longer be computed.

In this case there are a couple of options. First, either the samples or the predictors with missing values can be excluded from the analysis.

If a predictor contains a sufficient amount of information across the samples, then an alternative approach is to impute the missing data using a naive estimator, such as the mean of the predictor, or a nearest-neighbor approach that uses only the predictors with complete information.
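A sketch of the simpler of these two options, mean imputation (NaN marks a missing value; `impute_with_mean` is a hypothetical helper):

```python
import numpy as np

def impute_with_mean(X):
    """Replace each missing value (NaN) with the mean of its predictor
    column, computed over the observed entries."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)   # per-column means, ignoring NaN
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan]])
print(impute_with_mean(X))  # NaNs become 2.0 and 3.0
```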

Upon pre-processing the data and selecting the distance metric, the next step is to find the optimal number of neighbors. Like tuning parameters in other models, K can be determined by resampling.

Note that a small K tends to overfit, while a large K tends to underfit. Typically, as K increases, the RMSE first drops quickly, then plateaus, and finally rises slowly; this profile is characteristic of KNN tuning.
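A resampling sketch of this tuning step, assuming scikit-learn is available (the scaler in the pipeline follows the centering-and-scaling advice above; the synthetic data is purely illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

rmse = {}
for k in range(1, 21):
    model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=10,
                             scoring="neg_root_mean_squared_error")
    rmse[k] = -scores.mean()   # cross-validated RMSE for this K

best_k = min(rmse, key=rmse.get)
print(best_k, round(rmse[best_k], 2))
```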

The elementary version of KNN is intuitive and straightforward and can produce decent predictions, especially when the response is dependent on the local predictor structure.

However, this version does have some notable problems, for which researchers have sought solutions. Two commonly noted issues are computational time and the potential disconnect between local structure and the predictive ability of KNN.

The computational-time problem can be addressed with a k-dimensional tree (k-d tree).

A k-d tree orthogonally partitions the predictor space using a tree approach. After the tree has been grown, a new sample is passed through the structure, and distances are computed only for the training observations in the tree that are close to the new sample.
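A sketch using SciPy's k-d tree implementation (assuming `scipy` is available; the data here is synthetic):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 3))
y_train = X_train.sum(axis=1) + rng.normal(scale=0.1, size=10_000)

tree = cKDTree(X_train)             # built once via recursive orthogonal splits
x_new = np.array([[0.1, -0.2, 0.3]])
dist, idx = tree.query(x_new, k=5)  # searches only branches near the query
print(y_train[idx[0]].mean())       # KNN regression prediction
```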

When the local structure of the predictors is unrelated to the response, KNN can perform poorly. Irrelevant or noisy predictors are a particular hazard, because they can push otherwise similar samples away from each other in the predictor space.

Hence, removing irrelevant, noise-laden predictors is a key pre-processing step for KNN.
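One simple filter along these lines (an illustrative assumption, not a method from the original text) keeps only predictors whose correlation with the response exceeds a threshold:

```python
import numpy as np

def filter_by_correlation(X, y, threshold=0.1):
    """Keep only predictors whose absolute Pearson correlation with the
    response exceeds the threshold; a crude relevance filter."""
    keep = [j for j in range(X.shape[1])
            if abs(np.corrcoef(X[:, j], y)[0, 1]) > threshold]
    return X[:, keep], keep
```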

Another approach to enhancing KNN's predictive ability is to weight each neighbor's contribution by its distance to the new sample. In this variation, training samples that are closer to the new sample contribute more to the predicted response, while those farther away contribute less.
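A sketch of this distance-weighted variant (inverse-distance weights are one common choice; scikit-learn exposes the same idea via `KNeighborsRegressor(weights="distance")`):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_new, k=5, eps=1e-8):
    """Inverse-distance-weighted KNN regression: closer neighbors
    receive larger weights in the predicted response."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    w = 1.0 / (dists[nearest] + eps)   # eps avoids division by zero
    return np.sum(w * y_train[nearest]) / w.sum()
```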
