Study notes, for reference only; corrections welcome.

PS: This blog originally used a mixed Chinese-English format.


Nonlinear Regression Models

K-Nearest Neighbors

The KNN approach simply predicts a new sample using the K-closest samples from the training set.

KNN cannot be cleanly summarized by a model. Instead, its construction is based solely on the individual samples from the training data.

To predict a new sample for regression, KNN identifies that sample's K nearest neighbors in the predictor space. The predicted response for the new sample is then the mean of the K neighbors' responses. Other summary statistics, such as the median, can also be used in place of the mean.
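As a minimal sketch of that prediction rule (assuming NumPy; the helper name `knn_predict` is hypothetical, not from the original text):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the response for x_new as the mean of the responses of its
    K nearest training samples, using Euclidean distance."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]        # indices of the K closest samples
    return y_train[nearest].mean()         # swap in np.median for the median variant

# Toy usage
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array([10.0, 12.0, 30.0, 34.0])
print(knn_predict(X_train, y_train, np.array([1.2, 2.1]), k=2))  # -> 11.0
```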

The basic KNN method as described above depends on how the user defines the distance between samples. Euclidean distance is the most commonly used metric and is defined as follows:
$$\left(\sum_{j=1}^{P}(x_{aj}-x_{bj})^2\right)^{1/2}$$
where $x_a$ and $x_b$ are two individual samples. Minkowski distance is a generalization of Euclidean distance and is defined as:
$$\left(\sum_{j=1}^{P}|x_{aj}-x_{bj}|^q\right)^{1/q}$$

where q > 0. It is easy to see that when q = 2, Minkowski distance reduces to Euclidean distance, and when q = 1 it is equivalent to Manhattan (city-block) distance, a common metric for samples with binary predictors.
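A small sketch of these metrics (the function name is an illustrative assumption):

```python
import numpy as np

def minkowski(xa, xb, q=2.0):
    """Minkowski distance between two samples; q=2 gives Euclidean
    distance, q=1 gives Manhattan distance."""
    return (np.abs(xa - xb) ** q).sum() ** (1.0 / q)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(a, b, q=2))  # 5.0, Euclidean
print(minkowski(a, b, q=1))  # 7.0, Manhattan
```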

Because the KNN method fundamentally depends on distances between samples, the scale of the predictors can have a dramatic influence on those distances.

When predictors are on vastly different scales, the resulting distances are dominated by the predictors with the largest scales.

That is, predictors with the largest scales will contribute most to the distance between samples. To avoid this potential bias and to let each predictor contribute equally to the distance calculation, we recommend centering and scaling all predictors prior to performing KNN.
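One way to do this pre-processing step (a sketch; the key detail is that the training-set statistics are reused when transforming new samples):

```python
import numpy as np

def center_and_scale(X_train, X_new):
    """Standardize each predictor using the training mean and standard
    deviation, then apply the same transformation to new samples."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant predictors
    return (X_train - mu) / sigma, (X_new - mu) / sigma
```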

In addition to the issue of scaling, distances between samples become problematic when one or more predictor values for a sample are missing, since the distance between samples can then no longer be computed.

In this case there are a couple of options. First, either the samples or the predictors with missing values can be excluded from the analysis.

If a predictor contains a sufficient amount of information across the samples, then an alternative approach is to impute the missing data using a naive estimator, such as the mean of the predictor, or a nearest-neighbor approach that uses only the predictors with complete information.
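A sketch of the simpler of these two options, mean imputation (NaN marks a missing value; `impute_with_mean` is a hypothetical helper):

```python
import numpy as np

def impute_with_mean(X):
    """Replace each missing value (NaN) with the mean of its predictor
    column, computed over the observed entries."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)   # per-column means, ignoring NaN
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan]])
print(impute_with_mean(X))  # NaNs become 2.0 and 3.0
```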

Upon pre-processing the data and selecting the distance metric, the next step is to find the optimal number of neighbors. Like tuning parameters in other models, K can be determined by resampling.

Note that a small K tends to overfit, while a large K tends to underfit. Typically, as K increases, the RMSE first drops quickly, then plateaus, and finally rises slowly; this profile is characteristic of KNN tuning.
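A resampling sketch of this tuning step, assuming scikit-learn is available (the scaler in the pipeline follows the centering-and-scaling advice above; the synthetic data is purely illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

rmse = {}
for k in range(1, 21):
    model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=10,
                             scoring="neg_root_mean_squared_error")
    rmse[k] = -scores.mean()   # cross-validated RMSE for this K

best_k = min(rmse, key=rmse.get)
print(best_k, round(rmse[best_k], 2))
```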

The elementary version of KNN is intuitive and straightforward and can produce decent predictions, especially when the response is dependent on the local predictor structure.

However, this version does have some notable problems, for which researchers have sought solutions. Two commonly noted issues are computational time and the potential disconnect between local structure and the predictive ability of KNN.

The computational-time problem can be addressed with a k-dimensional tree (k-d tree).

A k-d tree orthogonally partitions the predictor space using a tree approach. After the tree has been grown, a new sample is passed through the structure, and distances are computed only for the training observations in the tree that are close to the new sample.
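A sketch using SciPy's k-d tree implementation (assuming `scipy` is available; the data here is synthetic):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 3))
y_train = X_train.sum(axis=1) + rng.normal(scale=0.1, size=10_000)

tree = cKDTree(X_train)             # built once via recursive orthogonal splits
x_new = np.array([[0.1, -0.2, 0.3]])
dist, idx = tree.query(x_new, k=5)  # searches only branches near the query
print(y_train[idx[0]].mean())       # KNN regression prediction
```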

When the local structure of the predictors is unrelated to the response, KNN can perform poorly. Irrelevant or noisy predictors are a particular hazard, because they can push otherwise similar samples away from each other in the predictor space.

Hence, removing irrelevant, noise-laden predictors is a key pre-processing step for KNN.
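One simple filter along these lines (an illustrative assumption, not a method from the original text) keeps only predictors whose correlation with the response exceeds a threshold:

```python
import numpy as np

def filter_by_correlation(X, y, threshold=0.1):
    """Keep only predictors whose absolute Pearson correlation with the
    response exceeds the threshold; a crude relevance filter."""
    keep = [j for j in range(X.shape[1])
            if abs(np.corrcoef(X[:, j], y)[0, 1]) > threshold]
    return X[:, keep], keep
```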

Another approach to enhancing KNN's predictive ability is to weight each neighbor's contribution by its distance to the new sample. In this variation, training samples that are closer to the new sample contribute more to the predicted response, while those farther away contribute less.
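A sketch of this distance-weighted variant (inverse-distance weights are one common choice; scikit-learn exposes the same idea via `KNeighborsRegressor(weights="distance")`):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_new, k=5, eps=1e-8):
    """Inverse-distance-weighted KNN regression: closer neighbors
    receive larger weights in the predicted response."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    w = 1.0 / (dists[nearest] + eps)   # eps avoids division by zero
    return np.sum(w * y_train[nearest]) / w.sum()
```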
