k-Nearest Neighbors

Machine Learning models and the curse of dimensionality

There is always a trade-off in life: taking one path usually means compromising on something else. Machine learning models are no different. Classifiers that rely on pairwise distance, such as k-Nearest Neighbors, are heavily affected by a problem known as the “curse of dimensionality”. By the end of this article you will be able to create your own k-Nearest Neighbor model and observe what happens as the number of dimensions of the data set increases. Let’s dig in!

Creating a k-Nearest Neighbor model:

Before we get our hands dirty with the technical part, we need to lay the groundwork for our analysis: the libraries.

Thanks to built-in machine learning packages, this part of the job is quite easy.

Nearest neighbors classifier:

Let’s begin with a simple nearest neighbor classifier for a binary classification task: we have a set of labeled inputs, where the labels are all either 0 or 1. Our goal is to train a classifier to predict a 0 or 1 label for new, unseen test data. One conceptually simple approach is to find the sample in the training data that is “most similar” to our test sample (a “neighbor” in the feature space), and then give the test sample the same label as that “most similar” training sample. This is the nearest neighbors classifier.

After running a few lines of code we can visualize our data set, with training data shown in blue (negative class) and red (positive class). A test sample is shown in green. To keep things simple, I have used a simple linear boundary for classification.

To find the nearest neighbor, we need a distance metric. For our case, I chose the L2 norm (Euclidean distance). The L2 norm has a few perks as a distance metric: assuming we don’t have any outliers, it minimizes the mean cost and treats every feature equally.
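As a minimal sketch of the distance computation, the L2 norm of the difference between two feature vectors can be written in plain Python as follows. The function name `l2_distance` and the sample points are illustrative assumptions, not taken from the original code.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    # Every feature enters the sum the same way: there is no per-feature weight.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Illustrative points: a 3-4-5 right triangle.
d = l2_distance((0.0, 0.0), (3.0, 4.0))
print(d)  # 5.0
```

Because the differences are squared, a single feature with a very large difference can dominate the result, which is why the no-outliers assumption matters.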

The nearest neighbor to the test sample is circled, and its label is applied as the prediction for the test sample:
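The prediction step described here might be sketched as below, using only the Python standard library. The helper `nearest_neighbor_predict` and the toy data are hypothetical and stand in for the data set plotted in the article.

```python
import math

def nearest_neighbor_predict(train_X, train_y, x):
    """Return the label of the training sample closest to x (L2 distance)."""
    best_index = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[best_index]

# Hypothetical toy data: two features per sample, labels 0 or 1.
train_X = [(0.1, 0.2), (0.3, 0.1), (0.8, 0.9), (0.9, 0.7)]
train_y = [0, 0, 1, 1]

prediction = nearest_neighbor_predict(train_X, train_y, (0.2, 0.3))
print(prediction)  # 0, because (0.1, 0.2) is the closest training sample
```

Note that there is no training step beyond storing `train_X` and `train_y`; all of the work happens at prediction time.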

Using nearest neighbor, we classified our test sample as label “0”, but again we assumed that there were no outliers and that the noise was only moderate.

The nearest neighbor classifier works by “memorizing” the training data. One interesting consequence of this is that it will have zero prediction error (or equivalently, 100% accuracy) on the training data, since each training sample’s nearest neighbor is itself:
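A small experiment can make this concrete. The sketch below assumes randomly generated two-dimensional data with a linear boundary and about 10% flipped labels; none of these choices come from the original article. Even with noisy labels, the training accuracy is perfect as long as no two training points coincide.

```python
import math
import random

def nearest_neighbor_predict(train_X, train_y, x):
    """Return the label of the training sample closest to x (L2 distance)."""
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[best]

random.seed(0)
train_X = [(random.random(), random.random()) for _ in range(50)]
# Labels from a linear boundary x1 + x2 > 1, with roughly 10% flipped as noise.
train_y = [int(x1 + x2 > 1.0) ^ int(random.random() < 0.1) for x1, x2 in train_X]

# Each training sample is at distance 0 from itself, so it retrieves its own label.
correct = sum(
    nearest_neighbor_predict(train_X, train_y, x) == y
    for x, y in zip(train_X, train_y)
)
train_accuracy = correct / len(train_X)
print(train_accuracy)  # 1.0
```

Perfect training accuracy here says nothing about accuracy on unseen data: the flipped labels are memorized along with everything else.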

Next, we look to overcome the shortcomings of the nearest neighbor model. The answer lies in the k-Nearest Neighbor classifier.

K nearest neighbors classifier:

To make this approach less sensitive to noise, we might choose to look for multiple training samples similar to each new test sample, and classify the new test sample using the mode of the labels of those similar training samples. This is k nearest neighbors, where k is the number of “neighbors” that we search for.
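The majority vote described above could look like the following sketch. The function `knn_predict` and the toy data (including one deliberately mislabeled point) are assumptions made for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Predict the most common label among the k training samples closest to x."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    neighbor_labels = [train_y[i] for i in order[:k]]
    return Counter(neighbor_labels).most_common(1)[0][0]

# Hypothetical toy data; the third sample is a noisy label sitting among class 0.
train_X = [(0.1, 0.2), (0.3, 0.1), (0.25, 0.3), (0.8, 0.9), (0.9, 0.7)]
train_y = [0, 0, 1, 1, 1]
x_test = (0.22, 0.28)

pred_k1 = knn_predict(train_X, train_y, x_test, k=1)  # follows the noisy neighbor
pred_k3 = knn_predict(train_X, train_y, x_test, k=3)  # the vote outweighs the noise
print(pred_k1, pred_k3)  # 1 0
```

With binary labels, choosing an odd k avoids tied votes; this sketch does not handle ties in any special way.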

In the following plot, we show the same data as in the previous example. Now, however, the 3 closest neighbors to the test sample are circled, and the mode of their labels is used as the prediction for the new test sample. Feel free to play with the parameter k and observe the changes.

The following image shows a set of test points plotted on top of the training data. The size of each test point indicates the confidence in the label, which we approximate by the proportion of the k neighbors sharing that label.

Bigger dots mean a higher confidence score for those points.
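The confidence score described here, the share of the k neighbors that agree with the predicted label, might be computed as in this sketch. The function name and data are hypothetical, and k is assumed to be no larger than the number of training samples.

```python
import math
from collections import Counter

def knn_predict_with_confidence(train_X, train_y, x, k=3):
    """Return (predicted label, fraction of the k neighbors that share it)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    neighbor_labels = [train_y[i] for i in order[:k]]
    label, count = Counter(neighbor_labels).most_common(1)[0]
    return label, count / k

# Hypothetical toy data, with one noisy label among the class-0 points.
train_X = [(0.1, 0.2), (0.3, 0.1), (0.25, 0.3), (0.8, 0.9), (0.9, 0.7)]
train_y = [0, 0, 1, 1, 1]

mixed = knn_predict_with_confidence(train_X, train_y, (0.22, 0.28), k=3)
clean = knn_predict_with_confidence(train_X, train_y, (0.85, 0.8), k=3)
print(mixed)  # label 0, with two of three neighbors agreeing
print(clean)  # label 1, with all three neighbors agreeing
```

In a plot like the one described, the second value of each pair would drive the marker size.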

Also note that the training error for k nearest neighbors is not necessarily zero (though it can be!), since a training sample may have a different label than the majority of its k closest neighbors.

Feature scaling:

One important limitation of k nearest neighbors is that it does not “learn” anything about which features are most important for determining y. Every feature is weighted equally in finding the nearest neighbor.

The first implication of this is:

If all features are equally important, but they are not all on the same scale, they must be normalized, that is, rescaled onto the interval [0,1]. Otherwise, the features with the largest magnitudes will dominate the total distance.

The second implication is:

Even if some features are more important than others, they will all be considered equally important in the distance calculation. If uninformative features are included, they may dominate the distance calculation.
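For the first implication, a minimal min-max normalization sketch is shown below. The function `min_max_scale` and the age and income features are hypothetical examples, chosen only because their scales differ by several orders of magnitude.

```python
def min_max_scale(X):
    """Rescale every feature (column) of X onto the interval [0, 1]."""
    columns = list(zip(*X))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # A constant feature (hi == lo) carries no information; map it to 0.0.
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in X
    ]

# Hypothetical samples: (age in years, income in dollars).
X = [(25, 40000), (35, 42000), (45, 90000)]
scaled = min_max_scale(X)
print(scaled)
```

Before scaling, the income difference between any two samples swamps the age difference in the L2 distance; after scaling, both features span the same range. In practice the minimum and maximum would be taken from the training data and reused to scale test samples.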

Contrast this with a logistic regression classifier. In logistic regression, the training process involves learning coefficients, and those coefficients weight each feature’s effect on the overall output.

Let’s see how our model performs on an image classification problem. Consider the following images from CIFAR10, a dataset of low-resolution images in ten classes:

The images above show a test sample and two training samples with their distances to the test sample.

The background pixels in the test sample “count” just as much as the foreground pixels, so that the image of the deer is considered a very close neighbor, while the image of the car is not. As stated before, we use the L2 norm, so our model treats every pixel equally, which makes it difficult for nearest neighbor to classify real-world images.

We also see here that Euclidean distance is not a good metric of visual similarity: the frog on the right is almost as similar to the car as the deer in the middle!

K nearest neighbors regression:

K nearest neighbors can also be used for regression, with just a small change: instead of using the mode of the nearest neighbors’ labels to predict the label of a new sample, we use their mean. Consider the following training data:

We can add a test sample, then use k nearest neighbors to predict its value:
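The regression variant might be sketched as follows. The function `knn_regress` and the one-dimensional training data are assumptions for illustration and are not the data plotted in the article.

```python
import math

def knn_regress(train_X, train_y, x, k=3):
    """Predict the mean target value of the k training samples closest to x."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return sum(train_y[i] for i in order[:k]) / k

# Hypothetical 1D training data following y = x^2.
train_X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
train_y = [0.0, 1.0, 4.0, 9.0, 16.0]

# The three nearest neighbors of x = 2.2 are x = 2, 3 and 1 (targets 4, 9 and 1).
prediction = knn_regress(train_X, train_y, (2.2,), k=3)
print(prediction)
```

The only difference from the classifier is the final aggregation step: a mean over the neighbors’ targets instead of a vote over their labels.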

The “curse of dimensionality”:

Classifiers that rely on pairwise distance between points, like the k nearest neighbors methods, are heavily impacted by a problem known as the “curse of dimensionality”. In this section, I will illustrate the problem. We will look at a problem with data uniformly distributed in each dimension of the feature space, and two classes separated by a linear boundary.

We will generate a test point, and show the k nearest neighbors to the test point. We will also show the length (or area, or volume) that we had to search to find those k neighbors, and observe how the radius required to find the nearest neighbors changes as the dimension of the space increases.

Pay special attention to how that length (or area, or volume) changes as we increase the dimensionality of the feature space.

First, let's observe the 1D problem:

Now, the 2D equivalent:

Finally, the 3D equivalent:

We can see that as the dimensionality of the problem grows, the higher-dimensional space is less densely occupied by the training data, and we need to search a larger volume of space to find neighbors of the test point. The pairwise distance between points grows as we add dimensions.
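The growth in distance can be checked with a small simulation. The sketch below assumes 200 training points drawn uniformly from the unit hyper-cube and averages the distance from 100 random test points to their nearest training point; the function name, sample sizes and list of dimensions are arbitrary choices, not values from the article.

```python
import math
import random

def mean_nearest_neighbor_distance(n_train, d, n_test=100, seed=0):
    """Average L2 distance from random test points to their nearest training point."""
    rng = random.Random(seed)
    # Training data uniformly distributed in each of the d dimensions.
    train = [[rng.random() for _ in range(d)] for _ in range(n_train)]
    total = 0.0
    for _ in range(n_test):
        x = [rng.random() for _ in range(d)]
        total += min(math.dist(x, t) for t in train)
    return total / n_test

# Same number of training points, increasing dimensionality.
distances = {d: mean_nearest_neighbor_distance(200, d) for d in (1, 2, 3, 10, 50)}
for d, dist in distances.items():
    print(f"d={d:2d}  mean distance to nearest neighbor: {dist:.3f}")
```

With the number of training points held fixed, the nearest neighbor drifts further away at every step up in dimension.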

In that case, the neighbors may be so far away that they don’t actually have much in common with the test point.

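In general, for data uniformly distributed in a unit hyper-cube, the edge length of the smallest hyper-cube expected to contain all k nearest neighbors of a test point is approximately:

$$\left(\frac{k}{N}\right)^{1/d}$$

for N samples with dimensionality d. The reasoning: a hyper-cube with edge length $\ell$ has volume $\ell^d$, so it is expected to hold $N\ell^d$ of the uniformly distributed samples, and setting that count equal to k gives the expression above.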

From the expression above, we can see that as the number of dimensions increases linearly, the number of training samples must increase exponentially to keep the neighborhood the same size and counter the “curse”.
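To put numbers on this, the sketch below evaluates the expression (k/N)^(1/d) and inverts it to get the number of samples needed for a fixed neighborhood size. Both helper functions and the chosen values (k = 5, N = 1000, edge length 0.1) are illustrative assumptions, and the formula itself assumes uniformly distributed data in a unit hyper-cube.

```python
def neighborhood_edge_length(k, n_samples, d):
    """Edge length (k/N)^(1/d) of the hyper-cube expected to hold k of N samples."""
    return (k / n_samples) ** (1.0 / d)

def samples_needed(k, edge_length, d):
    """Invert the expression: N = k / edge_length^d."""
    return k / edge_length ** d

for d in (1, 2, 3, 10):
    edge = neighborhood_edge_length(5, 1000, d)
    needed = samples_needed(5, 0.1, d)
    print(f"d={d:2d}  edge for N=1000: {edge:.3f}  N needed for edge 0.1: {needed:.3g}")
```

With k = 5 and N = 1000, the neighborhood covers 0.5% of the axis in one dimension but roughly 59% of every axis in ten dimensions. Keeping the edge at 0.1 takes about 50 samples in 1D and about 5 × 10^10 in 10D.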

Alternatively, we can reduce d, either by feature selection or by transforming the data into a lower-dimensional space.

Translated from: https://towardsdatascience.com/k-nearest-neighbors-and-the-curse-of-dimensionality-7d64634015d9
