K Nearest Neighbors Computational Complexity

Algorithm introduction

kNN (k nearest neighbors) is one of the simplest ML algorithms, often taught as one of the first algorithms in introductory courses. It’s relatively simple but quite powerful, although little time is usually spent on understanding its computational complexity and practical issues. It can be used for both classification and regression with the same complexity, so for simplicity we’ll consider the kNN classifier.

kNN is an associative algorithm — during prediction it searches for the nearest neighbors and takes their majority vote as the predicted class for the sample. A training phase may or may not exist at all, since in general we have two possibilities:

  1. Brute force method — compute the distance from the new point to every point in the training data matrix X, sort the distances, take the k nearest, then do a majority vote. There is no need for separate training, so we only consider the prediction complexity.
  2. Using a data structure — organize the training points from X into an auxiliary data structure for faster nearest neighbor lookup. This approach spends additional space and time (for creating the data structure during the training phase) to get faster predictions.

We focus on the methods implemented in Scikit-learn, the most popular ML library for Python. It supports brute force, k-d tree and ball tree data structures. These are relatively simple, efficient, and perfectly suited for the kNN algorithm. The construction of these trees stems from computational geometry, not from machine learning, and does not concern us that much, so I’ll cover it in less detail, more on the conceptual level. For more details, see the links at the end of the original article.

In all the complexities below, the time needed to calculate a distance is omitted, since it is in most cases negligible compared to the rest of the algorithm. Additionally, we denote:

  • n: number of points in the training dataset

  • d: data dimensionality

  • k: number of neighbors that we consider for voting

Brute force method

Training time complexity: O(1)

Training space complexity: O(1)

Prediction time complexity: O(k * n)

Prediction space complexity: O(1)

The training phase technically does not exist, since all computation is done during prediction, so we have O(1) for both time and space.

The prediction phase is, as the method name suggests, a simple exhaustive search, which in pseudocode is:

Loop through all points k times:
    1. Compute the distance between the currently classified sample and
       the training points; remember the index of the element with the
       smallest distance (ignoring previously selected points)
    2. Add the class at the found index to the counter
Return the class with the most votes as the prediction

This is a nested loop structure: the outer loop takes k steps and, for the distance scan in step 1, the inner loop takes n steps. Updating the counter in step 2 is O(1), and the final vote over classes is O(number of classes), so both are dominated by the search. Therefore, we have O(n * k) time complexity.

As for the space complexity, we need a small vector to count the votes for each class. It’s almost always very small and of fixed size, so we can treat it as O(1) space complexity.

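To make this concrete, here is a minimal sketch of the brute force classifier in Python, following the pseudocode above (the function name, the NumPy usage and the Euclidean metric are my choices, not something fixed by the algorithm):

    import numpy as np

    def knn_predict_brute(X_train, y_train, x, k=3):
        # Compute the distance to every training point once, then make
        # k passes over the distances: the O(k * n) prediction time.
        dists = np.linalg.norm(X_train - x, axis=1)
        votes = {}
        for _ in range(k):
            idx = int(np.argmin(dists))        # closest remaining training point
            votes[y_train[idx]] = votes.get(y_train[idx], 0) + 1
            dists[idx] = np.inf                # ignore previously selected points
        return max(votes, key=votes.get)       # class with the most votes

The votes dictionary is the small, fixed-size counter mentioned above, which is why the prediction space complexity stays O(1).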

k-d tree method

Training time complexity: O(d * n * log(n))

Training space complexity: O(d * n)

Prediction time complexity: O(k * log(n))

Prediction space complexity: O(1)

During the training phase, we have to construct the k-d tree. This data structure splits the k-dimensional space (here k means the number of dimensions of the space, don’t confuse it with k as the number of nearest neighbors!) and allows faster search for the nearest points, since we “know where to look” in that space. You may think of it as a generalization of a BST to many dimensions. It “cuts” the space with axis-aligned cuts, dividing the points into groups in the children nodes.

Constructing the k-d tree is not a machine learning task itself, since it stems from the computational geometry domain, so we won’t cover it in detail, only on the conceptual level. The time complexity is usually O(d * n * log(n)), because insertion is O(log(n)) (similar to a regular BST) and we have n points from the training dataset, each with d dimensions. I assume an efficient implementation of the data structure, i.e. one that finds the optimal split point (the median in the chosen dimension) in O(n), which is possible with the median of medians algorithm. The space complexity is O(d * n) — note that it depends on the dimensionality d, which makes sense, since more dimensions correspond to more space divisions and a larger tree (in addition to a larger time complexity for the same reason).

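As a toy illustration of the construction, here is a simplified sketch in Python (a dictionary-based tree of my own design, not Scikit-learn’s implementation):

    import numpy as np

    def build_kdtree(points, depth=0):
        # Split on the median along one axis per level, cycling through the axes.
        # Note: argsort is O(n * log(n)) per call, so this toy version is
        # O(n * log^2(n)); an O(n) median-of-medians selection would recover
        # the O(d * n * log(n)) bound stated above.
        if len(points) == 0:
            return None
        axis = depth % points.shape[1]
        points = points[points[:, axis].argsort()]
        median = len(points) // 2
        return {
            "point": points[median],
            "left": build_kdtree(points[:median], depth + 1),
            "right": build_kdtree(points[median + 1:], depth + 1),
        }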

As for the prediction phase, the k-d tree structure naturally supports the “k nearest neighbors” query operation, which is exactly what we need for kNN. The simple approach is to query k times, removing the found point each time — since a single query takes O(log(n)), this is O(k * log(n)) in total. But since the k-d tree already cut the space during construction, after a single query we approximately know where to look — we can just search the “surroundings” of that point. Therefore, practical implementations of the k-d tree support querying for all k neighbors at once, with complexity O(sqrt(n) + k), which is much better for the larger dimensionalities that are very common in machine learning.

The above complexities are the average ones, assuming a balanced k-d tree. The O(log(n)) times assumed above may degrade up to O(n) for unbalanced trees, but if the median is used during tree construction, we should always get a tree with approximately O(log(n)) insertion/deletion/search complexity.

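In practice you rarely build the tree yourself; Scikit-learn exposes the structure directly. A quick sketch with made-up data:

    import numpy as np
    from sklearn.neighbors import KDTree

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 3))      # n = 10,000 training points, d = 3

    tree = KDTree(X)                      # training phase: construct the tree
    dist, ind = tree.query(X[:1], k=3)    # all k neighbors in a single query
    print(ind)                            # indices of the 3 nearest points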

Ball tree method

Training time complexity: O(d * n * log(n))

Training space complexity: O(d * n)

Prediction time complexity: O(k * log(n))

Prediction space complexity: O(1)

The ball tree algorithm takes another approach to dividing the space where the training points lie. In contrast to k-d trees, which divide the space with median-value “cuts”, the ball tree groups points into “balls” organized into a tree structure. They go from the largest (the root, with all points) to the smallest (the leaves, with only a few or even one point). It allows fast nearest neighbor lookup, because near neighbors end up in the same, or at least nearby, “balls”.

During the training phase, we only need to construct the ball tree. There are a few algorithms for constructing it, but the one most similar to the k-d tree (called the “k-d construction algorithm” for that reason) is O(d * n * log(n)), the same as the k-d tree.

Because of this similarity in tree construction, the complexities of the prediction phase are also the same as for the k-d tree.

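Scikit-learn’s BallTree has the same interface as KDTree, so under the assumptions of the previous sketch, switching structures is a one-line change:

    import numpy as np
    from sklearn.neighbors import BallTree

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 3))

    tree = BallTree(X)                    # same O(d * n * log(n)) construction
    dist, ind = tree.query(X[:1], k=3)    # same query interface as KDTree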

Choosing the method in practice

To summarize the complexities: brute force is the slowest in big O terms, while the k-d tree and the ball tree share the same, lower complexity. How do we know which one to use, then?

To get the answer, we have to look at both training and prediction times — that’s why I have provided both. The brute force algorithm has only one complexity, for prediction: O(k * n). The other algorithms need to create the data structure first, so for training and prediction together they get O(d * n * log(n) + k * log(n)), not taking into account the space complexity, which may also be important. Therefore, where the trees have to be constructed frequently, the training cost may outweigh their advantage of faster nearest neighbor lookup.

Should we use a k-d tree or a ball tree? It depends on the structure of the data — relatively uniform or “well behaved” data will make better use of a k-d tree, since the cuts of the space will work well (near points will end up close in the leaves after all the cuts). For more clustered data, the “balls” of the ball tree will reflect the structure better and therefore allow faster nearest neighbor search. Fortunately, Scikit-learn supports an “auto” option, which automatically infers the best data structure from the data.

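In Scikit-learn this choice is just the algorithm parameter of the kNN estimators, with “auto” as the default. A small sketch on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

    # algorithm is one of "brute", "kd_tree", "ball_tree" or "auto";
    # "auto" infers the best structure from the data, as described above.
    clf = KNeighborsClassifier(n_neighbors=3, algorithm="auto")
    clf.fit(X, y)                         # builds the chosen structure
    print(clf.predict(X[:5]))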

Let’s see this in practice with two case studies that I’ve encountered during my studies and work.

Case study 1: classification

The more “traditional” application of kNN is classification of data. It often involves quite a lot of points, e.g. MNIST has 60k training images and 10k test images. Classification is done offline, which means we first do the training phase and then just use its results during prediction. Therefore, if we want to construct the data structure, we only need to do so once. Let’s compare the brute force method (which calculates all distances every time) and the k-d tree per test image, for 3 neighbors and, to keep the numbers round, n = 10,000 training points:

Brute force (O(k * n)): 3 * 10,000 = 30,000

k-d tree (O(k * log(n))): 3 * log(10,000) ≈ 3 * 13 = 39

Comparison: 39 / 30,000 = 0.0013

As you can see, the performance gain is huge! The data structure method uses only a tiny fraction of the brute force time. For most datasets this method is a clear winner.

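If you want to verify this on your own machine, a quick benchmark could look like the sketch below (the data is random and the numbers are hardware-dependent; also note that in high-dimensional spaces the tree methods lose much of their edge):

    import time

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.random((60_000, 20))          # stand-in for a training set
    y = rng.integers(0, 10, size=60_000)
    X_test = rng.random((10_000, 20))

    for algo in ("brute", "kd_tree"):
        clf = KNeighborsClassifier(n_neighbors=3, algorithm=algo).fit(X, y)
        start = time.perf_counter()
        clf.predict(X_test)
        print(algo, f"{time.perf_counter() - start:.2f}s")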

Case study 2: real-time smart monitoring

Machine Learning is commonly used for image recognition, often with neural networks. It’s very useful for real-time applications, where it’s often integrated with cameras, alarms etc. The problem with neural networks is that they often detect the same object two or more times — even the best architectures, like YOLO, have this problem. We can actually solve this with nearest neighbor search, using a simple approach:

  1. Calculate the center of each bounding box (rectangle)
  2. For each rectangle, search for the nearest neighbor among the other centers (1NN)
  3. If the two points are closer than a selected threshold, merge the boxes (they detect the same object)

The crucial part is searching for the closest center of another bounding box (step 2). Which algorithm should be used here? Typically we have only a few moving objects on camera, maybe up to 30–40. For such a small number, the speedup from using a data structure for faster lookup is negligible. Each frame is a separate image, so if we wanted to construct a k-d tree, for example, we would have to do so for every frame, which may mean 30 times per second — a huge cost overall. Therefore, in this situation a simple brute force method works fastest and also has the smallest space requirement (which may be important with heavy neural networks or embedded CPUs in cameras).

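A minimal brute force sketch of this merging step could look as follows (the box format, function name and threshold value are all illustrative assumptions, not taken from any specific detector):

    import numpy as np

    def merge_duplicate_boxes(boxes, threshold=20.0):
        # boxes: array of shape (m, 4) with rows (x1, y1, x2, y2).
        # With m up to ~40 boxes per frame, this O(m^2) scan is effectively
        # free, and no per-frame tree construction is needed.
        boxes = np.asarray(boxes, dtype=float)
        centers = (boxes[:, :2] + boxes[:, 2:]) / 2            # step 1: centers
        keep = np.ones(len(boxes), dtype=bool)
        for i in range(len(boxes)):
            if not keep[i]:
                continue
            dists = np.linalg.norm(centers - centers[i], axis=1)
            dists[~keep] = np.inf                              # skip merged boxes
            dists[i] = np.inf
            j = int(np.argmin(dists))                          # step 2: 1NN lookup
            if dists[j] < threshold:                           # step 3: same object
                boxes[i] = (boxes[i] + boxes[j]) / 2           # merge by averaging
                centers[i] = (centers[i] + centers[j]) / 2
                keep[j] = False
        return boxes[keep]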

Summary

The kNN algorithm is a popular, easy and useful technique in Machine Learning, and I hope that after reading this article you understand its complexities and the real-world scenarios where and how you can use this method.

Translated from: https://towardsdatascience.com/k-nearest-neighbors-computational-complexity-502d2c440d5
