How to Ace the K-Means Algorithm Interview Questions

Data Science Interviews

KMeans is one of the most common and important clustering algorithms for a data scientist to know. It is, however, often the case that experienced data scientists do not have a good grasp of this algorithm. This makes KMeans an excellent interview topic for probing a candidate's understanding of one of the most foundational machine learning algorithms.

There are a lot of questions that can be touched on when discussing the topic:

  1. Description of the Algorithm
  2. Big O Complexity & Optimization
  3. Application of the algorithm
  4. Comparison with other clustering algorithms
  5. Advantages / Disadvantages of using K-Means

Description of the Algorithm

Describing the inner workings of the K-Means algorithm is typically the first step in an interview question centered on clustering. It shows the interviewer whether you have grasped how the algorithm works.

It might sound fine just to call KMeans().fit() and let the library handle all the algorithmic work. Still, if you need to debug some behavior or understand whether KMeans is fit for purpose, it all starts with a sound understanding of how the algorithm works.

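As a concrete starting point, here is a minimal sketch of that workflow using scikit-learn on synthetic data; the parameter values are illustrative, not prescriptive.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groups.
X, _ = make_blobs(n_samples=500, centers=3, n_features=2, random_state=42)

model = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=42)
labels = model.fit_predict(X)       # cluster index assigned to each record
centroids = model.cluster_centers_  # the k cluster means
inertia = model.inertia_            # within-cluster sum of squares
```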

High-Level Description

There are different aspects of K-Means that are worth mentioning when describing the algorithm. The first is that it is an unsupervised learning algorithm, aiming to group “records” based on their distances to a fixed number (i.e., k) of “centroids,” where the centroids are defined as the means of the k clusters.

Inner workings

Besides the high-level description provided above, it is also essential to be able to walk an interviewer through the inner workings of the algorithm: that is, from initialization, through the actual processing, to the stop conditions.

Initialization: It is important to discuss that the initialization method determines the initial cluster means. A candidate would be expected to at least mention the initialization problem: how it can lead to different clusters being created, its impact on the time it takes to obtain the clusters, etc. One of the key initialization methods to mention is the “Forgy” method, which picks k random records from the dataset as the initial centroids.

Processing: I would expect a discussion of how the algorithm traverses the points and iteratively assigns them to the nearest cluster. Great candidates are able to go beyond that description into a discussion of how KMeans minimizes the within-cluster variance, and of Lloyd's algorithm.

Stop condition: The stop conditions for the algorithm need to be mentioned. The typical stop conditions are usually based on the following:

  • (stability) Centroids of the new clusters do not change
  • (convergence) Points stay in the same cluster
  • (cap) The maximum number of iterations has been reached

Stop conditions are quite important to the algorithm, and I would expect a candidate to at least mention the stability or convergence conditions and the cap. Another key point to highlight when going through these stop conditions is the importance of having a cap implemented (see Big O Complexity below). The sketch below ties the three stages together.

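To make the walkthrough concrete, here is a minimal NumPy sketch of Lloyd's algorithm with Forgy initialization, a centroid-stability stop condition, and an iteration cap. It is illustrative rather than production-ready, and the function and parameter names are my own.

```python
import numpy as np

def kmeans_lloyd(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Forgy initialization: pick k random records as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):  # (cap) maximum number of iterations
        # Assignment step: attach each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its points,
        # keeping the old centroid if a cluster ends up empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # (stability) stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids, atol=tol):
            break
        centroids = new_centroids
    return labels, centroids
```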

Big O Complexity

It is important for candidates to understand the complexity of the algorithm, both from a training and a prediction standpoint, and how the different variables impact its performance. This is why questions around the complexity of KMeans are often asked when deep-diving into the algorithm:

Training BigO

From a training perspective, the complexity (when using Lloyd's algorithm) is:

BigO(KmeansTraining) = K * I * N * M

Where:

  • K: The number of clusters
  • I: The number of iterations
  • N: The sample size
  • M: The number of variables

As you can see, capping the number of iterations can have a significant impact.

Prediction BigO

K-means predictions have a different complexity:

BigO(KmeansPrediction) = K * N * M

For prediction, KMeans only needs to compute, for each record, the distance to each cluster centroid (a computation whose cost depends on the number of variables) and assign the record to the nearest one.

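A hedged sketch of that prediction step; the nested distance computation makes the K * N * M cost visible (the function name is hypothetical):

```python
import numpy as np

def kmeans_predict(X, centroids):
    # For each of the N records, compute the distance to each of the K
    # centroids; each distance costs O(M) in the number of variables,
    # which yields the overall K * N * M prediction complexity.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)  # index of the nearest centroid per record
```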

Scaling KMeans

During an interview, you might be asked if there are any ways to make KMeans perform faster on larger datasets. This should be a trigger to discuss mini-batch KMeans.

Mini-batch KMeans is an alternative to the traditional KMeans that provides better training performance on larger datasets. It leverages mini-batches of data, taken at random, to update the clusters' means with a decreasing learning rate. For each data batch, the points are first all assigned to a cluster, and the means are then re-calculated; the clusters' centers are recalculated using gradient descent. The algorithm provides faster convergence than the typical KMeans, but with a slightly different cluster output.

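scikit-learn ships this algorithm as MiniBatchKMeans; a minimal sketch, with a batch size chosen purely for illustration:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, centers=5, random_state=42)

# batch_size is illustrative; larger batches trade speed for stability.
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=42)
labels = mbk.fit_predict(X)
```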

Applying K-Means

Use cases

There are multiple use cases for leveraging the K-Means algorithm, from offering recommendations or some level of personalization on a website, to deep-diving into potential cluster definitions for customer analysis and targeting.

Understanding what is expected from applying K-Means also dictates how you should apply it. Do you need to find the optimal number of clusters K, or will you use an arbitrary number given by the marketing department? Do you need interpretable variables, or is this something better left for the algorithm to decide?

It is important to understand how particular K-Means use cases can impact its implementation. Implementation-specific questions usually come up as follow-ups, such as:

Let's say the marketing department asked you to provide them with user segments for an upcoming marketing campaign. What features would you look to feed into your model, and what transformations would you apply to provide them with these segments?

This type of follow-up question is very open-ended and can require further clarification, but it usually provides insight into whether or not the candidate understands how the results of the segmentation might be used.

Finding the optimal K

Understanding how to determine the number of clusters K to use often comes up as a follow-up question on the application of the algorithm.

There are different techniques for identifying the optimal number of clusters to use with KMeans. Three methods are commonly used: the Elbow method, the Silhouette method, and Gap statistics.

The Elbow method: is all about finding the point of inflection on a graph of the percentage of variance explained versus the number of clusters K.

The Silhouette method: involves calculating, for each point, a similarity/dissimilarity score between its assigned cluster and the next best (i.e., nearest) cluster.

Gap statistics: The goal of the gap statistic is to compare the cluster assignments on the actual dataset against those on randomly generated reference datasets. The comparison is done through the calculation of the intra-cluster variation, using the log of the sum of the pairwise distances between the clusters' points. A large gap statistic indicates that the clusters obtained on the observed data are very different from those obtained on the randomly generated reference data. The sketch below illustrates the first two methods.

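A hedged sketch of the Elbow and Silhouette methods with scikit-learn (the gap statistic has no built-in scikit-learn implementation, so it is omitted here); the range of candidate K values is arbitrary:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for k in range(2, 9):  # candidate values of K, chosen for illustration
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # Elbow method: look for the inflection point in the inertia
    # (within-cluster sum of squares) as K grows.
    # Silhouette method: a higher mean score means better-separated clusters.
    print(k, model.inertia_, silhouette_score(X, model.labels_))
```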

Input variables

When applying KMeans, it is crucial to understand what kind of data can be fed to the algorithm.

For each user on our video streaming platform, you have been provided with their historical content views as well as their demographic data. How do you determine what to train the model on?

This question is generally an excellent way to broach the two subtopics of variable normalization and the number of variables.

Normalization of variables

In order to work correctly, KMeans typically needs some form of normalization applied to the dataset, since K-Means is sensitive to both the means and the variances of the variables.

For numerical data, performing normalization using a StandardScaler is recommended, but depending on the specific case, other techniques might be more suitable.

For pure categorical data, one-hot encoding would likely be preferred, but it is worth being careful about the number of variables it ends up producing, both from an efficiency (BigO) standpoint and for managing KMeans' performance (see below: Number of variables).

For mixed data types, the features might need to be pre-processed beforehand. Techniques such as Principal Component Analysis (PCA) or Singular Value Decomposition (SVD) can, however, be used to transform the input data into a dataset that can be leveraged appropriately by KMeans. A sketch of such a preprocessing pipeline follows.

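A minimal, hedged sketch of one such pipeline with scikit-learn, assuming a pandas DataFrame with hypothetical column names (age, income, country):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type dataset; the column names are illustrative.
df = pd.DataFrame({
    "age": [23, 45, 31, 52],
    "income": [40_000, 85_000, 62_000, 91_000],
    "country": ["US", "FR", "US", "DE"],
})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),  # normalize numerical data
    ("cat", OneHotEncoder(), ["country"]),         # encode categorical data
])

pipeline = Pipeline([
    ("prep", preprocess),
    ("cluster", KMeans(n_clusters=2, n_init=10, random_state=42)),
])
labels = pipeline.fit_predict(df)
```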

Number of variables

The number of variables going into K-Means has an impact not only on the time/complexity it takes to train and apply the algorithm, but also on how the algorithm behaves.

This is due to the curse of dimensionality:

“So as the dimensionality increases, more and more examples become nearest neighbors of xt, until the choice of nearest neighbor (and therefore of class) is effectively random.” (https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

A large number of dimensions has a direct impact on distance-based computations, a key component of KMeans:

“The distances between a data point and its nearest and farthest neighbours can become equidistant in high dimensions, potentially compromising the accuracy of some distance-based analysis tools.” (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2238676/)

Dimensionality reduction methods such as PCA, or feature selection techniques, are things to bring up when reaching this topic; a brief sketch follows.

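A hedged sketch of chaining PCA in front of KMeans; the number of components is illustrative and would normally be tuned by inspecting the explained variance ratio:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# High-dimensional synthetic data: 50 variables.
X, _ = make_blobs(n_samples=500, centers=3, n_features=50, random_state=42)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),  # illustrative; check explained_variance_ratio_
    ("cluster", KMeans(n_clusters=3, n_init=10, random_state=42)),
])
labels = pipeline.fit_predict(X)
```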

Comparison with Other Algorithms

Besides understanding the inner workings of the KMeans algorithm, it is also important to know how it compares to other clustering algorithms.

There is a wide range of other algorithms out there: hierarchical clustering, mean shift clustering, Gaussian mixture models (GMM), DBSCAN, affinity propagation (AP), K-Medoids/PAM, …

What other clustering methods do you know?

How does algorithm X compare to K-Means?

Going through the list of algorithms, it is essential to at least know the different types of clustering methods: centroid/medoid-based (e.g., KMeans), hierarchical, density-based (e.g., MeanShift, DBSCAN), distribution-based (e.g., GMM), and affinity propagation…

When doing these types of comparisons, it is important to list at least some K-Means alternatives and to showcase some high-level knowledge of what each algorithm does and how it compares to K-Means.

You might be asked at this point to deep-dive into one of the algorithms you previously mentioned, so be prepared to explain how some of the other algorithms work, list their strengths and weaknesses compared to K-Means, and describe how their inner workings differ from K-Means.

Advantages / Disadvantages of Using K-Means

Going through any algorithm, it is important to know its advantages and disadvantages, so it is not surprising that this is often asked during interviews.

Some of the key advantages of KMeans are:

  1. It is simple to implement
  2. Computational efficiency, both for training and prediction
  3. Guaranteed convergence

While some of its disadvantages are:

  1. The number of clusters needs to be provided as an input variable.
  2. It is very dependent on the initialization process.
  3. KMeans is good at clustering spherical cluster shapes, but it performs poorly when dealing with more complicated shapes.
  4. Due to leveraging the Euclidean distance function, it is sensitive to outliers.
  5. It needs pre-processing on mixed data, as it cannot take advantage of alternative distance functions such as Gower's distance.

More from me on Hacking Analytics:

  • SQL Interview Questions For Aspiring Data Scientists — The Histogram
  • Python Screening Interview Questions for Data Scientists
  • On Applying K-Means Personalization to a Website
  • On Coding K-Means in Vanilla Python
  • How to Learn Data Science from Scratch

Translated from: https://medium.com/analytics-and-data/how-to-ace-the-k-means-algorithm-interview-questions-afe346f8fc09
