Unsupervised Learning in Machine Learning

When it comes to analyzing and making sense of data from the past, and understanding the future world based on that data, we rely on machine learning methodologies. This field of machine learning, as I have discussed in my past articles on machine learning fundamentals, is broadly categorized into:

  • Supervised Machine Learning
  • Unsupervised Machine Learning

To understand supervised ML, please visit:

Clustering: The World of Unsupervised Machine Learning

Today, we will dig deeper into the world of unsupervised learning. To help you grasp the concept, let me take the example of e-commerce portals like Flipkart, Amazon, etc.

Do you know how these e-commerce giants that you use every day manage to segment a huge list of products into various categories, with an intelligence that customizes your browsing experience based on how you navigate their portals?

This tailor-made intelligence for categorizing products is made possible by one of the popular unsupervised learning techniques, called clustering, where they group sets of customers based on their behavior and try to make sense of the data points generated by those user segments in order to offer tailor-made services.

So, some of the popular examples are:

  • Market segmentation
  • Product segmentation
  • User segmentation
  • Organizing system files into groups of folders
  • Organizing emails into different folder categories, etc.

Why Is It Called Unsupervised?

Because in this field of machine learning, the data set provided to train the ML models doesn't have any pre-defined set of labels or outcomes defined within the data. The prediction or segmentation, grouping the set of people, products, or data points into clusters, has to be done by the model itself.

For example:

Consider a problem where you are given a set of past data from a bank, containing a list of user attributes along with one target column that labels each user as:

  • Defaulter
  • Non-Defaulter

Now, a model trained on this data with a known target to achieve, namely predicting whether any user who enters the loan disbursal system will default or not, is a kind of supervised machine learning model.

But what if you had data with no such target column available, and your model had to group the customers into sets of defaulters and non-defaulters? When your model is trained to perform this kind of segmentation, it is known as an unsupervised learning model.

So, with this basic understanding of unsupervised learning, it's time to get into the fundamentals of clustering, which is a kind of unsupervised learning. Here we will cover:

  • What Is Clustering in Unsupervised ML?
  • What Are the Types of Clustering?
  • What Is K-Means Clustering?

What Is Clustering?

It is a mechanism for grouping a given set of data into segments based on the concept of similarity among those data points. The intuition behind the concept of similarity comes from the word distance.

What Is a Cluster?

It is a collection of data objects which are similar.

So, it is important here to understand the two highlighted words in the definition above:

  • Similarity
  • Distance

The Concept of Similarity in Clustering:

In cluster analysis, we stress the concept of data-point similarity, where similarity is a measure of the distance between the given data points.

These distances, which measure how close the given data points are, are used to infer how similar those data points are. Some of the popular distance-measuring techniques are:

  • Manhattan distance
  • Euclidean distance
  • Chebyshev distance
  • Minkowski distance (general form given below)
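The Minkowski distance is listed above but not defined later in this article, so as a brief aside (the standard definition, my addition), it generalizes the other three:

$$ d(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} $$

Setting p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance, and the limit p → ∞ gives the Chebyshev distance.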

Euclidean Distance:

It is probably the most common measure of distance, one we are all very familiar with in the data science and mathematical worlds.

As per Wikipedia,

In the field of mathematics, the Euclidean distance or Euclidean metric is the “ordinary” straight-line distance between two points in Euclidean space.

The Euclidean distance between points X and Y is the length of the line segment connecting them. In Cartesian coordinates, the Euclidean distance d, from X to Y or from Y to X, is given by the Pythagorean formula:
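The formula itself appeared as an image in the original post; reconstructed here in standard notation for points X = (x_1, x_2) and Y = (y_1, y_2) in the plane:

$$ d(X, Y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2} $$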

Euclidean Distance: 2-Dimension, 3-Dimension & N-Dimension:

Euclidean distance, as discussed, uses the popular Pythagorean theorem to calculate the distance between a given pair of vectors/points in n-dimensional space.

Below are the formulas for the same in 2-, 3-, and n-dimensional space:
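These formulas were also shown as images in the original; reconstructed in standard notation for points X = (x_1, ..., x_n) and Y = (y_1, ..., y_n):

In 2-D: $d(X, Y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$

In 3-D: $d(X, Y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2}$

In n-D: $d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$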

Manhattan Distance:

Unlike Euclidean distance, where we take the square root of the sum of the squared differences between the given vector points, here the distance between two points is the sum of the absolute differences of their Cartesian coordinates.

This metric is also known as snake distance, city block distance, or Manhattan length. These names take inspiration from the grid layout of most streets on the island of Manhattan, which causes the shortest path a car can take between two intersections in the borough to have a length equal to the distance between the intersections in taxicab geometry.

Manhattan distance, also called taxicab distance, can be defined by the formula given below:
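The formula was an image in the original; in standard notation, for points X = (x_1, ..., x_n) and Y = (y_1, ..., y_n):

$$ d(X, Y) = \sum_{i=1}^{n} |x_i - y_i| $$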

Chebyshev Distance:

Also popularly called the chessboard distance:

It is nothing but the maximum of the absolute differences along any single coordinate.

As per Wikipedia,

In mathematics, the Chebyshev distance (or Tchebychev distance), also known as the maximum metric, is a metric defined on a vector space where the distance between two vectors is the greatest of their differences along any coordinate dimension. It is named after Pafnuty Chebyshev.

It is also known as chessboard distance, since in the game of chess the minimum number of moves needed by a king to go from one square on a chessboard to another equals the Chebyshev distance between the centers of the squares, if the squares have side length one, as represented in 2-D spatial coordinates with axes aligned to the edges of the board.

So, for two vectors or points x and y, with standard coordinates xi and yi respectively, the distance is given by the formula below, along with its form in the 2-dimensional plane:
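The figures are missing here; in standard notation, the general form and its 2-dimensional special case are:

$$ d(x, y) = \max_{i} |x_i - y_i| $$

In 2-D: $d(x, y) = \max(|x_1 - y_1|, |x_2 - y_2|)$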

So now that we have understood the fundamentals of similarity based on measures of distance, it's time to learn what the types of clustering are and how they make use of the distance metrics discussed above to cluster given vectors of data or objects.
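Before that, here is a minimal sketch (my addition; the point values are made up for illustration) showing how the four metrics can be computed with NumPy. SciPy's scipy.spatial.distance module offers equivalent cityblock, euclidean, chebyshev, and minkowski functions if you prefer library calls.

```python
import numpy as np

# Two example points in 2-D space (made-up values for illustration)
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

diff = np.abs(x - y)                       # per-coordinate absolute differences

manhattan = diff.sum()                     # sum of absolute differences -> 7.0
euclidean = np.sqrt(((x - y) ** 2).sum())  # straight-line distance      -> 5.0
chebyshev = diff.max()                     # greatest single-axis gap    -> 4.0

def minkowski(u, v, p):
    """General Minkowski distance; p=1 -> Manhattan, p=2 -> Euclidean."""
    return (np.abs(u - v) ** p).sum() ** (1.0 / p)

print(manhattan, euclidean, chebyshev)
print(minkowski(x, y, 1), minkowski(x, y, 2))  # 7.0 and 5.0 again
```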

Types of Clustering in Unsupervised Learning:

There are basically two major categories of clustering in the field of unsupervised learning:

  • Connectivity-based clustering: also known as hierarchical clustering
  • Centroid-based clustering: K-Means being the most popular kind

Connectivity-Based Clustering:

For a tabular dataframe with N columns and rows, if we calculate the distance between every pair of row objects to find which of them are closely related or similar, so that they can be clustered together, we call this expensive mechanism connectivity-based clustering. The intuition behind this exhaustive approach is:

that objects are more related to nearby objects than to objects which are farther away.

When the data set is not very large, this kind of clustering is very effective, but if the data set is too big, it can be really resource-intensive. For example, a data set with 1,000 rows leads to half a million pairs of data points to be analysed for similarity, which can be extremely costly to process. Imagine if the number of rows becomes 10,000.
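To make the cost concrete (my arithmetic, following the text's own example): 1,000 rows give $\binom{1000}{2} = \frac{1000 \times 999}{2} = 499{,}500$ pairs, roughly half a million, while 10,000 rows give $\binom{10000}{2} = 49{,}995{,}000$ pairs, about 50 million; a 10-fold increase in rows costs roughly 100 times more pairwise comparisons.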

So, to sum up:

These connectivity-based algorithms connect "objects" to form "clusters" based on their distance. A cluster can be described largely by the maximum distance needed to connect its parts. At different distances, different clusters will form, which can be represented using a dendrogram; this explains where the common name "hierarchical clustering" comes from. These algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances.
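A minimal sketch of this idea (my addition, on synthetic data, using SciPy's hierarchical clustering utilities): build the merge hierarchy, then cut it at a chosen level to get flat clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic 2-D data: two loose groups (made up for illustration)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
               rng.normal(5.0, 0.5, (20, 2))])

# Build the full merge hierarchy; 'ward' joins the pair of clusters
# whose merge least increases within-cluster variance
Z = linkage(X, method="ward")

# Cut the hierarchy to obtain a flat assignment into 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Calling scipy.cluster.hierarchy.dendrogram(Z) would draw the dendrogram described above.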

I have covered hierarchy-based connectivity clustering in detail in one of my articles, linked below; do take some time to understand it in more depth.

Centroid-Based Clustering:

Unlike hierarchical/connectivity-based clustering, centroid-based clustering organizes the data into non-hierarchical clusters.

Intuition Behind Centroid-Based Clustering:

Here, the number of clusters is pre-defined at the outset. So, instead of visiting each and every pair of objects across n rows to calculate distances, this algorithm requires you to define the number of clusters you want to obtain; based on that, the centroids of those clusters are identified, and the distances of the data points are calculated with respect to those identified centroids.

This algorithm is very cheap compared to hierarchical clustering, as an example makes clear. If you had 1,000 rows and 5 clusters defined at the outset, the algorithm has to process only 5 × 1,000 = 5,000 point-to-centroid distances, compared with the half a million pairs in the case of a connectivity-based clustering algorithm.

How Do We Come Up With the Number of Clusters?

We will answer this question when we uncover K-means clustering, but as food for thought, it is related to a popular method known as the Elbow Method.

K-Means Clustering:

k-means is the most widely used centroid-based clustering algorithm. Centroid-based algorithms are efficient but sensitive to initial conditions and outliers. We will get into the details of K-Means clustering in the next part of this series on unsupervised learning, where we will cover the following (a small preview sketch follows the list):

  • What is K-Means clustering?
  • How does it work?
  • Implementing the k-means clustering algorithm in a hands-on Python lab
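As that small preview, here is a hedged sketch (my addition, using scikit-learn on synthetic data rather than the article's lab code) of fitting k-means, plus the inertia loop that underlies the Elbow Method mentioned earlier:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three blobs along a line (made up for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (30, 2)) for c in (0.0, 3.0, 6.0)])

# Fit k-means with a chosen k; multiple restarts (n_init) guard against
# the sensitivity to initial centroids noted above
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])        # cluster index assigned to the first 10 points
print(km.cluster_centers_)    # the 3 learned centroids

# Elbow Method: track inertia (within-cluster sum of squared distances)
# as k grows and look for the "elbow" where improvements flatten out
for k in range(1, 7):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))
```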

Translated from: https://medium.com/predict/intuition-behind-clustering-in-unsupervised-machinelearning-ff8567fb7841
