Reposted from: http://blog.csdn.net/lycorislqy/article/details/23595723

First, a note on Hierarchical Clustering:

A hierarchical method creates a hierarchical decomposition of the given set of data objects.  Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone.

CURE

Material on CURE is easy to find; the data mining textbook (《数据挖掘导论》, Introduction to Data Mining) covers it in detail. See also http://wiki.madio.net/index.php?doc-view-996 and the original paper: Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, "CURE: An Efficient Clustering Algorithm for Large Databases".

CURE is a clustering algorithm that combines a variety of techniques into an approach that can handle large data sets, outliers, and clusters of non-spherical shape and non-uniform size.

Shrinking representative points toward the center helps avoid problems with noise and outliers

Input: D = {p1, p2, ..., pn} and the number of clusters k
Output: the set of clusters
1. Draw a random sample from the data set.
The CURE paper is notable for explicitly deriving a formula for the size this sample should be in order to guarantee, with high probability, that every cluster is represented by a minimum number of points.
2. Partition the sample into p equal-sized partitions.
3. Cluster the points in each partition into m/(pq) clusters using CURE's hierarchical clustering algorithm, to obtain a total of m/q clusters.
4. Use CURE's hierarchical clustering algorithm to cluster the m/q clusters found in the previous step until only k clusters remain.
5. Eliminate outliers. This is the second phase of outlier elimination.
6. Assign all remaining data points to the nearest cluster to obtain a complete clustering.
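To illustrate the representative-point idea the steps above rely on, here is a minimal Python sketch (not the paper's implementation): each cluster's scattered representative points are shrunk toward the cluster centroid by an assumed factor `alpha`, and the distance between two clusters is the minimum distance between their representative sets.

```python
import math

def shrink_representatives(points, alpha=0.5):
    """Shrink a cluster's representative points toward its centroid
    by a factor alpha (CURE's trick for damping noise and outliers)."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    return [(cx + (1 - alpha) * (x - cx), cy + (1 - alpha) * (y - cy))
            for x, y in points]

def cluster_distance(reps_a, reps_b):
    """CURE merges the pair of clusters whose representative sets
    are closest (minimum pairwise Euclidean distance)."""
    return min(math.dist(a, b) for a in reps_a for b in reps_b)

# With alpha = 0.5, each representative moves halfway to the centroid.
reps = shrink_representatives([(0, 0), (4, 0), (2, 6)], alpha=0.5)
```

With `alpha = 0` nothing moves and the method behaves like all-points MIN linkage; with `alpha = 1` every cluster collapses to its centroid, like a centroid-based method.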

The experiments show that CURE is better able to handle clusters of arbitrary shapes and sizes.

However, CURE cannot handle clusters of differing densities.


CURE does not handle categorical attributes.
ROCK is an alternative agglomerative hierarchical clustering algorithm, suited to categorical attributes. It measures the similarity of two clusters by comparing their aggregate interconnectivity against a user-specified interconnectivity model. The interconnectivity of two clusters C1 and C2 is defined as the number of cross links between them, where link(pi, pj) is the number of common neighbors of the two points pi and pj.
In other words, cluster similarity is based on the number of points from different clusters that have common neighbors.
ROCK first constructs a sparse graph from the given data similarity matrix, using a similarity threshold and the concept of shared neighbors, and then runs a hierarchical clustering algorithm on this sparse graph.
ROCK (RObust Clustering using linKs)
Clustering algorithm for data with categorical and Boolean attributes
A pair of points is defined to be neighbors if their similarity is greater than some threshold
Use a hierarchical clustering scheme to cluster the data.

Obtain a sample of points from the data set
Compute the link value for each set of points, i.e., transform the original similarities (computed by Jaccard coefficient) into similarities that reflect the number of shared neighbors between points
Perform an agglomerative hierarchical clustering on the data using the “number of shared neighbors” as similarity measure and maximizing “the shared neighbors” objective function
Assign the remaining points to the clusters that have been found
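The link computation in the second step above can be sketched as follows. Both `theta` (the neighbor threshold) and the set-valued records are illustrative assumptions, not taken from the paper:

```python
def jaccard(a, b):
    """Jaccard coefficient between two sets of categorical values."""
    return len(a & b) / len(a | b)

def links(records, theta=0.5):
    """ROCK-style link counts: two points are neighbors if their Jaccard
    similarity is at least theta; link(i, j) is then the number of
    neighbors the two points share."""
    n = len(records)
    nbrs = [{j for j in range(n)
             if j != i and jaccard(records[i], records[j]) >= theta}
            for i in range(n)]
    return {(i, j): len(nbrs[i] & nbrs[j])
            for i in range(n) for j in range(i + 1, n)}

# Records 0 and 2 are not direct neighbors, yet link(0, 2) = 1
# because both are neighbors of record 1.
L = links([{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"x", "y"}], theta=0.5)
```

Note how the link count captures exactly the indirect similarity ROCK is after: points 0 and 2 end up related through their shared neighbor even though their direct Jaccard similarity falls below the threshold.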

The original ROCK paper is: Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes".
ROCK clustering is available in R; there is a good example here: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/RockCluster
Chameleon and the JP algorithm are both graph-based clustering methods.
Graph-Based clustering uses the proximity graph
Start with the proximity matrix
Consider each point as a node in a graph
Each edge between two nodes has a weight which is the proximity between the two points
Initially the proximity graph is fully connected 
MIN (single-link) and MAX (complete-link) can be viewed as starting with this graph

In the simplest case, clusters are connected components in the graph.
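A minimal sketch of this simplest case, assuming a distance threshold `eps` below which two points are considered connected:

```python
import math

def components(points, eps):
    """Cluster by connected components of the graph that keeps an edge
    between two points whenever their Euclidean distance is below eps."""
    n = len(points)
    seen, clusters = set(), []
    for s in range(n):
        if s in seen:
            continue
        # Depth-first search to collect one connected component.
        stack, comp = [s], []
        seen.add(s)
        while stack:
            i = stack.pop()
            comp.append(i)
            for j in range(n):
                if j not in seen and math.dist(points[i], points[j]) < eps:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(comp))
    return clusters

# Two well-separated pairs yield two components.
clusters = components([(0, 0), (1, 0), (10, 0), (11, 0)], eps=2)
```

This is essentially MIN (single-link) clustering stopped at threshold `eps`, which is why single link inherits its chaining behavior.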

Chameleon: Clustering Using Dynamic Modeling
Sources: http://wiki.madio.net/index.php?doc-view-997 and George Karypis, Eui-Hong (Sam) Han, Vipin Kumar, "Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling".
Adapt to the characteristics of the data set to find the natural clusters.
Chameleon is a hierarchical clustering algorithm that uses a dynamic model. During clustering, two clusters are merged if the interconnectivity and closeness between them are highly related to the interconnectivity and closeness of the objects inside each cluster. Because the merge decision is based on this dynamic model, the algorithm favors the discovery of natural, homogeneous clusters, and it can be applied to any type of data as long as a similarity function is defined.
Use a dynamic model to measure the similarity between clusters

  • Main property is the relative closeness and relative inter-connectivity of the cluster
  • Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters
  • The merging scheme preserves self-similarity
Self-similarity is the key property here; both the textbook and the paper above explain it in detail.
The Chameleon algorithm in detail:
  1. Build a k-nearest neighbor graph
  2. Partition the graph using a multilevel graph partitioning algorithm
  3. Repeat
  4. Merge the clusters that best preserve the cluster self-similarity with respect to relative interconnectivity and relative closeness
  5. Until no more clusters can be merged
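Step 1 of the algorithm above can be sketched as follows. The inverse-distance edge weight is an assumed choice for illustration (the paper works with similarities, and step 2 uses a multilevel partitioner such as METIS, which is out of scope here):

```python
import math

def knn_graph(points, k):
    """Sparse k-nearest-neighbor graph for Chameleon's first step.
    Returns an edge -> weight dict; weights here are inverse distances
    (an assumed stand-in for a similarity measure)."""
    edges = {}
    for i, p in enumerate(points):
        nearest = sorted((j for j in range(len(points)) if j != i),
                         key=lambda j: math.dist(p, points[j]))[:k]
        for j in nearest:
            a, b = min(i, j), max(i, j)  # store each undirected edge once
            edges[(a, b)] = 1.0 / (1.0 + math.dist(p, points[j]))
    return edges

# With k = 1, only each point's single nearest neighbor is linked,
# so the distant third point attaches weakly to the middle one.
g = knn_graph([(0, 0), (1, 0), (5, 0)], k=1)
```

Sparsifying to k neighbors is what lets the partitioning and merge phases scale: distant points simply share no edges.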

Chameleon was motivated by the observed weaknesses of the two hierarchical clustering algorithms CURE and ROCK. CURE and related schemes ignore information about the interconnectivity of objects in two different clusters, while ROCK and related schemes emphasize interconnectivity between objects but ignore information about the closeness of objects.

"How does Chameleon work?" Chameleon first uses a graph partitioning algorithm to cluster the data objects into a large number of relatively small sub-clusters, and then uses an agglomerative hierarchical clustering algorithm to find the genuine result clusters by repeatedly merging those sub-clusters. To decide which sub-clusters are most similar, it considers both the interconnectivity and the closeness between clusters, and in particular the internal characteristics of the clusters themselves. It therefore does not depend on a static, user-supplied model and can adapt automatically to the internal characteristics of the clusters being merged.
Jarvis-Patrick Clustering  
To understand this algorithm, you first need to understand SNN (Shared Nearest Neighbor) similarity.
In some cases, clustering techniques that rely on standard approaches to similarity and density do not produce the desired clustering result. SNN introduces an indirect approach to similarity based on the following principle:
if two points are similar to many of the same points, then they are similar to one another, even if a direct measurement of similarity does not indicate this.
The algorithm is as follows:
Find the k-nearest neighbors of all points.
if two points x and y are not among the k-nearest neighbors of each other then
    similarity(x, y) <- 0
else
    similarity(x, y) <- number of shared neighbors
end if
"http://btluke.com/jpclust.html"
First, the k-nearest neighbors of all points are found  
In graph terms this can be regarded as breaking all but the k strongest links from a point to other points in the proximity graph
 
A pair of points is put in the same cluster if 
the two points share more than T neighbors and 
the two points are in each other's k-nearest neighbor lists

For instance, we might choose a nearest neighbor list of size 20 and put points in the same cluster if they share more than 10 near neighbors
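The SNN pseudocode above translates directly into Python; `k` is the user-chosen neighborhood size:

```python
import math

def knn(points, k):
    """k-nearest-neighbor index set for each point."""
    out = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        out.append(set(others[:k]))
    return out

def snn_similarity(points, k):
    """SNN similarity: zero unless the two points appear in each
    other's k-NN lists; otherwise the number of shared neighbors."""
    nbrs = knn(points, k)
    n = len(points)
    sim = {}
    for x in range(n):
        for y in range(x + 1, n):
            if x in nbrs[y] and y in nbrs[x]:
                sim[(x, y)] = len(nbrs[x] & nbrs[y])
            else:
                sim[(x, y)] = 0
    return sim

# The isolated point at 10 is nobody's mutual neighbor, so all of
# its SNN similarities are zero.
s = snn_similarity([(0,), (1,), (2,), (10,)], k=2)
```

Because the similarity is a shared-neighbor count rather than a raw distance, it is less sensitive to varying density, which is exactly the property JP clustering exploits.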
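Given SNN similarities (computed, for example, as in the pseudocode earlier), the grouping step reduces to taking connected components over pairs whose similarity exceeds T. A sketch, where `sim` is an assumed dict mapping index pairs to shared-neighbor counts:

```python
def jp_clusters(n, sim, T):
    """Jarvis-Patrick grouping: link two points when their SNN
    similarity exceeds T, then return the connected components,
    found here with a small union-find structure."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (x, y), s in sim.items():
        if s > T:
            parent[find(x)] = find(y)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Pairs with similarity above T = 10 chain 0-1-2 together; the weak
# 2-3 pair keeps {3, 4} as a separate cluster.
clusters = jp_clusters(5, {(0, 1): 12, (1, 2): 15, (2, 3): 3, (3, 4): 11}, T=10)
```

This all-or-nothing edge test is the source of the brittleness discussed below: one pair crossing the threshold is enough to fuse two clusters.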

Advantages:
(1) It deals well with noise and outliers and can handle clusters of different sizes, shapes, and densities.
(2) It works well for high-dimensional data and is particularly good at finding tight clusters of strongly related objects.
But:
It is brittle: it may split true clusters, or join clusters that should be kept separate.
Not all objects end up in a cluster.
It is hard to choose the best values for the parameters.
Complexity:
The storage requirement of the JP clustering algorithm is only O(km).
The basic time complexity of JP clustering is O(m²).
For low-dimensional Euclidean data, more efficient techniques (such as a k-d tree) can be used to find the k-nearest neighbors, reducing the time complexity from O(m²) to O(m log m).
