• Note: parts of this article are adapted from Liu Jianping's blog, with thanks: 刘建平:BIRCH聚类算法原理

Table of Contents

  • 1: Overview of the BIRCH Algorithm
    • (1) Clustering Feature (CF)
    • (2) The CF Tree
    • (3) Building the CF Tree
      • A: Construction Rules
      • B: Example
  • 2: BIRCH Algorithm Workflow
  • 3: Learning BIRCH with sklearn
    • (1) Parameters
    • (2) Methods
    • (3) Example
  • 4: The sklearn Implementation of BIRCH
  • 5: Strengths and Weaknesses of BIRCH
    • (1) Strengths
    • (2) Weaknesses

1: Overview of the BIRCH Algorithm

BIRCH: proposed in 1996, this algorithm is especially well suited to large databases. It clusters the input data incrementally and can often produce a result with a single scan of all the data; additional scans can of course improve the clustering quality, and the algorithm also handles noise effectively. BIRCH rests on two concepts

  • the clustering feature (CF)
  • the clustering feature tree (CF tree)

A CF tree is similar to a balanced B+ tree; every node of a CF tree consists of a number of clustering features (CFs)

(1) Clustering Feature (CF)

Clustering feature (CF): a clustering feature is a triple that gives a summary description of a cluster. For a set of $N$ $d$-dimensional data points $\{x_1, x_2, ..., x_N\}$, the clustering feature is defined as

$$CF = (N, \vec{LS}, SS)$$

  • $N$: the number of data points
  • $\vec{LS}$: the linear sum of the $N$ data points over each feature dimension, i.e. $\sum_{i=1}^{N} \vec{x_i}$
  • $SS$: the squared sum of the $N$ data points over each feature dimension, i.e. $\sum_{i=1}^{N} \vec{x_i}^{\,2}$

Clustering features are additive:

  • $CF_1 = (N_1, \vec{LS_1}, SS_1)$
  • $CF_2 = (N_2, \vec{LS_2}, SS_2)$
  • $CF_1 + CF_2 = (N_1 + N_2, \vec{LS_1} + \vec{LS_2}, SS_1 + SS_2)$

As shown in the figure below, a cluster is formed by 5 samples $(3,4), (2,6), (4,5), (4,7), (3,8)$. Then (the sketch after this list verifies these values)

  • $N$: 5
  • $\vec{LS}$: $(3+2+4+4+3,\ 4+6+5+7+8) = (16, 30)$
  • $SS$: $3^2+2^2+4^2+4^2+3^2 + 4^2+6^2+5^2+7^2+8^2 = 244$
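To make the definition concrete, here is a small sketch (the function name cf_triple is our own illustrative choice, not part of any library) that computes the CF triple of the five points above and checks the additivity property:

import numpy as np

# the five sample points from the example above
points = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])

def cf_triple(X):
    """Return the clustering feature (N, LS, SS) of a set of points."""
    return len(X), X.sum(axis=0), (X ** 2).sum()

N, LS, SS = cf_triple(points)
print(N, LS, SS)                     # 5 [16 30] 244

# CF additivity: the CF of a union equals the sum of the two CFs
N1, LS1, SS1 = cf_triple(points[:2])
N2, LS2, SS2 = cf_triple(points[2:])
assert N1 + N2 == N and (LS1 + LS2 == LS).all() and SS1 + SS2 == SS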

(2) The CF Tree

CF Tree: a CF tree stores the clustering features of a hierarchical clustering. The figure below shows a typical CF tree. The CF stored in each non-leaf node is the sum of the CFs of its children, i.e. a parent node summarizes the information stored in its child nodes. A leaf node contains at most $L$ entries, each of which is a CF triple, and every leaf node carries two pointers, prev and next, which chain all leaf nodes together

A CF tree has the following two parameters, which determine the size of the tree; the larger the threshold $T$, the smaller the tree

  • Branching factor $B$: the maximum number of children of a non-leaf node
  • Threshold $T$: the maximum radius of the subclusters stored in the leaf nodes

As shown in the figure below, for every CF entry in a parent node, its triple $(N, \vec{LS}, SS)$ equals the sum of the triples of all the child nodes that this CF entry points to

(3) Building the CF Tree

A: Construction Rules

Construction rules: the CF tree grows dynamically as data is read in. Inserting a sample proceeds as follows (a runnable single-leaf sketch follows this list)

  • Find the leaf to insert into: starting from the root of the CF tree, compare against the node's CF entries to find the closest subcluster, then continue the search among that subcluster's children, repeating until a leaf node is reached
  • Modify the leaf: find the closest entry in the leaf and check whether the leaf can accept the new entry; if so, insert it directly, otherwise split the leaf node and then insert
  • Update the CFs along the path: propagate the updated CF information bottom-up from the modified leaf towards the root
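To make the first two rules concrete, here is a minimal runnable sketch that maintains a single leaf node as a list of CF triples (the tree descent, the leaf capacity L, and node splitting are omitted; try_insert and this CF representation are illustrative choices of ours, not sklearn API):

import numpy as np

T = 1.5       # radius threshold (illustrative value)
leaf = []     # one leaf node: a list of CF triples [N, LS, SS]

def try_insert(leaf, x):
    """Insert point x into the closest CF if the merged radius stays
    within T; otherwise open a new CF for it."""
    x = np.asarray(x, dtype=float)
    if leaf:
        # rule 1: find the CF whose centroid is closest to x
        i = int(np.argmin([np.linalg.norm(ls / n - x) for n, ls, ss in leaf]))
        n, ls, ss = leaf[i]
        n2, ls2, ss2 = n + 1, ls + x, ss + x @ x
        c = ls2 / n2
        # rule 2: absorb x only if the merged radius is still <= T
        if np.sqrt(max(ss2 / n2 - c @ c, 0.0)) <= T:
            leaf[i] = [n2, ls2, ss2]
            return
    leaf.append([1, x, float(x @ x)])    # start a new CF

for p in [[3, 4], [2, 6], [4, 5], [10, 10], [4, 7]]:
    try_insert(leaf, p)

print([cf[0] for cf in leaf])   # [4, 1]: four points merged into one CF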

B: Example

Let us restate the three parameters (the radius mentioned in $T$ can be computed directly from a CF triple, as sketched after this list)

  • $B$: the maximum number of CFs in a non-leaf node
  • $L$: the maximum number of CFs in a leaf node
  • $T$: the maximum sample-radius threshold of each CF in a leaf node
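From $CF = (N, \vec{LS}, SS)$, the squared radius of a subcluster is $r^2 = SS/N - \|\vec{LS}/N\|^2$, the mean squared distance of the points to their centroid; this is the same formula used by merge_subcluster in the sklearn source reproduced in section 4. A small sketch, with cf_radius and fits_same_cf as illustrative names of our own:

import numpy as np

def cf_radius(N, LS, SS):
    """Subcluster radius from its CF triple: r^2 = SS/N - ||LS/N||^2."""
    c = LS / N
    return np.sqrt(max(SS / N - c @ c, 0.0))

def fits_same_cf(cf_a, cf_b, T):
    """Would merging the two CFs keep the merged radius within T?"""
    N, LS, SS = cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]
    return cf_radius(N, LS, SS) <= T

# a single point is itself a CF with N=1, LS=x, SS=||x||^2
x = np.array([3.5, 5.0])
cluster = (5, np.array([16.0, 30.0]), 244.0)   # CF of the earlier 5-point example
print(cf_radius(*cluster))                           # 1.6
print(fits_same_cf(cluster, (1, x, x @ x), T=2.0))   # True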

Initially, the CF tree is empty. Read the first sample from the dataset, put it into a new CF triple A (at this point N = 1 in this triple), and place this new CF in the root node

Read the second sample. It falls within radius T of A, so it belongs to the same CF: add it to A and update A's triple (now N = 2 in A)

Read the third sample. It does not fall within radius T of A, so a new CF triple B is needed to hold this value

Read the fourth sample. It falls within radius T of B, so it is likewise added to B
The CF tree is a balanced B+ tree, so its construction must respect the parameters B and L

  • For the node LN1 below, if L is set greater than 3 (i.e. a leaf node can hold at least 4 CFs), then the CF sc8 can simply be added as another entry of the leaf node LN1

  • If L is set less than 3 (i.e. a leaf node holds at most 2 CFs), a new leaf must be split off. To split, pick the two CFs in LN1 that are farthest apart as the seed entries of the two new leaf nodes, and assign each remaining CF to the nearer seed

  • B constrains the internal nodes in the same way: if B is set to 3 or less (i.e. a non-leaf node holds at most 3 CFs), the internal node must also be split. The splitting method is identical: pick the two farthest-apart CFs as the seeds of the two new branches (see the sketch below). The result after splitting is shown in the figure
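The seeding rule can be sketched in a few lines (split_entries is our illustrative name; the internal _split_node function in the sklearn source of section 4 applies the same idea using squared euclidean distances):

import numpy as np
from scipy.spatial.distance import cdist

def split_entries(centroids):
    """Split an overflowing node: take the two farthest-apart centroids as
    seeds and assign every entry to the nearer seed (illustrative sketch)."""
    d = cdist(centroids, centroids)
    i, j = np.unravel_index(d.argmax(), d.shape)   # farthest pair = seeds
    to_first = d[:, i] <= d[:, j]                  # nearer to seed i?
    return np.where(to_first)[0], np.where(~to_first)[0]

cents = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [4.8, 5.2]])
print(split_entries(cents))   # (array([0, 1]), array([2, 3]))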

2: BIRCH Algorithm Workflow

Following the original paper, the full BIRCH algorithm runs in up to four phases, of which only the first is mandatory

  • Phase 1: read the samples one by one and build a CF tree in memory, following the insertion rules described above
  • Phase 2 (optional): condense the CF tree by removing outlier CFs and merging overly similar subclusters into a smaller tree
  • Phase 3: apply a global clustering algorithm (e.g. agglomerative clustering or K-Means) to the CF triples stored in the leaf entries
  • Phase 4 (optional): refine the result with additional scans of the data, reassigning each sample to the closest centroid produced by the previous phase

3: Learning BIRCH with sklearn

  • The BIRCH algorithm is fairly complex to implement, since it relies on the CF Tree data structure, so we first use the Birch class from sklearn to study the algorithm and look at the clustering results it produces
  • sklearn.cluster.Birch API link

In sklearn, the BIRCH algorithm is implemented by the following class

class sklearn.cluster.Birch(*,
threshold=0.5,
branching_factor=50,
n_clusters=3,
compute_labels=True,
copy=True)

(1) Parameters

①: threshold (T): the radius threshold of the hypersphere formed by all samples in each CF. Type: float, default 0.5

  • The radius of the subcluster obtained by merging a new sample with its closest subcluster must stay below this threshold, otherwise a new subcluster is started
  • Setting it lower promotes splitting
  • If the samples have high variance, this default usually needs to be increased

②: branching_factor (B and L): the maximum number of CFs in a non-leaf node (B) and in a leaf node (L). sklearn uses a single value for both, i.e. branching_factor caps the number of CFs in every node of the CF Tree. Type: int, default 50

  • If inserting a new sample pushes a node's CF count past branching_factor, the node must be split

③: n_clusters (the number of clusters K): in sklearn this is an optional parameter

  • None: generally, if you cannot tell in advance how many clusters the data should form, pass None; BIRCH then skips the final global-clustering phase and returns the leaf subclusters as they are (see the example below)
  • int: the number of final clusters; the subcluster centroids read from the leaves are then clustered with AgglomerativeClustering using this n_clusters
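For example (adapted from the example in the sklearn docstring reproduced in section 4; the labels shown for n_clusters=None come from that docstring):

from sklearn.cluster import Birch

X = [[0, 1], [0.3, 1], [-0.3, 1], [0, -1], [0.3, -1], [-0.3, -1]]

# n_clusters=None: skip the global clustering step; each leaf
# subcluster is returned as its own cluster
brc = Birch(n_clusters=None).fit(X)
print(brc.predict(X))        # array([0, 0, 0, 1, 1, 1])

# n_clusters=2: additionally run AgglomerativeClustering(n_clusters=2)
# on the subcluster centroids
print(Birch(n_clusters=2).fit_predict(X))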

④: compute_labels: whether to compute labels for each fit. Type: bool, default True

⑤: copy: whether to make a copy of the input data; if set to False, the initial data may be overwritten. Type: bool, default True

(2) Methods

①: fit(X[, y]): build a CF Tree for the given data

  • X: the input data
  • y: not used, present for API consistency by convention

②: fit_predict(X, y=None): cluster X and return the cluster labels (see the sketch below)

  • X: the input data
  • y: not used, present for API consistency by convention
  • returns an ndarray identifying the cluster of each sample
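A minimal usage sketch (the random data here is purely illustrative):

from sklearn.cluster import Birch
import numpy as np

X = np.random.RandomState(0).rand(100, 2)   # 100 random 2-D samples

model = Birch(threshold=0.2, n_clusters=3)
labels = model.fit_predict(X)    # builds the CF tree and labels X
# fit_predict(X) is equivalent to model.fit(X) followed by model.labels_
print(labels.shape)              # (100,)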

(3) Example

from sklearn.cluster import Birch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# load the 2-D point set (two unnamed columns: X and Y)
raw_data = pd.read_csv(r'E:\Postgraduate\Dataset\788points.csv', header=None)
raw_data.columns = ['X', 'Y']
x_axis = 'X'
y_axis = 'Y'

examples_num = raw_data.shape[0]
train_data = raw_data[[x_axis, y_axis]].values.reshape(examples_num, 2)

# min-max normalize every feature to [0, 1]
min_vals = train_data.min(0)
max_vals = train_data.max(0)
ranges = max_vals - min_vals
nums = train_data.shape[0]
normal_data = train_data - np.tile(min_vals, (nums, 1))
normal_data = normal_data / np.tile(ranges, (nums, 1))

# cluster with BIRCH and plot the labelled points
model = Birch(threshold=0.1, n_clusters=7)
label = model.fit_predict(normal_data)
plt.scatter(normal_data[:, 0], normal_data[:, 1], c=label)
plt.show()

4: The sklearn Implementation of BIRCH

Implementing BIRCH from scratch in Python is fairly involved: it requires maintaining a balanced tree, handling node splits, and so on, so writing it from zero means reinventing a lot of wheels. The source code of sklearn's implementation is therefore reproduced below for reference

# Authors: Manoj Kumar <manojkumarsivaraj334@gmail.com>
#          Alexandre Gramfort <alexandre.gramfort@telecom-paristech.fr>
#          Joel Nothman <joel.nothman@gmail.com>
# License: BSD 3 clause

import warnings
import numpy as np
from numbers import Integral, Real
from scipy import sparse
from math import sqrt

from ..metrics import pairwise_distances_argmin
from ..metrics.pairwise import euclidean_distances
from ..base import (
    TransformerMixin,
    ClusterMixin,
    BaseEstimator,
    _ClassNamePrefixFeaturesOutMixin,
)
from ..utils.extmath import row_norms
from ..utils import deprecated
from ..utils._param_validation import Interval
from ..utils.validation import check_is_fitted
from ..exceptions import ConvergenceWarning
from . import AgglomerativeClustering
from .._config import config_context


def _iterate_sparse_X(X):
    """This little hack returns a densified row when iterating over a sparse
    matrix, instead of constructing a sparse matrix for every row that is
    expensive.
    """
    n_samples = X.shape[0]
    X_indices = X.indices
    X_data = X.data
    X_indptr = X.indptr

    for i in range(n_samples):
        row = np.zeros(X.shape[1])
        startptr, endptr = X_indptr[i], X_indptr[i + 1]
        nonzero_indices = X_indices[startptr:endptr]
        row[nonzero_indices] = X_data[startptr:endptr]
        yield row


def _split_node(node, threshold, branching_factor):
    """The node has to be split if there is no place for a new subcluster
    in the node.
    1. Two empty nodes and two empty subclusters are initialized.
    2. The pair of distant subclusters are found.
    3. The properties of the empty subclusters and nodes are updated
       according to the nearest distance between the subclusters to the
       pair of distant subclusters.
    4. The two nodes are set as children to the two subclusters.
    """
    new_subcluster1 = _CFSubcluster()
    new_subcluster2 = _CFSubcluster()
    new_node1 = _CFNode(
        threshold=threshold,
        branching_factor=branching_factor,
        is_leaf=node.is_leaf,
        n_features=node.n_features,
        dtype=node.init_centroids_.dtype,
    )
    new_node2 = _CFNode(
        threshold=threshold,
        branching_factor=branching_factor,
        is_leaf=node.is_leaf,
        n_features=node.n_features,
        dtype=node.init_centroids_.dtype,
    )
    new_subcluster1.child_ = new_node1
    new_subcluster2.child_ = new_node2

    if node.is_leaf:
        if node.prev_leaf_ is not None:
            node.prev_leaf_.next_leaf_ = new_node1
        new_node1.prev_leaf_ = node.prev_leaf_
        new_node1.next_leaf_ = new_node2
        new_node2.prev_leaf_ = new_node1
        new_node2.next_leaf_ = node.next_leaf_
        if node.next_leaf_ is not None:
            node.next_leaf_.prev_leaf_ = new_node2

    dist = euclidean_distances(
        node.centroids_, Y_norm_squared=node.squared_norm_, squared=True
    )
    n_clusters = dist.shape[0]

    farthest_idx = np.unravel_index(dist.argmax(), (n_clusters, n_clusters))
    node1_dist, node2_dist = dist[(farthest_idx,)]

    node1_closer = node1_dist < node2_dist
    # make sure node1 is closest to itself even if all distances are equal.
    # This can only happen when all node.centroids_ are duplicates leading to all
    # distances between centroids being zero.
    node1_closer[farthest_idx[0]] = True

    for idx, subcluster in enumerate(node.subclusters_):
        if node1_closer[idx]:
            new_node1.append_subcluster(subcluster)
            new_subcluster1.update(subcluster)
        else:
            new_node2.append_subcluster(subcluster)
            new_subcluster2.update(subcluster)
    return new_subcluster1, new_subcluster2


class _CFNode:
    """Each node in a CFTree is called a CFNode.

    The CFNode can have a maximum of branching_factor
    number of CFSubclusters.

    Parameters
    ----------
    threshold : float
        Threshold needed for a new subcluster to enter a CFSubcluster.

    branching_factor : int
        Maximum number of CF subclusters in each node.

    is_leaf : bool
        We need to know if the CFNode is a leaf or not, in order to
        retrieve the final subclusters.

    n_features : int
        The number of features.

    Attributes
    ----------
    subclusters_ : list
        List of subclusters for a particular CFNode.

    prev_leaf_ : _CFNode
        Useful only if is_leaf is True.

    next_leaf_ : _CFNode
        next_leaf. Useful only if is_leaf is True.
        the final subclusters.

    init_centroids_ : ndarray of shape (branching_factor + 1, n_features)
        Manipulate ``init_centroids_`` throughout rather than centroids_ since
        the centroids are just a view of the ``init_centroids_`` .

    init_sq_norm_ : ndarray of shape (branching_factor + 1,)
        manipulate init_sq_norm_ throughout. similar to ``init_centroids_``.

    centroids_ : ndarray of shape (branching_factor + 1, n_features)
        View of ``init_centroids_``.

    squared_norm_ : ndarray of shape (branching_factor + 1,)
        View of ``init_sq_norm_``.
    """

    def __init__(self, *, threshold, branching_factor, is_leaf, n_features, dtype):
        self.threshold = threshold
        self.branching_factor = branching_factor
        self.is_leaf = is_leaf
        self.n_features = n_features

        # The list of subclusters, centroids and squared norms
        # to manipulate throughout.
        self.subclusters_ = []
        self.init_centroids_ = np.zeros((branching_factor + 1, n_features), dtype=dtype)
        self.init_sq_norm_ = np.zeros((branching_factor + 1), dtype)
        self.squared_norm_ = []
        self.prev_leaf_ = None
        self.next_leaf_ = None

    def append_subcluster(self, subcluster):
        n_samples = len(self.subclusters_)
        self.subclusters_.append(subcluster)
        self.init_centroids_[n_samples] = subcluster.centroid_
        self.init_sq_norm_[n_samples] = subcluster.sq_norm_

        # Keep centroids and squared norm as views. In this way
        # if we change init_centroids and init_sq_norm_, it is
        # sufficient,
        self.centroids_ = self.init_centroids_[: n_samples + 1, :]
        self.squared_norm_ = self.init_sq_norm_[: n_samples + 1]

    def update_split_subclusters(self, subcluster, new_subcluster1, new_subcluster2):
        """Remove a subcluster from a node and update it with the
        split subclusters.
        """
        ind = self.subclusters_.index(subcluster)
        self.subclusters_[ind] = new_subcluster1
        self.init_centroids_[ind] = new_subcluster1.centroid_
        self.init_sq_norm_[ind] = new_subcluster1.sq_norm_
        self.append_subcluster(new_subcluster2)

    def insert_cf_subcluster(self, subcluster):
        """Insert a new subcluster into the node."""
        if not self.subclusters_:
            self.append_subcluster(subcluster)
            return False

        threshold = self.threshold
        branching_factor = self.branching_factor
        # We need to find the closest subcluster among all the
        # subclusters so that we can insert our new subcluster.
        dist_matrix = np.dot(self.centroids_, subcluster.centroid_)
        dist_matrix *= -2.0
        dist_matrix += self.squared_norm_
        closest_index = np.argmin(dist_matrix)
        closest_subcluster = self.subclusters_[closest_index]

        # If the subcluster has a child, we need a recursive strategy.
        if closest_subcluster.child_ is not None:
            split_child = closest_subcluster.child_.insert_cf_subcluster(subcluster)

            if not split_child:
                # If it is determined that the child need not be split, we
                # can just update the closest_subcluster
                closest_subcluster.update(subcluster)
                self.init_centroids_[closest_index] = self.subclusters_[
                    closest_index
                ].centroid_
                self.init_sq_norm_[closest_index] = self.subclusters_[
                    closest_index
                ].sq_norm_
                return False

            # things not too good. we need to redistribute the subclusters in
            # our child node, and add a new subcluster in the parent
            # subcluster to accommodate the new child.
            else:
                new_subcluster1, new_subcluster2 = _split_node(
                    closest_subcluster.child_,
                    threshold,
                    branching_factor,
                )
                self.update_split_subclusters(
                    closest_subcluster, new_subcluster1, new_subcluster2
                )

                if len(self.subclusters_) > self.branching_factor:
                    return True
                return False

        # good to go!
        else:
            merged = closest_subcluster.merge_subcluster(subcluster, self.threshold)
            if merged:
                self.init_centroids_[closest_index] = closest_subcluster.centroid_
                self.init_sq_norm_[closest_index] = closest_subcluster.sq_norm_
                return False

            # not close to any other subclusters, and we still
            # have space, so add.
            elif len(self.subclusters_) < self.branching_factor:
                self.append_subcluster(subcluster)
                return False

            # We do not have enough space nor is it closer to an
            # other subcluster. We need to split.
            else:
                self.append_subcluster(subcluster)
                return True


class _CFSubcluster:
    """Each subcluster in a CFNode is called a CFSubcluster.

    A CFSubcluster can have a CFNode has its child.

    Parameters
    ----------
    linear_sum : ndarray of shape (n_features,), default=None
        Sample. This is kept optional to allow initialization of empty
        subclusters.

    Attributes
    ----------
    n_samples_ : int
        Number of samples that belong to each subcluster.

    linear_sum_ : ndarray
        Linear sum of all the samples in a subcluster. Prevents holding
        all sample data in memory.

    squared_sum_ : float
        Sum of the squared l2 norms of all samples belonging to a subcluster.

    centroid_ : ndarray of shape (branching_factor + 1, n_features)
        Centroid of the subcluster. Prevent recomputing of centroids when
        ``CFNode.centroids_`` is called.

    child_ : _CFNode
        Child Node of the subcluster. Once a given _CFNode is set as the child
        of the _CFNode, it is set to ``self.child_``.

    sq_norm_ : ndarray of shape (branching_factor + 1,)
        Squared norm of the subcluster. Used to prevent recomputing when
        pairwise minimum distances are computed.
    """

    def __init__(self, *, linear_sum=None):
        if linear_sum is None:
            self.n_samples_ = 0
            self.squared_sum_ = 0.0
            self.centroid_ = self.linear_sum_ = 0
        else:
            self.n_samples_ = 1
            self.centroid_ = self.linear_sum_ = linear_sum
            self.squared_sum_ = self.sq_norm_ = np.dot(
                self.linear_sum_, self.linear_sum_
            )
        self.child_ = None

    def update(self, subcluster):
        self.n_samples_ += subcluster.n_samples_
        self.linear_sum_ += subcluster.linear_sum_
        self.squared_sum_ += subcluster.squared_sum_
        self.centroid_ = self.linear_sum_ / self.n_samples_
        self.sq_norm_ = np.dot(self.centroid_, self.centroid_)

    def merge_subcluster(self, nominee_cluster, threshold):
        """Check if a cluster is worthy enough to be merged. If
        yes then merge.
        """
        new_ss = self.squared_sum_ + nominee_cluster.squared_sum_
        new_ls = self.linear_sum_ + nominee_cluster.linear_sum_
        new_n = self.n_samples_ + nominee_cluster.n_samples_
        new_centroid = (1 / new_n) * new_ls
        new_sq_norm = np.dot(new_centroid, new_centroid)

        # The squared radius of the cluster is defined:
        #   r^2  = sum_i ||x_i - c||^2 / n
        # with x_i the n points assigned to the cluster and c its centroid:
        #   c = sum_i x_i / n
        # This can be expanded to:
        #   r^2 = sum_i ||x_i||^2 / n - 2 <sum_i x_i / n, c> + n ||c||^2 / n
        # and therefore simplifies to:
        #   r^2 = sum_i ||x_i||^2 / n - ||c||^2
        sq_radius = new_ss / new_n - new_sq_norm

        if sq_radius <= threshold**2:
            (
                self.n_samples_,
                self.linear_sum_,
                self.squared_sum_,
                self.centroid_,
                self.sq_norm_,
            ) = (new_n, new_ls, new_ss, new_centroid, new_sq_norm)
            return True
        return False

    @property
    def radius(self):
        """Return radius of the subcluster"""
        # Because of numerical issues, this could become negative
        sq_radius = self.squared_sum_ / self.n_samples_ - self.sq_norm_
        return sqrt(max(0, sq_radius))


class Birch(
    _ClassNamePrefixFeaturesOutMixin, ClusterMixin, TransformerMixin, BaseEstimator
):
    """Implements the BIRCH clustering algorithm.

    It is a memory-efficient, online-learning algorithm provided as an
    alternative to :class:`MiniBatchKMeans`. It constructs a tree
    data structure with the cluster centroids being read off the leaf.
    These can be either the final cluster centroids or can be provided as input
    to another clustering algorithm such as :class:`AgglomerativeClustering`.

    Read more in the :ref:`User Guide <birch>`.

    .. versionadded:: 0.16

    Parameters
    ----------
    threshold : float, default=0.5
        The radius of the subcluster obtained by merging a new sample and the
        closest subcluster should be lesser than the threshold. Otherwise a new
        subcluster is started. Setting this value to be very low promotes
        splitting and vice-versa.

    branching_factor : int, default=50
        Maximum number of CF subclusters in each node. If a new samples enters
        such that the number of subclusters exceed the branching_factor then
        that node is split into two nodes with the subclusters redistributed
        in each. The parent subcluster of that node is removed and two new
        subclusters are added as parents of the 2 split nodes.

    n_clusters : int, instance of sklearn.cluster model or None, default=3
        Number of clusters after the final clustering step, which treats the
        subclusters from the leaves as new samples.

        - `None` : the final clustering step is not performed and the
          subclusters are returned as they are.

        - :mod:`sklearn.cluster` Estimator : If a model is provided, the model
          is fit treating the subclusters as new samples and the initial data
          is mapped to the label of the closest subcluster.

        - `int` : the model fit is :class:`AgglomerativeClustering` with
          `n_clusters` set to be equal to the int.

    compute_labels : bool, default=True
        Whether or not to compute labels for each fit.

    copy : bool, default=True
        Whether or not to make a copy of the given data. If set to False,
        the initial data will be overwritten.

    Attributes
    ----------
    root_ : _CFNode
        Root of the CFTree.

    dummy_leaf_ : _CFNode
        Start pointer to all the leaves.

    subcluster_centers_ : ndarray
        Centroids of all subclusters read directly from the leaves.

    subcluster_labels_ : ndarray
        Labels assigned to the centroids of the subclusters after
        they are clustered globally.

    labels_ : ndarray of shape (n_samples,)
        Array of labels assigned to the input data.
        if partial_fit is used instead of fit, they are assigned to the
        last batch of data.

    n_features_in_ : int
        Number of features seen during :term:`fit`.

        .. versionadded:: 0.24

    feature_names_in_ : ndarray of shape (`n_features_in_`,)
        Names of features seen during :term:`fit`. Defined only when `X`
        has feature names that are all strings.

        .. versionadded:: 1.0

    See Also
    --------
    MiniBatchKMeans : Alternative implementation that does incremental updates
        of the centers' positions using mini-batches.

    Notes
    -----
    The tree data structure consists of nodes with each node consisting of
    a number of subclusters. The maximum number of subclusters in a node
    is determined by the branching factor. Each subcluster maintains a
    linear sum, squared sum and the number of samples in that subcluster.
    In addition, each subcluster can also have a node as its child, if the
    subcluster is not a member of a leaf node.

    For a new point entering the root, it is merged with the subcluster closest
    to it and the linear sum, squared sum and the number of samples of that
    subcluster are updated. This is done recursively till the properties of
    the leaf node are updated.

    References
    ----------
    * Tian Zhang, Raghu Ramakrishnan, Maron Livny
      BIRCH: An efficient data clustering method for large databases.
      https://www.cs.sfu.ca/CourseCentral/459/han/papers/zhang96.pdf

    * Roberto Perdisci
      JBirch - Java implementation of BIRCH clustering algorithm
      https://code.google.com/archive/p/jbirch

    Examples
    --------
    >>> from sklearn.cluster import Birch
    >>> X = [[0, 1], [0.3, 1], [-0.3, 1], [0, -1], [0.3, -1], [-0.3, -1]]
    >>> brc = Birch(n_clusters=None)
    >>> brc.fit(X)
    Birch(n_clusters=None)
    >>> brc.predict(X)
    array([0, 0, 0, 1, 1, 1])
    """

    _parameter_constraints: dict = {
        "threshold": [Interval(Real, 0.0, None, closed="neither")],
        "branching_factor": [Interval(Integral, 1, None, closed="neither")],
        "n_clusters": [None, ClusterMixin, Interval(Integral, 1, None, closed="left")],
        "compute_labels": ["boolean"],
        "copy": ["boolean"],
    }

    def __init__(
        self,
        *,
        threshold=0.5,
        branching_factor=50,
        n_clusters=3,
        compute_labels=True,
        copy=True,
    ):
        self.threshold = threshold
        self.branching_factor = branching_factor
        self.n_clusters = n_clusters
        self.compute_labels = compute_labels
        self.copy = copy

    # TODO: Remove in 1.2
    # mypy error: Decorated property not supported
    @deprecated(  # type: ignore
        "`fit_` is deprecated in 1.0 and will be removed in 1.2."
    )
    @property
    def fit_(self):
        return self._deprecated_fit

    # TODO: Remove in 1.2
    # mypy error: Decorated property not supported
    @deprecated(  # type: ignore
        "`partial_fit_` is deprecated in 1.0 and will be removed in 1.2."
    )
    @property
    def partial_fit_(self):
        return self._deprecated_partial_fit

    def fit(self, X, y=None):
        """Build a CF Tree for the input data.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            Input data.

        y : Ignored
            Not used, present here for API consistency by convention.

        Returns
        -------
        self
            Fitted estimator.
        """
        self._validate_params()

        # TODO: Remove deprecated flags in 1.2
        self._deprecated_fit, self._deprecated_partial_fit = True, False
        return self._fit(X, partial=False)

    def _fit(self, X, partial):
        has_root = getattr(self, "root_", None)
        first_call = not (partial and has_root)

        X = self._validate_data(
            X,
            accept_sparse="csr",
            copy=self.copy,
            reset=first_call,
            dtype=[np.float64, np.float32],
        )
        threshold = self.threshold
        branching_factor = self.branching_factor

        n_samples, n_features = X.shape

        # If partial_fit is called for the first time or fit is called, we
        # start a new tree.
        if first_call:
            # The first root is the leaf. Manipulate this object throughout.
            self.root_ = _CFNode(
                threshold=threshold,
                branching_factor=branching_factor,
                is_leaf=True,
                n_features=n_features,
                dtype=X.dtype,
            )

            # To enable getting back subclusters.
            self.dummy_leaf_ = _CFNode(
                threshold=threshold,
                branching_factor=branching_factor,
                is_leaf=True,
                n_features=n_features,
                dtype=X.dtype,
            )
            self.dummy_leaf_.next_leaf_ = self.root_
            self.root_.prev_leaf_ = self.dummy_leaf_

        # Cannot vectorize. Enough to convince to use cython.
        if not sparse.issparse(X):
            iter_func = iter
        else:
            iter_func = _iterate_sparse_X

        for sample in iter_func(X):
            subcluster = _CFSubcluster(linear_sum=sample)
            split = self.root_.insert_cf_subcluster(subcluster)

            if split:
                new_subcluster1, new_subcluster2 = _split_node(
                    self.root_, threshold, branching_factor
                )
                del self.root_
                self.root_ = _CFNode(
                    threshold=threshold,
                    branching_factor=branching_factor,
                    is_leaf=False,
                    n_features=n_features,
                    dtype=X.dtype,
                )
                self.root_.append_subcluster(new_subcluster1)
                self.root_.append_subcluster(new_subcluster2)

        centroids = np.concatenate([leaf.centroids_ for leaf in self._get_leaves()])
        self.subcluster_centers_ = centroids
        self._n_features_out = self.subcluster_centers_.shape[0]

        self._global_clustering(X)
        return self

    def _get_leaves(self):
        """Retrieve the leaves of the CF Node.

        Returns
        -------
        leaves : list of shape (n_leaves,)
            List of the leaf nodes.
        """
        leaf_ptr = self.dummy_leaf_.next_leaf_
        leaves = []
        while leaf_ptr is not None:
            leaves.append(leaf_ptr)
            leaf_ptr = leaf_ptr.next_leaf_
        return leaves

    def partial_fit(self, X=None, y=None):
        """Online learning. Prevents rebuilding of CFTree from scratch.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features), \
            default=None
            Input data. If X is not provided, only the global clustering
            step is done.

        y : Ignored
            Not used, present here for API consistency by convention.

        Returns
        -------
        self
            Fitted estimator.
        """
        self._validate_params()

        # TODO: Remove deprecated flags in 1.2
        self._deprecated_partial_fit, self._deprecated_fit = True, False
        if X is None:
            # Perform just the final global clustering step.
            self._global_clustering()
            return self
        else:
            return self._fit(X, partial=True)

    def _check_fit(self, X):
        check_is_fitted(self)

        if (
            hasattr(self, "subcluster_centers_")
            and X.shape[1] != self.subcluster_centers_.shape[1]
        ):
            raise ValueError(
                "Training data and predicted data do not have same number of features."
            )

    def predict(self, X):
        """Predict data using the ``centroids_`` of subclusters.

        Avoid computation of the row norms of X.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            Input data.

        Returns
        -------
        labels : ndarray of shape(n_samples,)
            Labelled data.
        """
        check_is_fitted(self)
        X = self._validate_data(X, accept_sparse="csr", reset=False)
        return self._predict(X)

    def _predict(self, X):
        """Predict data using the ``centroids_`` of subclusters."""
        kwargs = {"Y_norm_squared": self._subcluster_norms}

        with config_context(assume_finite=True):
            argmin = pairwise_distances_argmin(
                X, self.subcluster_centers_, metric_kwargs=kwargs
            )
        return self.subcluster_labels_[argmin]

    def transform(self, X):
        """Transform X into subcluster centroids dimension.

        Each dimension represents the distance from the sample point to each
        cluster centroid.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            Input data.

        Returns
        -------
        X_trans : {array-like, sparse matrix} of shape (n_samples, n_clusters)
            Transformed data.
        """
        check_is_fitted(self)
        X = self._validate_data(X, accept_sparse="csr", reset=False)
        with config_context(assume_finite=True):
            return euclidean_distances(X, self.subcluster_centers_)

    def _global_clustering(self, X=None):
        """Global clustering for the subclusters obtained after fitting"""
        clusterer = self.n_clusters
        centroids = self.subcluster_centers_
        compute_labels = (X is not None) and self.compute_labels

        # Preprocessing for the global clustering.
        not_enough_centroids = False
        if isinstance(clusterer, Integral):
            clusterer = AgglomerativeClustering(n_clusters=self.n_clusters)
            # There is no need to perform the global clustering step.
            if len(centroids) < self.n_clusters:
                not_enough_centroids = True

        # To use in predict to avoid recalculation.
        self._subcluster_norms = row_norms(self.subcluster_centers_, squared=True)

        if clusterer is None or not_enough_centroids:
            self.subcluster_labels_ = np.arange(len(centroids))
            if not_enough_centroids:
                warnings.warn(
                    "Number of subclusters found (%d) by BIRCH is less "
                    "than (%d). Decrease the threshold."
                    % (len(centroids), self.n_clusters),
                    ConvergenceWarning,
                )
        else:
            # The global clustering step that clusters the subclusters of
            # the leaves. It assumes the centroids of the subclusters as
            # samples and finds the final centroids.
            self.subcluster_labels_ = clusterer.fit_predict(self.subcluster_centers_)

        if compute_labels:
            self.labels_ = self._predict(X)

    def _more_tags(self):
        return {"preserves_dtype": [np.float64, np.float32]}

5: Strengths and Weaknesses of BIRCH

(1) Strengths

Compared with distance-based algorithms, BIRCH has the following strengths

  • Clustering in BIRCH is performed locally: each clustering decision only needs to examine part of the data, which greatly improves efficiency. BIRCH also supports incremental clustering: new data is simply inserted into the CF tree
  • BIRCH accounts for the fact that data is not uniformly distributed, i.e. not every point is equally important: data in dense regions can be summarized by a single subcluster, while data in sparse regions can be treated as outliers and removed
  • BIRCH makes full use of available memory to derive the best possible clusters while minimizing I/O cost. Because clustering is carried out in memory on a height-balanced tree, its running time scales linearly
  • BIRCH is a multi-phase clustering technique: a single scan of the dataset yields a basic clustering, and one or more additional scans can be used to improve its quality

(2) Weaknesses

BIRCH also has weaknesses. Because it uses the notion of a radius to control the extent of a cluster, it does not produce good clusterings when the clusters are not spherical
