

  • Runtime environment: Python 3.7
  • IDE: Spyder 4.1.5




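DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based spatial clustering algorithm first proposed by Martin Ester et al. [8]. It partitions regions of sufficiently high density into k clusters, written here as Cj (j = 1, 2, …, k), and can discover clusters of arbitrary shape in a spatial domain that contains noise. A cluster is defined as a maximal set of density-connected points, and the clustering must satisfy two conditions. Maximality: for any two points p and q, if p belongs to cluster C and q is density-reachable from p, then q also belongs to C. Connectivity: any two points p and q in the same cluster are density-connected to each other. DBSCAN is fast, handles noise points effectively, finds clusters of arbitrary shape, and does not need the number of clusters to be specified in advance. It also has drawbacks: the result depends heavily on its two input parameters, the neighborhood radius (eps) and the minimum number of samples in a neighborhood (min_samples), and on high-dimensional data it is very sensitive to the choice of distance function and suffers from the "curse of dimensionality".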
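To make the ideas of core points and density-reachability concrete, here is a minimal sketch of the algorithm in NumPy. It is an illustration only, not the scikit-learn implementation: the function name simple_dbscan is made up for this example, and it uses a brute-force O(n²) distance matrix.

```python
import numpy as np


def simple_dbscan(X, eps, min_samples):
    """Label each row of X with a cluster id, or -1 for noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances; O(n^2) memory, fine for a small demo.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    # A core point has at least min_samples points (itself included) within eps.
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])

    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster from this unlabeled core point.
        labels[i] = cluster_id
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    stack.append(q)
        cluster_id += 1
    return labels


X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
labels = simple_dbscan(X, eps=3, min_samples=2)
print(labels)
```

The last point has no neighbor within eps, so it never becomes part of a cluster and keeps the noise label -1.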




# -*- coding: utf-8 -*-
"""
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
"""

# Author: Robert Layton <robertlayton@gmail.com>
#         Joel Nothman <joel.nothman@gmail.com>
#         Lars Buitinck
#
# License: BSD 3 clause

import numpy as np
import warnings
from scipy import sparse

from ..base import BaseEstimator, ClusterMixin
from ..utils.validation import _check_sample_weight, _deprecate_positional_args
from ..neighbors import NearestNeighbors

from ._dbscan_inner import dbscan_inner


@_deprecate_positional_args
def dbscan(X, eps=0.5, *, min_samples=5, metric='minkowski',
           metric_params=None, algorithm='auto', leaf_size=30, p=2,
           sample_weight=None, n_jobs=None):
    """Perform DBSCAN clustering from vector array or distance matrix.

    Read more in the :ref:`User Guide <dbscan>`.

    Parameters
    ----------
    X : {array-like, sparse (CSR) matrix} of shape (n_samples, n_features) or \
            (n_samples, n_samples)
        A feature array, or array of distances between samples if
        ``metric='precomputed'``.

    eps : float, default=0.5
        The maximum distance between two samples for one to be considered
        as in the neighborhood of the other. This is not a maximum bound
        on the distances of points within a cluster. This is the most
        important DBSCAN parameter to choose appropriately for your data set
        and distance function.

    min_samples : int, default=5
        The number of samples (or total weight) in a neighborhood for a point
        to be considered as a core point. This includes the point itself.

    metric : string, or callable
        The metric to use when calculating distance between instances in a
        feature array. If metric is a string or callable, it must be one of
        the options allowed by :func:`sklearn.metrics.pairwise_distances` for
        its metric parameter.
        If metric is "precomputed", X is assumed to be a distance matrix and
        must be square during fit.
        X may be a :term:`sparse graph <sparse graph>`,
        in which case only "nonzero" elements may be considered neighbors.

    metric_params : dict, default=None
        Additional keyword arguments for the metric function.

        .. versionadded:: 0.19

    algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
        The algorithm to be used by the NearestNeighbors module
        to compute pointwise distances and find nearest neighbors.
        See NearestNeighbors module documentation for details.

    leaf_size : int, default=30
        Leaf size passed to BallTree or cKDTree. This can affect the speed
        of the construction and query, as well as the memory required
        to store the tree. The optimal value depends
        on the nature of the problem.

    p : float, default=2
        The power of the Minkowski metric to be used to calculate distance
        between points.

    sample_weight : array-like of shape (n_samples,), default=None
        Weight of each sample, such that a sample with a weight of at least
        ``min_samples`` is by itself a core sample; a sample with negative
        weight may inhibit its eps-neighbor from being core.
        Note that weights are absolute, and default to 1.

    n_jobs : int, default=None
        The number of parallel jobs to run for neighbors search. ``None`` means
        1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means
        using all processors. See :term:`Glossary <n_jobs>` for more details.
        If precomputed distance are used, parallel execution is not available
        and thus n_jobs will have no effect.

    Returns
    -------
    core_samples : ndarray of shape (n_core_samples,)
        Indices of core samples.

    labels : ndarray of shape (n_samples,)
        Cluster labels for each point.  Noisy samples are given the label -1.

    See also
    --------
    DBSCAN
        An estimator interface for this clustering algorithm.
    OPTICS
        A similar estimator interface clustering at multiple values of eps. Our
        implementation is optimized for memory usage.

    Notes
    -----
    For an example, see :ref:`examples/cluster/plot_dbscan.py
    <sphx_glr_auto_examples_cluster_plot_dbscan.py>`.

    This implementation bulk-computes all neighborhood queries, which increases
    the memory complexity to O(n.d) where d is the average number of neighbors,
    while original DBSCAN had memory complexity O(n). It may attract a higher
    memory complexity when querying these nearest neighborhoods, depending
    on the ``algorithm``.

    One way to avoid the query complexity is to pre-compute sparse
    neighborhoods in chunks using
    :func:`NearestNeighbors.radius_neighbors_graph
    <sklearn.neighbors.NearestNeighbors.radius_neighbors_graph>` with
    ``mode='distance'``, then using ``metric='precomputed'`` here.

    Another way to reduce memory and computation time is to remove
    (near-)duplicate points and use ``sample_weight`` instead.

    :func:`cluster.optics <sklearn.cluster.optics>` provides a similar
    clustering with lower memory usage.

    References
    ----------
    Ester, M., H. P. Kriegel, J. Sander, and X. Xu, "A Density-Based
    Algorithm for Discovering Clusters in Large Spatial Databases with Noise".
    In: Proceedings of the 2nd International Conference on Knowledge Discovery
    and Data Mining, Portland, OR, AAAI Press, pp. 226-231. 1996

    Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017).
    DBSCAN revisited, revisited: why and how you should (still) use DBSCAN.
    ACM Transactions on Database Systems (TODS), 42(3), 19.
    """

    est = DBSCAN(eps=eps, min_samples=min_samples, metric=metric,
                 metric_params=metric_params, algorithm=algorithm,
                 leaf_size=leaf_size, p=p, n_jobs=n_jobs)
    est.fit(X, sample_weight=sample_weight)
    return est.core_sample_indices_, est.labels_


class DBSCAN(ClusterMixin, BaseEstimator):
    """Perform DBSCAN clustering from vector array or distance matrix.

    DBSCAN - Density-Based Spatial Clustering of Applications with Noise.
    Finds core samples of high density and expands clusters from them.
    Good for data which contains clusters of similar density.

    Read more in the :ref:`User Guide <dbscan>`.

    Parameters
    ----------
    eps : float, default=0.5
        The maximum distance between two samples for one to be considered
        as in the neighborhood of the other. This is not a maximum bound
        on the distances of points within a cluster. This is the most
        important DBSCAN parameter to choose appropriately for your data set
        and distance function.

    min_samples : int, default=5
        The number of samples (or total weight) in a neighborhood for a point
        to be considered as a core point. This includes the point itself.

    metric : string, or callable, default='euclidean'
        The metric to use when calculating distance between instances in a
        feature array. If metric is a string or callable, it must be one of
        the options allowed by :func:`sklearn.metrics.pairwise_distances` for
        its metric parameter.
        If metric is "precomputed", X is assumed to be a distance matrix and
        must be square. X may be a :term:`Glossary <sparse graph>`, in which
        case only "nonzero" elements may be considered neighbors for DBSCAN.

        .. versionadded:: 0.17
           metric *precomputed* to accept precomputed sparse matrix.

    metric_params : dict, default=None
        Additional keyword arguments for the metric function.

        .. versionadded:: 0.19

    algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
        The algorithm to be used by the NearestNeighbors module
        to compute pointwise distances and find nearest neighbors.
        See NearestNeighbors module documentation for details.

    leaf_size : int, default=30
        Leaf size passed to BallTree or cKDTree. This can affect the speed
        of the construction and query, as well as the memory required
        to store the tree. The optimal value depends
        on the nature of the problem.

    p : float, default=None
        The power of the Minkowski metric to be used to calculate distance
        between points.

    n_jobs : int, default=None
        The number of parallel jobs to run.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

    Attributes
    ----------
    core_sample_indices_ : ndarray of shape (n_core_samples,)
        Indices of core samples.

    components_ : ndarray of shape (n_core_samples, n_features)
        Copy of each core sample found by training.

    labels_ : ndarray of shape (n_samples)
        Cluster labels for each point in the dataset given to fit().
        Noisy samples are given the label -1.

    Examples
    --------
    >>> from sklearn.cluster import DBSCAN
    >>> import numpy as np
    >>> X = np.array([[1, 2], [2, 2], [2, 3],
    ...               [8, 7], [8, 8], [25, 80]])
    >>> clustering = DBSCAN(eps=3, min_samples=2).fit(X)
    >>> clustering.labels_
    array([ 0,  0,  0,  1,  1, -1])
    >>> clustering
    DBSCAN(eps=3, min_samples=2)

    See also
    --------
    OPTICS
        A similar clustering at multiple values of eps. Our implementation
        is optimized for memory usage.

    Notes
    -----
    For an example, see :ref:`examples/cluster/plot_dbscan.py
    <sphx_glr_auto_examples_cluster_plot_dbscan.py>`.

    This implementation bulk-computes all neighborhood queries, which increases
    the memory complexity to O(n.d) where d is the average number of neighbors,
    while original DBSCAN had memory complexity O(n). It may attract a higher
    memory complexity when querying these nearest neighborhoods, depending
    on the ``algorithm``.

    One way to avoid the query complexity is to pre-compute sparse
    neighborhoods in chunks using
    :func:`NearestNeighbors.radius_neighbors_graph
    <sklearn.neighbors.NearestNeighbors.radius_neighbors_graph>` with
    ``mode='distance'``, then using ``metric='precomputed'`` here.

    Another way to reduce memory and computation time is to remove
    (near-)duplicate points and use ``sample_weight`` instead.

    :class:`cluster.OPTICS` provides a similar clustering with lower memory
    usage.

    References
    ----------
    Ester, M., H. P. Kriegel, J. Sander, and X. Xu, "A Density-Based
    Algorithm for Discovering Clusters in Large Spatial Databases with Noise".
    In: Proceedings of the 2nd International Conference on Knowledge Discovery
    and Data Mining, Portland, OR, AAAI Press, pp. 226-231. 1996

    Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017).
    DBSCAN revisited, revisited: why and how you should (still) use DBSCAN.
    ACM Transactions on Database Systems (TODS), 42(3), 19.
    """
    @_deprecate_positional_args
    def __init__(self, eps=0.5, *, min_samples=5, metric='euclidean',
                 metric_params=None, algorithm='auto', leaf_size=30, p=None,
                 n_jobs=None):
        self.eps = eps
        self.min_samples = min_samples
        self.metric = metric
        self.metric_params = metric_params
        self.algorithm = algorithm
        self.leaf_size = leaf_size
        self.p = p
        self.n_jobs = n_jobs

    def fit(self, X, y=None, sample_weight=None):
        """Perform DBSCAN clustering from features, or distance matrix.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features), or \
            (n_samples, n_samples)
            Training instances to cluster, or distances between instances if
            ``metric='precomputed'``. If a sparse matrix is provided, it will
            be converted into a sparse ``csr_matrix``.

        sample_weight : array-like of shape (n_samples,), default=None
            Weight of each sample, such that a sample with a weight of at least
            ``min_samples`` is by itself a core sample; a sample with a
            negative weight may inhibit its eps-neighbor from being core.
            Note that weights are absolute, and default to 1.

        y : Ignored
            Not used, present here for API consistency by convention.

        Returns
        -------
        self

        """
        X = self._validate_data(X, accept_sparse='csr')

        if not self.eps > 0.0:
            raise ValueError("eps must be positive.")

        if sample_weight is not None:
            sample_weight = _check_sample_weight(sample_weight, X)

        # Calculate neighborhood for all samples. This leaves the original
        # point in, which needs to be considered later (i.e. point i is in the
        # neighborhood of point i. While True, its useless information)
        if self.metric == 'precomputed' and sparse.issparse(X):
            # set the diagonal to explicit values, as a point is its own
            # neighbor
            with warnings.catch_warnings():
                warnings.simplefilter('ignore', sparse.SparseEfficiencyWarning)
                X.setdiag(X.diagonal())  # XXX: modifies X's internals in-place

        neighbors_model = NearestNeighbors(
            radius=self.eps, algorithm=self.algorithm,
            leaf_size=self.leaf_size, metric=self.metric,
            metric_params=self.metric_params, p=self.p, n_jobs=self.n_jobs)
        neighbors_model.fit(X)
        # This has worst case O(n^2) memory complexity
        neighborhoods = neighbors_model.radius_neighbors(X,
                                                         return_distance=False)

        if sample_weight is None:
            n_neighbors = np.array([len(neighbors)
                                    for neighbors in neighborhoods])
        else:
            n_neighbors = np.array([np.sum(sample_weight[neighbors])
                                    for neighbors in neighborhoods])

        # Initially, all samples are noise.
        labels = np.full(X.shape[0], -1, dtype=np.intp)

        # A list of all core samples found.
        core_samples = np.asarray(n_neighbors >= self.min_samples,
                                  dtype=np.uint8)
        dbscan_inner(core_samples, neighborhoods, labels)

        self.core_sample_indices_ = np.where(core_samples)[0]
        self.labels_ = labels

        if len(self.core_sample_indices_):
            # fix for scipy sparse indexing issue
            self.components_ = X[self.core_sample_indices_].copy()
        else:
            # no core samples
            self.components_ = np.empty((0, X.shape[1]))
        return self

    def fit_predict(self, X, y=None, sample_weight=None):
        """Perform DBSCAN clustering from features or distance matrix,
        and return cluster labels.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features), or \
            (n_samples, n_samples)
            Training instances to cluster, or distances between instances if
            ``metric='precomputed'``. If a sparse matrix is provided, it will
            be converted into a sparse ``csr_matrix``.

        sample_weight : array-like of shape (n_samples,), default=None
            Weight of each sample, such that a sample with a weight of at least
            ``min_samples`` is by itself a core sample; a sample with a
            negative weight may inhibit its eps-neighbor from being core.
            Note that weights are absolute, and default to 1.

        y : Ignored
            Not used, present here for API consistency by convention.

        Returns
        -------
        labels : ndarray of shape (n_samples,)
            Cluster labels. Noisy samples are given the label -1.
        """
        self.fit(X, sample_weight=sample_weight)
        return self.labels_
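The Notes section in the source above suggests precomputing a sparse neighborhood graph and passing it in with metric='precomputed'. A minimal sketch of that idea follows, reusing the six points from the docstring example; it builds the graph in one call rather than in chunks, and the variable names are chosen for this example only.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)
eps = 3.0

# Build a sparse graph that stores only the distances that are <= eps.
nn = NearestNeighbors(radius=eps).fit(X)
graph = nn.radius_neighbors_graph(X, mode='distance')

# Cluster on the sparse graph instead of the raw features.
labels_sparse = DBSCAN(eps=eps, min_samples=2,
                       metric='precomputed').fit(graph).labels_
# Reference run on the raw features for comparison.
labels_dense = DBSCAN(eps=eps, min_samples=2).fit(X).labels_
print(labels_sparse, labels_dense)
```

Both runs should give the same labels, because the sparse graph contains every pair that the dense run would treat as neighbors.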


import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
from sklearn.cluster import DBSCAN

plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']

iris = datasets.load_iris()
X = iris.data[:, :4]
# Take a first look at the data
plt.scatter(X[:, 0], X[:, 1], c="red", marker='o', label='see')
plt.show()

dbscan = DBSCAN(eps=0.4, min_samples=9)  # 1
dbscan.fit(X)   # 2
label_pred = dbscan.labels_  # 3

# Plot the clustering result
x0 = X[label_pred == 0]  # 4
x1 = X[label_pred == 1]  # 4
x2 = X[label_pred == 2]  # 4
x3 = X[label_pred == -1]   # 4
plt.scatter(x0[:, 0], x0[:, 1], c="red", marker='o', label='cluster0')
plt.scatter(x1[:, 0], x1[:, 1], c="green", marker='*', label='cluster1')
plt.scatter(x2[:, 0], x2[:, 1], c="blue", marker='+', label='cluster2')
plt.scatter(x3[:, 0], x3[:, 1], c="black", marker='D', label='noise')
plt.legend()
plt.show()


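  1. dbscan = DBSCAN(eps=0.4, min_samples=9) sets the parameters: the neighborhood radius is 0.4, and a point needs at least 9 samples (itself included) within that radius to count as a core point. These are the last two of the three parameters I mentioned earlier;
  2. dbscan.fit(X) fits the model to the dataset, which needs no further explanation for anyone who has done machine learning;
  3. label_pred = dbscan.labels_ holds the clustering result, i.e. the cluster assigned to each point. A value of -1 means the algorithm considers that point noise. Let's look at the result first, shown in the figure below (to make the data easier to read, I reshaped the computed label_pred from a single column into 6 columns):

  4. The rest groups the data by cluster label and plots each group, so we can see the clustering result.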
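A quick way to inspect the labels without a plot is to count the samples per label. The snippet below is a small sketch that reuses the same iris data and parameters; the names summary and label_table are introduced here only for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import DBSCAN

X = datasets.load_iris().data[:, :4]
label_pred = DBSCAN(eps=0.4, min_samples=9).fit(X).labels_

# Count how many samples fall under each label; -1 is the noise label.
values, counts = np.unique(label_pred, return_counts=True)
summary = {int(v): int(c) for v, c in zip(values, counts)}
for label, count in summary.items():
    name = "noise" if label == -1 else "cluster {}".format(label)
    print("{}: {} samples".format(name, count))

# Iris has 150 rows, so the label vector can be viewed as 25 rows of 6.
label_table = label_pred.reshape(-1, 6)
print(label_table)
```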




def __init__(self, eps=0.5, *, min_samples=5, metric='euclidean',
             metric_params=None, algorithm='auto', leaf_size=30, p=None,
             n_jobs=None):


import pandas as pd
import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from sklearn import metrics

## Part 1
df = pd.read_csv('00-首页数据.csv')
# Drop rows with a missing coordinate; DBSCAN cannot handle NaN values
df = df[['lat_Amap', 'lng_Amap']].dropna(axis=0, how='any')
data = np.array(df)

db = DBSCAN(eps=0.005, min_samples=10).fit(data)
labels = db.labels_
ratio = len(labels[labels[:] == -1]) / len(labels)  # fraction of points labeled as noise
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)  # number of clusters
score = metrics.silhouette_score(data, labels)

df['label'] = labels
sns.lmplot(x='lat_Amap', y='lng_Amap', data=df, hue='label', fit_reg=False)

## Part 2
map_ = folium.Map(location=[31.574729, 120.301663], zoom_start=12,
                  tiles='http://webrd02.is.autonavi.com/appmaptile?lang=zh_cn&size=1&scale=1&style=7&x={x}&y={y}&z={z}',
                  attr='default')

# 20 base colors repeated 6 times, with black as the last entry so that the
# noise label -1 (which indexes the last element) is drawn in black
base_colors = ['#DC143C', '#FFB6C1', '#DB7093', '#C71585', '#8B008B', '#4B0082', '#7B68EE',
               '#0000FF', '#B0C4DE', '#708090', '#00BFFF', '#5F9EA0', '#00FFFF', '#7FFFAA',
               '#008000', '#FFFF00', '#808000', '#FFD700', '#FFA500', '#FF6347']
colors = base_colors * 6 + ['#000000']

for i in range(len(data)):
    folium.CircleMarker(location=[data[i][0], data[i][1]],
                        radius=4, popup='popup',
                        color=colors[labels[i]], fill=True,
                        fill_color=colors[labels[i]]).add_to(map_)

map_.save('all_cluster.html')

Part 1 is the basic clustering, plus a look at the result. Let's go straight to what the sns.lmplot call (colored with hue='label') produces.


db = DBSCAN(eps=0.005, min_samples=10).fit(data)


df['label'] = labels



for i in range(len(data)):
    folium.CircleMarker(location=[data[i][0], data[i][1]],
                        radius=4, popup='popup',
                        color=colors[labels[i]], fill=True,
                        fill_color=colors[labels[i]]).add_to(map_)
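Indexing colors directly with the label only works while the list is longer than the number of clusters. One possible alternative, sketched below, wraps around a shorter palette and handles the noise label explicitly. The helper color_for_label and the four-color palette are hypothetical and not part of the original script.

```python
PALETTE = ['#DC143C', '#0000FF', '#008000', '#FFA500']  # hypothetical short palette
NOISE_COLOR = '#000000'


def color_for_label(label, palette=PALETTE, noise_color=NOISE_COLOR):
    """Return a hex color for a DBSCAN label; -1 (noise) gets noise_color."""
    if label == -1:
        return noise_color
    # Wrap around so any number of clusters maps to a valid palette index.
    return palette[label % len(palette)]


labels = [0, 1, 5, -1]
marker_colors = [color_for_label(label) for label in labels]
print(marker_colors)
```

With this helper the marker loop could pass color_for_label(labels[i]) instead of colors[labels[i]], at the cost of reusing colors once there are more clusters than palette entries.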





# -*- coding: utf-8 -*-
"""
Created on Fri Jun 12 10:39:07 2020

@author: HP
"""
# -*- coding: utf-8 -*-
"""
Created on Wed May 20 08:32:01 2020

@author: HP
"""
import pandas as pd
import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from sklearn import metrics
from math import radians, sin, cos, asin, sqrt
from scipy.spatial.distance import pdist, squareform

sns.set()


def haversine(latlon1, latlon2):
    # Each argument is a (latitude, longitude) pair in degrees, matching
    # the column order ['lat_Amap', 'lng_Amap'] used below
    lat1, lon1 = latlon1
    lat2, lon2 = latlon2
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * asin(sqrt(a))
    r = 6371  # Radius of earth in kilometers. Use 3956 for miles
    return c * r * 1000  # distance in meters


df = pd.read_csv('00-首页数据.csv')
df = df[['lat_Amap', 'lng_Amap']].dropna(axis=0, how='any')
# df['lon_lat'] = df.apply(lambda x: [x['lng_Amap'], x['lat_Amap']], axis=1)
# df = df['lon_lat'].to_frame()
# data = np.array(data)
# plt.figure(figsize=(10, 10))
# plt.scatter(df['lat_Amap'], df['lng_Amap'])
distance_matrix = squareform(pdist(df.values, (lambda u, v: haversine(u, v))))
# eps is in meters here, because the distance matrix is in meters
db = DBSCAN(eps=500, min_samples=10, metric='precomputed').fit_predict(distance_matrix)
labels = db  # fit_predict already returns the label array
'''
db = DBSCAN(eps=0.038, min_samples=3).fit(data)
labels = db
ratio = len(labels[labels[:] == -1]) / len(labels)  # fraction of points labeled as noise
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)  # number of clusters
# score = metrics.silhouette_score(distance_matrix, labels)
df['label'] = labels
sns.lmplot('lat_Amap', 'lng_Amap', df, hue='label', fit_reg=False)'''
df['label'] = labels
sns.lmplot(x='lat_Amap', y='lng_Amap', data=df, hue='label', fit_reg=False)

map_all = folium.Map(location=[31.574729, 120.301663], zoom_start=12,
                     tiles='http://webrd02.is.autonavi.com/appmaptile?lang=zh_cn&size=1&scale=1&style=7&x={x}&y={y}&z={z}',
                     attr='default')

# colors = ['#DC143C', '#FFB6C1', '#DB7093', '#C71585', '#8B008B', '#4B0082', '#7B68EE',
#          '#0000FF', '#B0C4DE', '#708090', '#00BFFF', '#5F9EA0', '#00FFFF', '#7FFFAA',
#          '#008000', '#FFFF00', '#808000', '#FFD700', '#FFA500', '#FF6347', '#000000']

# 20 base colors repeated 6 times, with black as the last entry
base_colors = ['#DC143C', '#FFB6C1', '#DB7093', '#C71585', '#8B008B', '#4B0082', '#7B68EE',
               '#0000FF', '#B0C4DE', '#708090', '#00BFFF', '#5F9EA0', '#00FFFF', '#7FFFAA',
               '#008000', '#FFFF00', '#808000', '#FFD700', '#FFA500', '#FF6347']
colors = base_colors * 6 + ['#000000']

for i in range(len(df)):
    if labels[i] == -1:
        continue  # do not draw noise points
    folium.CircleMarker(location=[df.iloc[i, 0], df.iloc[i, 1]],
                        radius=4, popup='popup',
                        color=colors[labels[i]], fill=True,
                        fill_color=colors[labels[i]]).add_to(map_all)

map_all.save('all_cluster.html')


distance_matrix = squareform(pdist(df, (lambda u, v: haversine(u, v))))
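Calling a Python lambda for every pair of points gets slow as the data grows. A vectorized version of the same haversine formula can be sketched with NumPy broadcasting as below. The function name haversine_matrix and the two sample points are made up for this example, and the Earth radius is the same 6371 km used in haversine().

```python
import numpy as np

EARTH_RADIUS_M = 6371000.0  # same 6371 km radius as in haversine()


def haversine_matrix(latlon_deg):
    """Pairwise great-circle distances in meters for (lat, lon) rows in degrees."""
    latlon = np.radians(np.asarray(latlon_deg, dtype=float))
    lat = latlon[:, 0]
    lon = latlon[:, 1]
    # Broadcasting builds every pairwise difference at once.
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    # Clip guards against rounding slightly above 1 before arcsin.
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


points = np.array([[0.0, 0.0], [0.0, 1.0]])  # made-up points on the equator
distance_matrix = haversine_matrix(points)
print(distance_matrix)
```

The result is a full n × n matrix, so memory still grows quadratically with the number of points.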



db = DBSCAN(eps=500, min_samples=10, metric='precomputed').fit_predict(distance_matrix)
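As an alternative to building the full distance matrix, scikit-learn also accepts metric='haversine' with a ball tree. That metric expects [latitude, longitude] in radians and measures distance in radians, so the 500 m radius has to be divided by the Earth radius. The sketch below uses synthetic coordinates, not the author's data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6371000.0

# Synthetic (lat, lon) points in degrees: two tight groups and one far-away point.
group_a = np.array([[31.5700 + 0.0001 * i, 120.3000] for i in range(10)])
group_b = np.array([[31.6000 + 0.0001 * i, 120.3500] for i in range(10)])
outlier = np.array([[31.7000, 120.5000]])
coords = np.vstack([group_a, group_b, outlier])

# A 500 m radius expressed in radians on the unit sphere.
eps_rad = 500.0 / EARTH_RADIUS_M
model = DBSCAN(eps=eps_rad, min_samples=5,
               algorithm='ball_tree', metric='haversine')
labels = model.fit_predict(np.radians(coords))
print(labels)
```

Each group spans roughly 100 m, so every point in a group is a core point, while the isolated point is labeled -1.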






score = metrics.silhouette_score(data, labels)
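Note that silhouette_score treats the noise label -1 like any other label and raises a ValueError when fewer than two distinct labels are present. One possible way to handle both cases is to score only the clustered points, as in the sketch below. The helper silhouette_without_noise and the sample data are hypothetical.

```python
import numpy as np
from sklearn import metrics


def silhouette_without_noise(data, labels):
    """Silhouette score over non-noise points, or None if it is undefined."""
    data = np.asarray(data)
    labels = np.asarray(labels)
    mask = labels != -1
    n_clusters = len(set(labels[mask]))
    # The score needs at least 2 clusters and more samples than clusters.
    if n_clusters < 2 or mask.sum() <= n_clusters:
        return None
    return metrics.silhouette_score(data[mask], labels[mask])


data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [50.0, 50.0]])
labels = np.array([0, 0, 1, 1, -1])
score = silhouette_without_noise(data, labels)
print(score)
```

Whether noise should be excluded depends on what is being compared; excluding it rewards parameter settings that discard many points, so the noise ratio is worth tracking alongside the score.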


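For clustering algorithms, the Silhouette Coefficient can be used to evaluate how well the samples are clustered; it is computed as shown in Eq. (1) and Eq. (2).

In the equations above, s(i) is the silhouette coefficient of sample i. The closer it is to 1, the more appropriate the assignment of sample i; the closer to -1, the more likely sample i belongs in another cluster; the closer to 0, the more sample i lies on the boundary between two clusters. a(i) is the within-cluster dissimilarity of sample i, i.e. its average distance to the other samples in the same cluster; the smaller it is, the more sample i belongs in that cluster. b(i) is the between-cluster dissimilarity of sample i, computed as in Eq. (3).

In Eq. (3), bij denotes the average distance from sample i to all samples of some other cluster Cj.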
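For reference, the standard silhouette definitions that match this description can be written as follows. The numbering is assumed to line up with Eq. (1) to Eq. (3) cited in the text, and the minimum in Eq. (3) runs over the clusters that do not contain sample i.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \tag{1}

s(i) =
\begin{cases}
1 - \dfrac{a(i)}{b(i)}, & a(i) < b(i) \\
0, & a(i) = b(i) \\
\dfrac{b(i)}{a(i)} - 1, & a(i) > b(i)
\end{cases} \tag{2}

b(i) = \min\{b_{i1}, b_{i2}, \dots, b_{ik}\} \tag{3}
```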



res = []
# Iterate over different eps values
for eps in np.arange(0.001, 0.13, 0.001):
    # Iterate over different min_samples values
    for min_samples in range(2, 11):
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        # Fit the model
        dbscan.fit(data)
        # Number of clusters for this parameter combination (-1 marks outliers)
        n_clusters = len([i for i in set(dbscan.labels_) if i != -1])
        # Number of outliers
        outliers = np.sum(np.where(dbscan.labels_ == -1, 1, 0))
        # Count the samples in each cluster
        # stats = pd.Series([i for i in dbscan.labels_ if i != -1]).value_counts()
        # Compute the clustering score
        try:
            score = metrics.silhouette_score(data, dbscan.labels_)
        except ValueError:
            # silhouette_score fails when there are fewer than 2 distinct labels
            score = -99
        res.append({'eps': eps, 'min_samples': min_samples,
                    'n_clusters': n_clusters, 'outliers': outliers,
                    'score': score})
# Store the results of all iterations in a DataFrame
result = pd.DataFrame(res)


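  1. eps is tuned over [0.001, 0.13) with a step of 0.001 (np.arange excludes the upper bound). The range comes from the characteristics of the data, so you need some understanding of the data before you can set it. In other words, the smallest neighborhood radius tried is 0.001 and the largest is just under 0.13. Note that Euclidean distance is used here;
  2. min_samples is tuned over [2, 10] with a step of 1 (range(2, 11) stops before 11). A cluster should contain at least two points, and if the minimum goes beyond the upper end of this range, all the points end up in the same class, which is how the range was chosen.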
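Once the grid has been evaluated, the best combination can be read off the results table. The sketch below uses a few made-up rows with the same column names as the result DataFrame; the values are purely illustrative.

```python
import pandas as pd

# Hypothetical rows in the same shape as the `result` DataFrame.
result = pd.DataFrame([
    {'eps': 0.004, 'min_samples': 5, 'n_clusters': 1, 'score': -99.0},
    {'eps': 0.005, 'min_samples': 10, 'n_clusters': 4, 'score': 0.41},
    {'eps': 0.006, 'min_samples': 8, 'n_clusters': 3, 'score': 0.57},
])

# Keep only runs that produced a valid score and at least two clusters.
valid = result[(result['score'] > -99) & (result['n_clusters'] >= 2)]
# Highest silhouette score first.
best = valid.sort_values('score', ascending=False).iloc[0]
print(best['eps'], best['min_samples'], best['score'])
```

The highest silhouette score is only one criterion; in practice the number of clusters and the share of noise points for that row are worth checking too.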





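  1. How the DBSCAN clustering algorithm works
  2. Implementing DBSCAN clustering with Python machine learning
  3. Clustering with Euclidean distance
  4. Clustering with a distance matrix computed from actual distances
  5. Putting the clustering result on a map
  6. Tuning the clustering parameters using the silhouette coefficient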



