A Brief Overview of the ST-DBSCAN Algorithm and a Python Implementation

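Spatial clustering is the process of grouping a large spatial data set according to some similarity measure. It is a form of unsupervised machine learning and is widely used in fields such as geographic information systems, ecology and environmental science, military applications, and market analysis. Spatial clustering can uncover information or knowledge hidden in a spatial data set, including the aggregation tendency of spatial entities, their distribution patterns, and how they change over time. Clustering methods fall roughly into five families: hierarchical, partitioning, density-based, grid-based, and model-based.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a classic density-based clustering algorithm, proposed in 1996 by M. Ester et al. [1]. DBSCAN is controlled by two parameters, MinPts and Epsilon: Epsilon is the radius of the neighborhood around a point, and MinPts is the minimum number of points that neighborhood must contain for the point to count as a core point.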
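To make the roles of the two parameters concrete, here is a minimal sketch of the Epsilon-neighborhood and the core-point test. The function names `eps_neighbors` and `is_core_point` and the toy coordinates are made up for illustration. Following the convention of the listing later in this post, a point is not counted as its own neighbor (the original DBSCAN paper counts it).

```python
import numpy as np


def eps_neighbors(points, i, eps):
    """Indices of all points within Euclidean distance eps of points[i], excluding i."""
    dists = np.linalg.norm(points - points[i], axis=1)
    return [int(j) for j in np.flatnonzero(dists <= eps) if j != i]


def is_core_point(points, i, eps, min_pts):
    """A point is a core point if its eps-neighborhood holds at least min_pts points."""
    return len(eps_neighbors(points, i, eps)) >= min_pts


# Four points packed closely together plus one isolated point (hypothetical data).
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])

print(eps_neighbors(points, 0, eps=0.2))             # [1, 2, 3]
print(is_core_point(points, 0, eps=0.2, min_pts=3))  # True
print(is_core_point(points, 4, eps=0.2, min_pts=3))  # False
```

With these definitions, the isolated point has an empty neighborhood, so it can never be a core point. Unless it falls inside the neighborhood of some core point, DBSCAN labels it as noise.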
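ST-DBSCAN (Spatial-Temporal DBSCAN) was proposed in 2007 by Derya Birant et al. [2] for research on the marine environment. It was developed from DBSCAN and clusters along one additional dimension. Note that the constraint on this extra dimension does not have to be a temporal distance: it can be any dimension that is uncorrelated with the two spatial dimensions, such as elevation, color, temperature, or mass.
[Figure: the ST-DBSCAN algorithm listing, excerpted from the original paper [2]]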

[Figure: schematic of the two algorithms, DBSCAN on the left and ST-DBSCAN on the right]
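The difference between the two neighborhood definitions can be sketched in a few lines. In ST-DBSCAN a point only counts as a neighbor if it is close in space (within Eps1) and also close in the extra dimension (within Eps2). The sketch below assumes planar coordinates in meters and timestamps in minutes. The function name `st_neighbors` and the sample values are hypothetical.

```python
import numpy as np


def st_neighbors(xy, t, i, eps1, eps2):
    """Indices of points within eps1 (space) AND eps2 (time) of point i, excluding i."""
    spatial = np.linalg.norm(xy - xy[i], axis=1)  # spatial distance to point i
    temporal = np.abs(t - t[i])                   # non-spatial (here: temporal) distance
    mask = (spatial <= eps1) & (temporal <= eps2)
    mask[i] = False                               # a point is not its own neighbor
    return [int(j) for j in np.flatnonzero(mask)]


xy = np.array([[0.0, 0.0], [100.0, 0.0], [100.0, 100.0], [2000.0, 0.0]])  # meters
t = np.array([0.0, 30.0, 90.0, 10.0])                                      # minutes

# Point 1 is near in both space and time; point 2 is near in space but 90 minutes
# away; point 3 is near in time but 2000 m away.
print(st_neighbors(xy, t, 0, eps1=500, eps2=60))  # [1]
```

Dropping the temporal condition from the mask gives the plain DBSCAN neighborhood, which for point 0 would also include point 2.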

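As an example, ST-DBSCAN clustering is run on the first 100 records of a bike-sharing data set from one city. The ST-DBSCAN parameters are set to: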
spatial_threshold = 500 # meters
temporal_threshold = 60 # minutes
min_neighbors = 3 # points
The results are as follows:
[Figure: clustering results]

The code below was run with Python 3.8:

main.py

import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import STDBSCAN

# Path to the input CSV: first command-line argument, or TestData.csv by default.
csv_path = sys.argv[1] if len(sys.argv) > 1 else 'TestData.csv'

# df_table must have the columns: 'latitude', 'longitude' and 'date_time'
a = pd.read_csv(csv_path)
aa = a.head(n=100)
# Column names are those of the test data set ('D-lat' is paired with 'O-lng' here).
df_table = aa[['sn', 'O-Time', 'D-lat', 'O-lng']].copy()
df_table.columns = ['id', 'date_time', 'latitude', 'longitude']
df_table['date_time'] = pd.to_timedelta(df_table['date_time'].astype(str))

spatial_threshold = 500  # meters
temporal_threshold = 60  # minutes
min_neighbors = 3  # points
df_clustering = STDBSCAN.ST_DBSCAN(df_table, spatial_threshold, temporal_threshold, min_neighbors)
print(df_clustering)


def plot_clusters(df):
    labels = df['cluster'].values
    X = df[['longitude', 'latitude']].values

    # Black removed and is used for noise instead.
    unique_labels = set(labels)
    colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Black used for noise
            col = [0, 0, 0, 1]
        class_member_mask = (labels == k)
        xy = X[class_member_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
                 markeredgecolor='k', markersize=6)

    # Do not count the noise label (-1) as a cluster.
    n_clusters = len(unique_labels - {-1})
    plt.title('ST-DBSCAN: #n of clusters {}'.format(n_clusters))
    plt.xlabel('longitude(E)')
    plt.ylabel('latitude(N)')
    plt.show()


print(df_clustering['cluster'].value_counts())
plot_clusters(df_clustering)
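When reading the output, it helps to separate real clusters from noise, since the label -1 is reserved for noise in this implementation. The following is a small sketch of such a summary. The helper name `summarize_clusters` and the sample labels are hypothetical and are not part of the original script.

```python
import pandas as pd

NOISE = -1  # label used for noise points in the listing above


def summarize_clusters(labels):
    """Return the number of clusters, the number of noise points, and cluster sizes."""
    labels = pd.Series(labels)
    sizes = labels[labels != NOISE].value_counts().sort_index()
    return {
        'n_clusters': int(sizes.size),
        'n_noise': int((labels == NOISE).sum()),
        'sizes': {int(k): int(v) for k, v in sizes.items()},
    }


# Hypothetical labels: two clusters (sizes 3 and 4) and two noise points.
summary = summarize_clusters([1, 1, 1, 2, 2, -1, -1, 2, 2])
print(summary)  # {'n_clusters': 2, 'n_noise': 2, 'sizes': {1: 3, 2: 4}}
```

In the script above, this could be called as `summarize_clusters(df_clustering['cluster'])` once clustering has finished.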

STDBSCAN.py

from datetime import timedelta

from geopy.distance import great_circle

"""
INPUTS:
    df = {o1, o2, ..., on}  Set of objects
    spatial_threshold  = Maximum geographical coordinate (spatial) distance value
    temporal_threshold = Maximum non-spatial distance value
    min_neighbors      = Minimum number of points within Eps1 and Eps2 distance
OUTPUT:
    C = {c1, c2, ..., ck}  Set of clusters
"""


def ST_DBSCAN(df, spatial_threshold, temporal_threshold, min_neighbors):
    cluster_label = 0
    NOISE = -1
    UNMARKED = 777777
    stack = []

    # initialize each point with unmarked
    df['cluster'] = UNMARKED

    # for each point in database
    for index, point in df.iterrows():
        if df.at[index, 'cluster'] == UNMARKED:
            neighborhood = retrieve_neighbors(index, df, spatial_threshold, temporal_threshold)

            if len(neighborhood) < min_neighbors:
                df.at[index, 'cluster'] = NOISE
            else:  # found a core point
                cluster_label = cluster_label + 1
                # assign a label to core point
                df.at[index, 'cluster'] = cluster_label

                # assign core's label to its neighborhood
                for neig_index in neighborhood:
                    df.at[neig_index, 'cluster'] = cluster_label
                    stack.append(neig_index)  # append neighborhood to stack

                # find new neighbors from core point neighborhood
                while len(stack) > 0:
                    current_point_index = stack.pop()
                    new_neighborhood = retrieve_neighbors(
                        current_point_index, df, spatial_threshold, temporal_threshold)

                    # current_point is a new core
                    if len(new_neighborhood) >= min_neighbors:
                        for neig_index in new_neighborhood:
                            neig_cluster = df.at[neig_index, 'cluster']
                            # only unmarked points are added; points already
                            # labelled as noise keep the noise label
                            if neig_cluster == UNMARKED:
                                # TODO: verify cluster average before add new point
                                df.at[neig_index, 'cluster'] = cluster_label
                                stack.append(neig_index)
    return df


def retrieve_neighbors(index_center, df, spatial_threshold, temporal_threshold):
    neighborhood = []
    center_point = df.loc[index_center]

    # filter by time
    min_time = center_point['date_time'] - timedelta(minutes=temporal_threshold)
    max_time = center_point['date_time'] + timedelta(minutes=temporal_threshold)
    df = df[(df['date_time'] >= min_time) & (df['date_time'] <= max_time)]

    # filter by distance
    for index, point in df.iterrows():
        if index != index_center:
            distance = great_circle(
                (center_point['latitude'], center_point['longitude']),
                (point['latitude'], point['longitude'])).meters
            if distance <= spatial_threshold:
                neighborhood.append(index)
    return neighborhood
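The neighbor search loops over rows in Python and is called once per visited point, so it dominates the running time on larger inputs. One possible alternative is to compute all distances at once with NumPy, as in the sketch below. This is not the implementation used in this post. It replaces geopy's `great_circle` with a hand-written haversine formula on a spherical Earth (mean radius assumed to be 6,371,009 m), so distances may differ slightly. The names `haversine_m` and `retrieve_neighbors_vectorized` are hypothetical, and the index of `df` is assumed to be unique.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_009.0  # assumed mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


def retrieve_neighbors_vectorized(index_center, df, spatial_threshold, temporal_threshold):
    """Same contract as retrieve_neighbors: index labels of the neighbors of index_center."""
    center = df.loc[index_center]
    window = pd.Timedelta(minutes=temporal_threshold)

    # temporal filter: |t - t_center| <= temporal_threshold
    in_time = ((df['date_time'] - center['date_time']).abs() <= window).to_numpy()

    # spatial filter: great-circle distance <= spatial_threshold
    dist = haversine_m(float(center['latitude']), float(center['longitude']),
                       df['latitude'].to_numpy(), df['longitude'].to_numpy())
    mask = in_time & (dist <= spatial_threshold)
    mask[df.index.get_loc(index_center)] = False  # exclude the center point itself
    return df.index[mask].tolist()


# Hypothetical data: row 1 is about 111 m and 10 minutes away from row 0,
# row 2 is about 1.1 km away, row 3 is close in space but 120 minutes away.
demo = pd.DataFrame({
    'latitude': [0.0, 0.0, 0.0, 0.0],
    'longitude': [0.0, 0.001, 0.01, 0.001],
    'date_time': pd.to_timedelta([0, 10, 20, 120], unit='min'),
})
print(retrieve_neighbors_vectorized(0, demo, 500, 60))  # [1]
```

Because the function takes the same arguments and returns the same kind of list, it could be swapped in for `retrieve_neighbors` to compare results and timings on the test data.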

* Test data

References
[1] Ester M, Kriegel H P, Sander J, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, 1996, 96(34): 226-231.
[2] Birant D, Kut A. ST-DBSCAN: An algorithm for clustering spatial-temporal data. Data & Knowledge Engineering, 2007, 60(1): 208-221.
