Fast nearest-neighbor search in 4D space with Python (vectorization)

For each observation in X (there are 20 in this example), I want to get the k (here 3) nearest neighbors.

How can I make this fast enough to handle 3 to 4 million rows?

Is it possible to speed up the loop that iterates over the elements, maybe via numpy, numba, or some kind of vectorization?

A naive loop in Python:

import numpy as np
from sklearn.neighbors import KDTree

n_points = 20
d_dimensions = 4
k_neighbours = 3

rng = np.random.RandomState(0)
X = rng.random_sample((n_points, d_dimensions))
print(X)

tree = KDTree(X, leaf_size=2, metric='euclidean')

for element in X:
    print('********')
    print(element)
    # when simply using the first row
    #element = X[:1]
    #print(element)
    # potential optimization: query_radius https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree.query_radius
    dist, ind = tree.query([element], k=k_neighbours, return_distance=True, dualtree=False, breadth_first=False, sort_results=True)
    # indices of 3 closest neighbors
    print(ind)
    #[[0 9 1]] !! includes self (element that was searched for)
    print(dist) # distances to 3 closest neighbors
    #[[0. 0.38559188 0.40997835]] !! includes self (element that was searched for)
    # actual returned elements for index:
    print(X[ind])
    ## after removing self
    print(X[ind][0][1:])

Ideally, the output would be a pandas.DataFrame with the following structure:

lat_1,long_1,lat_2,long_2,neighbours_list

0.5488135,0.71518937,0.60276338,0.54488318, [[0.61209572 0.616934 0.94374808 0.6818203 ][0.4236548 0.64589411 0.43758721 0.891773]

Edit

For now, I have a pandas-based implementation:

df = df.dropna() # there are sometimes only parts of the tuple (either left or right) defined
X = df[['lat1', 'long1', 'lat2', 'long2']]
tree = KDTree(X, leaf_size=4, metric='euclidean')
k_neighbours = 3

def neighbors_as_list(row, index, complete_list):
    # one tree query per row; column 0 of the result is the row itself
    dist, ind = index.query([[row['lat1'], row['long1'], row['lat2'], row['long2']]], k=k_neighbours, return_distance=True, dualtree=False, breadth_first=False, sort_results=True)
    return complete_list.values[ind][0][1:]

df['neighbors'] = df.apply(neighbors_as_list, index=tree, complete_list=X, axis=1)
df.head()

But this is very slow.

Edit 2

Sure, here is a pandas version:

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
#n_points = 4_000_000
n_points = 20
d_dimensions = 4
k_neighbours = 3

X = rng.random_sample((n_points, d_dimensions))
df = pd.DataFrame(X)
df = df.reset_index(drop=False)
df.columns = ['id_str', 'lat_1', 'long_1', 'lat_2', 'long_2']
df.id_str = df.id_str.astype(object)
display(df.head())

tree = cKDTree(df[['lat_1', 'long_1', 'lat_2', 'long_2']])
dist, ind = tree.query(X, k=k_neighbours, workers=-1)  # `workers` was named `n_jobs` before SciPy 1.6
display(dist)

ind_out = ind[:, 1:]  # drop column 0, the query point itself
print(df[['lat_1', 'long_1', 'lat_2', 'long_2']].shape)
print(X[ind_out].shape)
X[ind_out]

# fails with
# AssertionError: Shape of new values must be compatible with manager shape
df['neighbors'] = X[ind_out]

But it fails because I cannot assign the result back to the DataFrame.

Solution

You could use SciPy's cKDTree.

Example

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
n_points = 4_000_000
d_dimensions = 4
k_neighbours = 3

X = rng.random_sample((n_points, d_dimensions))

tree = cKDTree(X)
#%timeit tree = cKDTree(X)
#3.74 s ± 29.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#%%timeit
_, ind = tree.query(X, k=k_neighbours, workers=-1)  # `workers` was named `n_jobs` before SciPy 1.6
# drop column 0 (the query point itself); shape=(4000000, 2)
ind_out = ind[:, 1:]
# coordinates of the remaining neighbors; shape=(4000000, 2, 4)
coords_out = X[ind_out]
#7.13 s ± 87.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

About 11 s in total (roughly 3.7 s to build the tree plus 7.1 s for the query) is quite good for a problem of this size.
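
The assignment in the question's second edit fails because the gathered neighbor coordinates form a 3-D array of shape (n_points, k_neighbours - 1, 4), which pandas cannot store as a single column. A minimal sketch of one possible workaround follows. It assumes that an object column holding one small array per row is acceptable, that the column names are those from the question, and that there are no duplicate points (so column 0 of the query result is always the point itself).

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
n_points, d_dimensions, k_neighbours = 20, 4, 3

X = rng.random_sample((n_points, d_dimensions))
df = pd.DataFrame(X, columns=['lat_1', 'long_1', 'lat_2', 'long_2'])

tree = cKDTree(X)
# One batched query for all rows instead of one query per row.
dist, ind = tree.query(X, k=k_neighbours)

ind_out = ind[:, 1:]     # drop the query point itself; shape (n_points, 2)
coords_out = X[ind_out]  # neighbor coordinates; shape (n_points, 2, 4)

# Splitting along the first axis yields one (2, 4) array per row,
# which pandas stores in a column of dtype object.
df['neighbors'] = list(coords_out)
```

For millions of rows an object column is memory-hungry, so keeping `coords_out` as a plain NumPy array next to the DataFrame may be the better choice when the nested layout is not strictly required.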
