Fast fuse of close points in a numpy-2d (vectorized)

问题

I have a question similar to the question asked here: simple way of fusing a few close points. I want to replace points that are located close to each other with the average of their coordinates. The closeness in cells is specified by the user (I am talking about euclidean distance).

In my case I have a lot of points (about 1-million). This method is working, but is time consuming as it uses a double for loop.

Is there a faster way to detect and fuse close points in a numpy 2d array?

To be complete I added an example:

points=array([[  382.49056159,   640.1731949 ],
   [  496.44669161,   655.8583119 ],
   [ 1255.64762859,   672.99699399],
   [ 1070.16520917,   688.33538171],
   [  318.89390168,   718.05989421],
   [  259.7106383 ,   822.2       ],
   [  141.52574427,    28.68594436],
   [ 1061.13573287,    28.7094536 ],
   [  820.57417943,    84.27702407],
   [  806.71416007,   108.50307828]])

A scatterplot of the points is visible below. The red circle indicates the points located close to each other (in this case a distance of 27.91 between the last two points in the array). So if the user would specify a minimum distance of 30 these points should be fused.

In the output of the fuse function the last to points are fused. This will look like:

#output
array([[  382.49056159,   640.1731949 ],
   [  496.44669161,   655.8583119 ],
   [ 1255.64762859,   672.99699399],
   [ 1070.16520917,   688.33538171],
   [  318.89390168,   718.05989421],
   [  259.7106383 ,   822.2       ],
   [  141.52574427,    28.68594436],
   [ 1061.13573287,    28.7094536 ],
   [  813.64416975,    96.390051175]])

回答1:

If you have a large number of points then it may be faster to build a k-D tree using scipy.spatial.cKDTree, then query it for pairs of points that are closer than some threshold:

import numpy as np
from scipy.spatial import cKDTree

tree = cKDTree(points)
rows_to_fuse = tree.query_pairs(r=30)    

print(repr(rows_to_fuse))
# {(8, 9)}

print(repr(points[list(rows_to_fuse)]))
# array([[ 820.57417943,   84.27702407],
#        [ 806.71416007,  108.50307828]])

The major advantage of this approach is that you don't need to compute the distance between every pair of points in your dataset.

回答2:

You can use scipy's distance functions such as pdist in order to quickly find which points should be merged:

import numpy as np
from scipy.spatial.distance import pdist, squareform

d = squareform(pdist(a))
d = np.ma.array(d, mask=np.isclose(d, 0))
a[d.min(axis=1) < 30]
#array([[ 820.57417943,   84.27702407],
#       [ 806.71416007,  108.50307828]])

NOTE

For large samples this method can cause memory errors since it is storing a full matrix containing the relative distances.

来源：https://stackoverflow.com/questions/36985185/fast-fuse-of-close-points-in-a-numpy-2d-vectorized

标签

python

arrays

numpy

scipy

distance