I have two dataframes. Each one contains locations (X,Y) and a value for that point. For each point in the first dataframe I want to find the closest point in the second dataframe and then find the difference. I have code that is working, but it uses a for loop, which is slow.
Any suggestions for how to speed this up? I know that it is generally a good idea to get rid of for loops in pandas, for performance, but I don't see how to do that in this case.
Here is some sample code:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.rand(10, 3), columns=['val', 'X', 'Y'])
df2 = pd.DataFrame(np.random.rand(10, 3), columns=['val', 'X', 'Y'])
nearest = df1.copy()  # CORRECTION: this had been just =df1, which caused a problem when comparing to submitted answers.
for idx, row in nearest.iterrows():
    # Find the point in df2 closest to the selected point:
    closest = df2.loc[((df2['X'] - row['X'])**2 + (df2['Y'] - row['Y'])**2).idxmin()]
    # Set 'val' to the difference between the current row and the nearest one:
    nearest.loc[idx, 'val'] = df1.loc[idx, 'val'] - closest['val']
As I am using this on larger dataframes, it takes a long time to do the calculation.
Thanks,
One cool solution to your problem involves leveraging the complex data type (built into both Python and NumPy).
import numpy as np
import pandas as pd
df1=pd.DataFrame(np.random.rand(10,3), columns=['val', 'X', 'Y'])
df2=pd.DataFrame(np.random.rand(10,3), columns=['val', 'X', 'Y'])
# dataframes to numpy arrays of complex numbers
p1 = (df1['X'] + 1j * df1['Y']).values
p2 = (df2['X'] + 1j * df2['Y']).values
# calculate all the distances, between each point in
# df1 and each point in df2 (using an array-broadcasting trick)
all_dists = abs(p1[..., np.newaxis] - p2)
# nearest_idxs1[j]: index of the df1 point nearest to df2 point j
nearest_idxs1 = np.argmin(all_dists, axis=0)
# nearest_idxs2[i]: index of the df2 point nearest to df1 point i
nearest_idxs2 = np.argmin(all_dists, axis=1)
# extract the corresponding rows from the dataframes
# (argmin returns positions, so use .iloc)
nearest_points1 = df1.iloc[nearest_idxs1].reset_index()
nearest_points2 = df2.iloc[nearest_idxs2].reset_index()
This is probably much faster than using a loop, but if your series turn out to be huge, it will consume a lot of memory (quadratic in the number of points).
Also, this solution works when the two sets of points have different lengths.
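If the quadratic memory cost becomes a problem, a spatial index avoids materializing the full distance matrix. Here is a minimal sketch using SciPy's cKDTree (assuming SciPy is available in your environment), applied to the small example data used below:

```python
import pandas as pd
from scipy.spatial import cKDTree

df1 = pd.DataFrame([[987, 0, 0], [888, 2, 2], [2345, 3, 3]], columns=['val', 'X', 'Y'])
df2 = pd.DataFrame([[1000, 1, 1], [2000, 9, 9]], columns=['val', 'X', 'Y'])

# Build a KD-tree on df2's coordinates, then query it with every point of df1.
tree = cKDTree(df2[['X', 'Y']].values)
dists, idxs = tree.query(df1[['X', 'Y']].values, k=1)

# idxs[i] is the position in df2 of the point nearest to df1 point i
diff = df1['val'] - df2['val'].values[idxs]
```

This runs in O(n log n) time and linear memory, so it scales much better than the pairwise distance matrix for large frames.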
Here's a concrete example demonstrating how this works:
df1 = pd.DataFrame([ [987, 0, 0], [888, 2,2], [2345, 3,3] ], columns=['val', 'X', 'Y'])
df2 = pd.DataFrame([ [ 1000, 1, 1 ], [2000, 9, 9] ] , columns=['val', 'X', 'Y'])
df1
val X Y
0 987 0 0
1 888 2 2
2 2345 3 3
df2
val X Y
0 1000 1 1
1 2000 9 9
Here, for every point in df1, df2[0] = (1, 1) is the nearest point (as shown in nearest_idxs2
below). Considering the opposite problem: for (1, 1), the points (0, 0) and (2, 2) are equally near (argmin picks the first, index 0), and for (9, 9), df1[2] = (3, 3) is the nearest (as shown in nearest_idxs1
below).
p1 = (df1['X'] + 1j * df1['Y']).values
p2 = (df2['X'] + 1j * df2['Y']).values
all_dists = abs(p1[..., np.newaxis] - p2)
nearest_idxs1 = np.argmin(all_dists, axis = 0)
nearest_idxs2 = np.argmin(all_dists, axis = 1)
nearest_idxs1
array([0, 2])
nearest_idxs2
array([0, 0, 0])
# It's nearest_points2 you're after:
nearest_points2 = df2.iloc[nearest_idxs2].reset_index()
nearest_points2
index val X Y
0 0 1000 1 1
1 0 1000 1 1
2 0 1000 1 1
df1['val'] - nearest_points2['val']
0 -13
1 -112
2 1345
To solve the opposite problem (for each point in df2, find the nearest point in df1), take nearest_points1
and compute df2['val'] - nearest_points1['val'].
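Spelled out in full on the same example data, the reverse direction looks like this:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[987, 0, 0], [888, 2, 2], [2345, 3, 3]], columns=['val', 'X', 'Y'])
df2 = pd.DataFrame([[1000, 1, 1], [2000, 9, 9]], columns=['val', 'X', 'Y'])

p1 = (df1['X'] + 1j * df1['Y']).values
p2 = (df2['X'] + 1j * df2['Y']).values
all_dists = abs(p1[..., np.newaxis] - p2)

# For each point in df2, the position of its nearest neighbor in df1
nearest_idxs1 = np.argmin(all_dists, axis=0)
nearest_points1 = df1.iloc[nearest_idxs1].reset_index()

diff = df2['val'] - nearest_points1['val']
```

Here nearest_idxs1 is [0, 2]: (1, 1) is nearest to df1's (0, 0), and (9, 9) is nearest to df1's (3, 3).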
Source: https://stackoverflow.com/questions/28612773/how-to-speed-up-nearest-search-in-pandas-perhaps-by-vectorizing-code