Question
I have a DataFrame and a lookup table. For each key in the DataFrame, I would like to look up the corresponding row in the lookup table and calculate the Euclidean distance over a number of columns. Mock data looks like this:
import pandas as pd
import numpy.random as rand
df = pd.DataFrame({'key': rand.randint(0, 5, 10),
                   'X': rand.randn(10),
                   'Y': rand.randn(10),
                   'Z': rand.randn(10)})
X Y Z key
0 0.163142 0.387871 -0.433157 3
1 -2.020957 -1.537615 -1.996704 0
2 1.249118 1.633246 0.028222 1
3 -0.019601 1.757136 0.787936 2
4 -0.039649 1.380557 0.123677 0
5 0.500814 -1.049591 -1.261868 3
6 1.175576 -0.310895 0.549420 0
7 -0.152696 0.139020 0.887219 2
8 0.491099 0.434652 0.791038 2
9 -0.231334 0.264414 0.913475 4
lookup = pd.DataFrame({'X': rand.randn(6),
                       'Y': rand.randn(6),
                       'Z': rand.randn(6)})
X Y Z
0 0.242419 -0.630230 -0.254344
1 0.799573 0.354169 1.099456
2 -0.754582 -1.882192 -1.270382
3 -1.645707 -0.131905 -0.445954
4 0.743351 0.456220 0.975457
5 0.136197 0.278329 -2.336110
For example, row 0 of df has the values
df.loc[0,'X':'Z'].values
[0.163142,0.387871,-0.433157]
The key of that row is 3, so the corresponding row in the lookup table is
lookup.iloc[3,:].values
[-1.645707 -0.131905 -0.445954]
The distance is
import numpy as np
np.linalg.norm(np.array([0.163142, 0.387871, -0.433157]) - np.array([-1.645707, -0.131905, -0.445954]))
1.88209 (rounded)
I would like to do this for every row in df and return the value as a new column. Is there a slick way to do this?
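Answer 1:
If I understand correctly, you can use reindex here (df1 is the question's df and df2 is its lookup):
import scipy.spatial.distance
[scipy.spatial.distance.euclidean(df1.iloc[:,:3].values[i], df2.reindex(df1.key).values[i]) for i in range(len(df1))]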
Out[440]:
[1.882090741219987,
2.9970046421720804,
1.7279094194170017,
4.245182958491777,
2.0653635497011176,
2.47293664565694,
1.2723181192492703,
3.0170858093764914,
3.341996363028691,
0.9953100819267331]
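The reindex call above is what lines the lookup rows up with the rows of the DataFrame. Below is a minimal sketch of that step on its own, using small hand-written values (hypothetical, not the question's random data) so the alignment is easy to read. It assumes the lookup table has a unique index.

```python
import pandas as pd

# Hypothetical hand-written data, not the question's random values.
df = pd.DataFrame({'key': [1, 0, 1],
                   'X': [0.0, 1.0, 2.0],
                   'Y': [0.0, 1.0, 2.0],
                   'Z': [0.0, 1.0, 2.0]})
lookup = pd.DataFrame({'X': [3.0, 0.0],
                       'Y': [4.0, 0.0],
                       'Z': [0.0, 1.0]})

# Reindexing by the key column repeats lookup rows so that row i of
# `aligned` is the lookup row that belongs to row i of df.
aligned = lookup.reindex(df['key'])

# A key that is absent from lookup (here 7) yields a row of NaN, not an error.
with_missing = lookup.reindex([0, 7])
```

Because missing keys turn into NaN rows, any distance computed from them would also be NaN, which may or may not be the behaviour you want.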
Answer 2:
Vectorized approach:
In [88]: (df.merge(lookup, left_on='key', right_index=True, suffixes=['1','2'])
...: .eval("sqrt((X1-X2)**2 + (Y1-Y2)**2 + (Z1-Z2)**2)"))
...:
Out[88]:
0 1.041056
5 2.381120
1 2.832168
4 1.549664
6 1.725080
2 2.593081
3 3.096872
7 2.211651
8 1.800886
9 2.976105
dtype: float64
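Note that the merged result above is not in the original row order (its index runs 0, 5, 1, 4, ...), although it keeps the index labels of df. The sketch below suggests one way to store it as a new column, relying on the fact that pandas aligns on index labels during assignment. The data is hand-written for illustration, and how='left' is an added assumption so that rows with unmatched keys are kept.

```python
import numpy as np
import pandas as pd

# Hypothetical hand-written data, not the question's random values.
df = pd.DataFrame({'key': [1, 0, 1],
                   'X': [0.0, 1.0, 2.0],
                   'Y': [0.0, 1.0, 2.0],
                   'Z': [0.0, 1.0, 2.0]})
lookup = pd.DataFrame({'X': [3.0, 0.0],
                       'Y': [4.0, 0.0],
                       'Z': [0.0, 1.0]})

# how='left' keeps every row of df; overlapping column names get the suffixes.
merged = df.merge(lookup, how='left', left_on='key', right_index=True,
                  suffixes=('1', '2'))
dist = merged.eval("sqrt((X1-X2)**2 + (Y1-Y2)**2 + (Z1-Z2)**2)")

# Assignment aligns on index labels, so the row order of `dist` does not matter.
df['distance'] = dist
```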
Answer 3:
A somewhat cleaner and much faster version of @Wen's answer. It still uses reindex, but with numpy.linalg.norm instead of scipy.spatial.distance.euclidean:
import numpy as np
dims = ['X','Y','Z']
df['distance'] = np.linalg.norm((df[dims].values)-(lookup.reindex(df['key']).values), axis = 1)
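For anyone who wants to verify this, here is a self-contained sketch that applies the same one-liner to freshly generated data and compares it with a plain row-by-row loop. The seed, the use of np.random.default_rng, and the name `expected` are choices made for this illustration only. Restricting the lookup side to `dims` is also an addition, so that extra lookup columns would not break the subtraction.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
dims = ['X', 'Y', 'Z']

df = pd.DataFrame(rng.standard_normal((10, 3)), columns=dims)
df['key'] = rng.integers(0, 5, size=10)
lookup = pd.DataFrame(rng.standard_normal((5, 3)), columns=dims)

# Vectorized: one lookup row per df row, then a row-wise Euclidean norm.
df['distance'] = np.linalg.norm(
    df[dims].values - lookup.reindex(df['key'])[dims].values, axis=1)

# Row-by-row reference computation, used only to check the result.
expected = [
    np.linalg.norm(df.loc[i, dims].values.astype(float)
                   - lookup.loc[df.loc[i, 'key'], dims].values.astype(float))
    for i in df.index
]
```

The two results should agree to floating-point precision, which can be checked with np.allclose(df['distance'], expected).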
Source: https://stackoverflow.com/questions/47643952/calculate-distance-based-on-a-lookup-dataframe