Using pandas to identify nearest objects

问题

I have an assignment that can be done using any programming language. I chose Python and pandas since I have little experience using these and thought it would be a good learning experience. I was able to complete the assignment using traditional loops that I know from traditional computer programming, and it ran okay over thousands of rows, but it brought my laptop down to a screeching halt once I let it process millions of rows. The assignment is outlined below.

You have a two-lane road on a two-dimensional plane. One lane is for cars and the other lane is reserved for trucks. The data looks like this (spanning millions of rows for each table):

cars

   id  start  end
0  C1    200  215
1  C2    110  125
2  C3    240  255
...

trucks

   id  start  end
0  T1    115  175
1  T2    200  260
2  T3    280  340
3  T4     25   85
...

The two dataframes above correspond to this:

start and end columns represent arbitrary positions on the road, where start = the back edge of the vehicle and end = the front edge of the vehicle.

The task is to identify the trucks closest to every car. A truck can have up to three different relationships to a car:

Back - it is in back of the car (cars.end > trucks.end)
Across - it is across from the car (cars.start >= trucks.start and cars.end <= trucks.end)
Front - it is in front of the car (cars.start < trucks.start)

I emphasized "up to" because if there is another car in back or front that is closer to the nearest truck, then this relationship is ignored. In the case of the illustration above, we can observe the following:

C1: Back = T1, Across = T2, Front = none (C3 is blocking)
C2: Back = T4, Across = none, Front = T1
C3: Back = none (C1 is blocking), Across = T2, Front = T3

The final output needs to be appended to the cars dataframe along with the following new columns:

data cross-referenced from the trucks dataframe
for back positions, the gap distance (cars.start - trucks.end)
for front positions, the gap distance (trucks.start - cars.end)

The final cars dataframe should look like this:

   id  start  end  back_id  back_start  back_end  back_distance  across_id  across_start  across_end  front_id  front_start  front_end  front_distance
0  C1    200  215       T1         115       175             25         T2           200         260
1  C2    110  125       T4          25        85             25                                             T1          115        175             -10
2  C3    240  255                                                       T2           200         260        T3          280        340              25

Is pandas even the best tool for this task? If there is a better suited tool that is efficient at cross-referencing and appending columns based on some calculation across millions of rows, then I am all ears.

回答1:

so with pandas, you can use merge_asof, here is one way, maybe not efficient with millions of rows:

#first sort values
trucks = trucks.sort_values(['start'])
cars = cars.sort_values(['start'])

#create back condition
df_back = pd.merge_asof(trucks.rename(columns={col:f'back_{col}' 
                                               for col in trucks.columns}), 
                        cars.assign(back_end=lambda x: x['end']), 
                        on='back_end', direction='forward')\
            .query('end>back_end')\
            .assign(back_distance=lambda x: x['start']-x['back_end'])

#create across condition: here note that cars is the first of the 2 dataframes
df_across = pd.merge_asof(cars.assign(across_start=lambda x: x['start']),
                          trucks.rename(columns={col:f'across_{col}' 
                                                 for col in trucks.columns}), 
                          on=['across_start'], direction='backward')\
              .query('end<=across_end')

#create front condition
df_front = pd.merge_asof(trucks.rename(columns={col:f'front_{col}' 
                                                for col in trucks.columns}), 
                         cars.assign(front_start=lambda x: x['start']), 
                         on='front_start', direction='backward')\
             .query('start<front_start')\
             .assign(front_distance=lambda x: x['front_start']-x['end'])

# merge all back to cars
df_f = cars.merge(df_back, how='left')\
           .merge(df_across, how='left')\
           .merge(df_front, how='left')

and you get

print (df_f)
   id  start  end back_id  back_start  back_end  back_distance  across_start  \
0  C2    110  125      T4        25.0      85.0           25.0           NaN   
1  C1    200  215      T1       115.0     175.0           25.0         200.0   
2  C3    240  255     NaN         NaN       NaN            NaN         240.0   

  across_id  across_end front_id  front_start  front_end  front_distance  
0       NaN         NaN       T1        115.0      175.0           -10.0  
1        T2       260.0      NaN          NaN        NaN             NaN  
2        T2       260.0       T3        280.0      340.0            25.0

来源：https://stackoverflow.com/questions/60922730/using-pandas-to-identify-nearest-objects

标签

python

pandas