问题
- I am working with two csv files and imported as dataframe, df1 and df2
- df1 has 50000 rows and df2 has 150000 rows.
- I want to compare (iterate through each row) the 'time' of df2 with df1, find the difference in time and return the values of all column corresponding to similar row, save it in df3 (time synchronization)
- For example, 35427949712 (of 'time' in df1) is nearest or equal to 35427949712 (of 'time' in df2), So I would like to return the contents to df1 ('velocity_x' and 'yaw') and df2 ('velocity' and 'yawrate') and save in df3
- For this i used two techniques, shown in code.
- Code 1 takes very long time to execute 72 hours which is not practice since i have lot of csv files
- Code 2 gives me "memory error" and kernel dies.
Would be great, if I get a more robust solution for the problem considering computational time, memory and power(Intel Core i7-6700HQ, 8 GB Ram)
Here is the sample data,
import pandas as pd
df1 = pd.DataFrame({'time': [35427889701, 35427909854, 35427929709,35427949712, 35428009860],
'velocity_x':[12.5451, 12.5401,12.5351,12.5401,12.5251],
'yaw' : [-0.0787806, -0.0784749, -0.0794889,-0.0795915,-0.0795472]})
df2 = pd.DataFrame({'time': [35427929709, 35427949712, 35427009860,35427029728, 35427049705],
'velocity':[12.6583, 12.6556,12.6556,12.6556,12.6444],
'yawrate' : [-0.0750492, -0.0750492, -0.074351,-0.074351,-0.074351]})
df3 = pd.DataFrame(columns=['time','velocity_x','yaw','velocity','yawrate'])
Code1
for index, row in df1.iterrows():
min=100000
for indexer, rows in df2.iterrows():
if abs(float(row['time'])-float(rows['time']))<min:
min = abs(float(row['time'])-float(rows['time']))
#storing the position
pos = indexer
df3.loc[index,'time'] = df1['time'][pos]
df3.loc[index,'velocity_x'] = df1['velocity_x'][pos]
df3.loc[index,'yaw'] = df1['yaw'][pos]
df3.loc[index,'velocity'] = df2['velocity'][pos]
df3.loc[index,'yawrate'] = df2['yawrate'][pos]
Code2
df1['key'] = 1
df2['key'] = 1
df1.rename(index=str, columns ={'time' : 'time_x'}, inplace=True)
df = df2.merge(df1, on='key', how ='left').reset_index()
df['diff'] = df.apply(lambda x: abs(x['time'] - x['time_x']), axis=1)
df.sort_values(by=['time', 'diff'], inplace=True)
df=df.groupby(['time']).first().reset_index()[['time', 'velocity_x', 'yaw', 'velocity', 'yawrate']]
回答1:
You're looking for pandas.merge_asof. It allows you to combine 2 DataFrame
s on a key, in this case time
, without the requirement that they are an exact match. You can choose a direction
for prioritizing the match, but in this case it's obvious that you want nearest
A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.
One caveat is that you need to sort things for merge_asof
to work.
import pandas as pd
pd.merge_asof(df2.sort_values('time'), df1.sort_values('time'), on='time', direction='nearest')
# time velocity yawrate velocity_x yaw
#0 35427009860 12.6556 -0.074351 12.5451 -0.078781
#1 35427029728 12.6556 -0.074351 12.5451 -0.078781
#2 35427049705 12.6444 -0.074351 12.5451 -0.078781
#3 35427929709 12.6583 -0.075049 12.5351 -0.079489
#4 35427949712 12.6556 -0.075049 12.5401 -0.079591
Just be careful about which DataFrame
you choose as the left or right frame, as that changes the result. In this case I'm selecting the time
in df1
which is closest in absolute distance to the time
in df2
.
You also need to be careful if you have duplicated on
keys in the right df
because for exact matches, merge_asof
only merges the last sorted row of the right df
to the left df
, instead of creating multiple entries for each exact match. If that's a problem, you can instead merge the exact keys first to get all of the combinations, and then merge the remainder with asof.
回答2:
just a side note (as not an answer)
min_delta=100000
for indexer, rows in df2.iterrows():
if abs(float(row['time'])-float(rows['time']))<min_delta:
min_delta = abs(float(row['time'])-float(rows['time']))
#storing the position
pos = indexer
can be written as
diff = np.abs(row['time'] - df2['time'])
pos = np.argmin(diff)
(always avoid for loops)
and don't call your vars with a built-in name (min
)
来源:https://stackoverflow.com/questions/50434160/python-pandas-compare-two-data-frames-along-one-column-and-return-content-of-r