Look up string values in a lookup table to populate a second dataframe

Submitted by 魔方 西西 on 2019-12-11 01:43:43

Question


I have two dataframes, main_df:

  | header_1
0 | value_1
1 | value_2
2 | value_3
3 | value_1

And a lookup dataframe lookup_df:

  | header_1 | header_2
0 | value_1 | lookup_value_1
1 | value_2 | lookup_value_2
2 | value_3 | lookup_value_3
3 | value_4 | lookup_value_4

The values in main_df are not unique. The values in lookup_df are unique.

I simply want to populate a new column in main_df with the corresponding lookup value from lookup_df.

I have tried various approaches, including .merge, .join, .map and .lookup, for example:

main_df = pd.merge(main_df, lookup_df, how='inner', on=['header_1'])

The outcome I am looking for is:

  | header_1 | header_2
0 | value_1 | lookup_value_1
1 | value_2 | lookup_value_2
2 | value_3 | lookup_value_3
3 | value_1 | lookup_value_1

Answer 1:


You can use map with a Series:

main_df['header_2'] = main_df['header_1'].map(lookup_df.set_index('header_1')['header_2'])
print (main_df)
  header_1        header_2
0  value_1  lookup_value_1
1  value_2  lookup_value_2
2  value_3  lookup_value_3
3  value_1  lookup_value_1

Or, slightly faster, convert the Series to a dict with to_dict:

main_df['header_2'] = main_df['header_1'].map(lookup_df.set_index('header_1')['header_2']
                                                       .to_dict())
print (main_df)
  header_1        header_2
0  value_1  lookup_value_1
1  value_2  lookup_value_2
2  value_3  lookup_value_3
3  value_1  lookup_value_1
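
One point the outputs above do not show is what map does with a key that has no match: the result for that row is NaN. The following is a minimal sketch of that case, assuming a hypothetical key value_9 and a placeholder string 'unknown', neither of which appears in the question:

```python
import pandas as pd

# 'value_9' is a hypothetical key that is absent from the lookup table
main_df = pd.DataFrame({'header_1': ['value_1', 'value_2', 'value_9']})
lookup_df = pd.DataFrame({
    'header_1': ['value_1', 'value_2', 'value_3'],
    'header_2': ['lookup_value_1', 'lookup_value_2', 'lookup_value_3'],
})

# Build the key -> value Series once, then map it onto the main column
lookup = lookup_df.set_index('header_1')['header_2']
main_df['header_2'] = main_df['header_1'].map(lookup)

# Rows without a match are NaN after map
unmatched = main_df['header_2'].isna()

# Optionally replace them with a placeholder (an assumption, pick your own)
main_df['header_2'] = main_df['header_2'].fillna('unknown')
print(main_df)
```

Checking the unmatched mask before filling makes it easy to spot keys that are missing from the lookup table.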

Timings:

#[400000 rows x 1 columns]
main_df = pd.concat([main_df]*100000).reset_index(drop=True)

In [139]: %timeit pd.merge(main_df, lookup_df, how='left', on=['header_1'])
10 loops, best of 3: 73.1 ms per loop

In [140]: %timeit main_df['header_1'].map(lookup_df.set_index('header_1')['header_2'])
10 loops, best of 3: 35.7 ms per loop

In [141]: %timeit main_df['header_1'].map(lookup_df.set_index('header_1')['header_2'].to_dict())
10 loops, best of 3: 35.1 ms per loop

EDIT:

The values in column header_1 of lookup_df must be unique. If they are not, one possible solution is drop_duplicates:

print (lookup_df)
  header_1        header_2
0  value_1  lookup_value_1
1  value_2  lookup_value_2
2  value_3  lookup_value_3
3  value_1  lookup_value_4

#keep first value, default parameter keep='first'
lookup_df_first = lookup_df.drop_duplicates(['header_1'])
print (lookup_df_first)
  header_1        header_2
0  value_1  lookup_value_1
1  value_2  lookup_value_2
2  value_3  lookup_value_3

#keep last value
lookup_df_last = lookup_df.drop_duplicates(['header_1'], keep='last')
print (lookup_df_last)
  header_1        header_2
1  value_2  lookup_value_2
2  value_3  lookup_value_3
3  value_1  lookup_value_4
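
Before dropping anything, it may help to check whether the lookup keys are unique and to see which rows conflict. This is a rough sketch using the duplicated lookup table from this EDIT; the variable names keys_unique and conflicts are only illustrative:

```python
import pandas as pd

lookup_df = pd.DataFrame({
    'header_1': ['value_1', 'value_2', 'value_3', 'value_1'],
    'header_2': ['lookup_value_1', 'lookup_value_2',
                 'lookup_value_3', 'lookup_value_4'],
})

# True only when every key appears exactly once
keys_unique = lookup_df['header_1'].is_unique

# keep=False marks every row whose key is repeated, so all conflicts are visible
conflicts = lookup_df[lookup_df['header_1'].duplicated(keep=False)]

print(keys_unique)
print(conflicts)
```

Inspecting the conflicting rows first makes the choice between keep='first' and keep='last' a deliberate one.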



Answer 2:


You can do a merge without the 'how' keyword (it defaults to 'inner'), like so:

main_df = pd.DataFrame([{'header_1': 'value_1'},{'header_1': 'value_2'},{'header_1': 'value_3'},{'header_1': 'value_1'}])

lookup_df = pd.DataFrame([{'header_1':'value_1', 'header_2':'lookup_value_1'}, {'header_1':'value_2', 'header_2':'lookup_value_2'}, {'header_1':'value_3', 'header_2':'lookup_value_3'}, {'header_1':'value_4', 'header_2':'lookup_value_4'}])

main_df = pd.merge(main_df, lookup_df, on='header_1')

The output is:

  header_1        header_2
0  value_1  lookup_value_1
1  value_1  lookup_value_1
2  value_2  lookup_value_2
3  value_3  lookup_value_3
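
The output shown above lists both value_1 rows first, which differs from the row order asked for in the question. If the original order of main_df matters, a left merge is one option. This is a sketch rather than part of the original answer; the validate argument is an optional extra, added here on the assumption that the lookup keys are meant to be unique:

```python
import pandas as pd

main_df = pd.DataFrame({'header_1': ['value_1', 'value_2', 'value_3', 'value_1']})
lookup_df = pd.DataFrame({
    'header_1': ['value_1', 'value_2', 'value_3', 'value_4'],
    'header_2': ['lookup_value_1', 'lookup_value_2',
                 'lookup_value_3', 'lookup_value_4'],
})

# how='left' keeps every row of main_df in its original order;
# validate='many_to_one' raises a MergeError if lookup_df has duplicate keys
result = pd.merge(main_df, lookup_df, how='left', on='header_1',
                  validate='many_to_one')
print(result)
```

With how='left', rows of main_df that have no match in lookup_df are kept with NaN in header_2 instead of being dropped.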


Source: https://stackoverflow.com/questions/41806079/lookup-string-values-in-lookup-table-to-populate-second-dataframe
