Using two dataframes to calculate final value pandas

妖精的绣舞 submitted on 2020-02-05 00:24:17

Question


Currently, I have two dataframes where I am merging on 'KEY'. My first dataframe contains a KEY and the original price of a product. My second dataframe collects information for each time a person makes a payment. I need to create a final calculated column in df1 which shows the remaining balance. The remaining balance is calculated by subtracting payment_price from the original_price. The only caveat is that only certain price_codes reflect a payment (13, 14 and 15).

I'm not sure if the best approach utilizes merges or if I can simply refer to another df without having to merge (the latter approach would seem more ideal since both dfs have 500,000,000+ rows), but I can't find much content on this specific scenario.

df1 = pd.DataFrame({'KEY': ['100000555', '100000009','100000034','100000035', '100000036'], 
              'original_price': [1205.20,1253.25,1852.15,1452.36,1653.21],
              'area': [12, 13, 12,12,12]})
df2 = pd.DataFrame({'KEY': ['100000555', '100000009', '100000009', '100000009', '100000009','100000034','100000034', '100000034'], 
              'payment_price': [134.04, 453.43, 422.32,23.23,10.43,10.47,243.09,23.45],
              'Price_code': ['13', '13', '14','15','16','13','14','15']})

df1:

    KEY         area    original_price
0   100000555   12      1205.20
1   100000009   13      1253.25
2   100000034   12      1852.15
3   100000035   12      1452.36
4   100000036   12      1653.21

df2:

    KEY         payment_price    Price_code
0   100000555   134.04           13
1   100000009   453.43           13
2   100000009   422.32           14
3   100000009   23.23            15
4   100000009   10.43            16
5   100000034   10.47            13
6   100000034   243.09           14
7   100000034   23.45            15

I need to subtract from original_price every payment_price in df2 that matches the KEY and has a price_code of 13, 14 or 15.

final result

    KEY         area    original_price    calculated_price
0   100000555   12      1205.20           1071.16          # (1205.20 - 134.04)
1   100000009   13      1253.25           354.27           # (1253.25 - 453.43 - 422.32 - 23.23)
2   100000034   12      1852.15           1575.14          # (1852.15 - 10.47 - 243.09 - 23.45)
3   100000035   12      1452.36           1452.36
4   100000036   12      1653.21           1653.21

My initial inclination was to merge the two dfs and perform the calculation with a groupby statement. My hesitation is that this seems resource heavy, and my final df will have at least double the number of rows. Additionally, I am running into a mental block on how to write the calculation so that it only includes certain price_codes. So now I'm wondering if there is a better approach. I'm open to other approaches or help with this script. I will be honest in that I'm not entirely sure how to write the conditionals for the price_codes for something like this. The code below first merges the dfs, then creates a column (remaining_price). However, for KEY 100000009 I need to include only the price_codes 13, 14 and 15 and exclude 16, but 16 is currently included.

result = pd.merge(df1, df2,how='left', on='KEY')

codes = [13,14,15]
result['remaining_price'] = result['original_price'] - result['payment_price'].groupby(result['KEY']).transform('sum')

Finally, I assume if this is the approach I use, that I would need to drop all duplicate rows on KEY and the two merged columns (price_code, payment_price).

result = result.drop_duplicates(subset=['KEY'],keep='first')

Answer 1:


Here is one way. There is no need for an explicit merge or to drop duplicates. This is where you might see a performance improvement.

Solution

# Price_code holds strings in df2, so the filter list must hold strings as well.
s = df2[df2['Price_code'].isin(['13', '14', '15'])].groupby('KEY')['payment_price'].sum()

df1['calculated_price'] = df1['original_price'] - df1['KEY'].map(s).fillna(0)

Result

         KEY  area  original_price  calculated_price
0  100000555    12         1205.20           1071.16
1  100000009    13         1253.25            354.27
2  100000034    12         1852.15           1575.14
3  100000035    12         1452.36           1452.36
4  100000036    12         1653.21           1653.21

Explanation

  • Filter df2 by Price_code as required, group by KEY and sum payment_price. The result is a Series mapping each KEY to its total payments.
  • Use map to look up these totals for each KEY in df1 and subtract them from original_price. Keys with no qualifying payment come back as NaN, which fillna(0) treats as nothing paid.
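The two steps above can be sketched end to end on the sample data from the question. This is a minimal sketch, not a tuned solution for 500,000,000+ rows. The names payment_codes, payments and paid_per_key are illustrative, and the final round(2) is an added assumption to hide floating-point noise, not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    'KEY': ['100000555', '100000009', '100000034', '100000035', '100000036'],
    'original_price': [1205.20, 1253.25, 1852.15, 1452.36, 1653.21],
    'area': [12, 13, 12, 12, 12],
})
df2 = pd.DataFrame({
    'KEY': ['100000555', '100000009', '100000009', '100000009',
            '100000009', '100000034', '100000034', '100000034'],
    'payment_price': [134.04, 453.43, 422.32, 23.23, 10.43, 10.47, 243.09, 23.45],
    'Price_code': ['13', '13', '14', '15', '16', '13', '14', '15'],
})

# Step 1: keep only the rows whose code counts as a payment.
# Price_code is stored as strings here, so the list holds strings too.
payment_codes = ['13', '14', '15']
payments = df2[df2['Price_code'].isin(payment_codes)]

# Step 2: one total per KEY, as a Series indexed by KEY.
paid_per_key = payments.groupby('KEY')['payment_price'].sum()

# Step 3: look up each KEY of df1 in that Series. Keys without any
# qualifying payment map to NaN, which fillna(0) turns into zero paid.
df1['calculated_price'] = (
    df1['original_price'] - df1['KEY'].map(paid_per_key).fillna(0)
).round(2)

print(df1)
```

Because map only performs a lookup, df1 keeps its original row count and no drop_duplicates step is needed afterwards.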



Answer 2:


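from dask import delayed

# @delayed makes the call lazy: Dask builds a task and runs it on .compute().
@delayed
def calc_price(df1, df2):
    """Return df1 with calculated_price = original_price - total payments."""

    # Drop rows with Price_code '16', then total the remaining payments per KEY.
    df3 = df2[df2['Price_code'] != '16'].groupby('KEY')['payment_price'].sum().reset_index()
    # The left merge keeps every KEY in df1; keys without payments get NaN, set to 0 here.
    df1 = df1.merge(df3, how='left', on='KEY')
    df1['payment_price'] = df1['payment_price'].fillna(0)
    df1['calculated_price'] = df1['original_price'].sub(df1['payment_price'])

    return df1

df1 = calc_price(df1, df2).compute()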
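Filtering with != '16' gives the same totals as an explicit list of payment codes on the sample data, but the two filters can diverge if other codes exist. The sketch below uses a made-up code '17' and a made-up payment of 5.00, neither of which appears in the question, purely to show the difference.

```python
import pandas as pd

# Hypothetical rows: code '17' is an assumption for illustration only.
df2 = pd.DataFrame({
    'KEY': ['100000009', '100000009', '100000009'],
    'payment_price': [453.43, 10.43, 5.00],
    'Price_code': ['13', '16', '17'],
})

# Deny-list: everything except '16' is counted, including '17'.
excluded_16 = df2[df2['Price_code'] != '16'].groupby('KEY')['payment_price'].sum()

# Allow-list: only 13, 14 and 15 are counted, as the question requires.
only_payments = (
    df2[df2['Price_code'].isin(['13', '14', '15'])]
    .groupby('KEY')['payment_price']
    .sum()
)

print(excluded_16)
print(only_payments)
```

If the real data can contain codes other than 13 to 16, the isin form is probably the safer match for the requirement stated in the question.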


Source: https://stackoverflow.com/questions/49059296/using-two-dataframes-to-calculate-final-value-pandas
