Question
Currently, I have two dataframes where I am merging on 'KEY'. My first dataframe contains a KEY and the original price of a product. My second dataframe collects information for each time a person makes a payment. I need to create a final calculated column in df1 which shows the remaining balance. The remaining balance is calculated by subtracting payment_price from the original_price. The only caveat is that only certain price_codes reflect a payment (13, 14 and 15).
I'm not sure if the best approach utilizes merges or if I can simply refer to another df without having to merge (the latter approach would seem more ideal since both dfs have 500,000,000+ rows), but I can't find much content on this specific scenario.
import pandas as pd

df1 = pd.DataFrame({'KEY': ['100000555', '100000009', '100000034', '100000035', '100000036'],
                    'original_price': [1205.20, 1253.25, 1852.15, 1452.36, 1653.21],
                    'area': [12, 13, 12, 12, 12]})
df2 = pd.DataFrame({'KEY': ['100000555', '100000009', '100000009', '100000009', '100000009', '100000034', '100000034', '100000034'],
                    'payment_price': [134.04, 453.43, 422.32, 23.23, 10.43, 10.47, 243.09, 23.45],
                    'Price_code': ['13', '13', '14', '15', '16', '13', '14', '15']})
df1:
         KEY  area  original_price
0  100000555    12         1205.20
1  100000009    13         1253.25
2  100000034    12         1852.15
3  100000035    12         1452.36
4  100000036    12         1653.21
df2:
         KEY  payment_price Price_code
0  100000555         134.04         13
1  100000009         453.43         13
2  100000009         422.32         14
3  100000009          23.23         15
4  100000009          10.43         16
5  100000034          10.47         13
6  100000034         243.09         14
7  100000034          23.45         15
I need to subtract from original_price every payment_price in df2 that matches on KEY and has a Price_code of 13, 14, or 15.
final result
         KEY  area  original_price  calculated_price
0  100000555    12         1205.20           1071.16  # (1205.20 - 134.04)
1  100000009    13         1253.25            354.27  # (1253.25 - 453.43 - 422.32 - 23.23)
2  100000034    12         1852.15           1575.14  # (1852.15 - 10.47 - 243.09 - 23.45)
3  100000035    12         1452.36           1452.36
4  100000036    12         1653.21           1653.21
My initial inclination was to merge the two dfs and perform the calculation with a groupby statement. My hesitation is that this seems resource heavy, and the merged df will have at least double the number of rows. Additionally, I am running into a mental block on how to write the calculation so that it only includes certain price_codes. So now I'm wondering if there is a better approach. I'm open to other approaches or help with this script. I will be honest in that I'm not entirely sure how to write the conditionals for the price_codes for something like this. The code below first merges the dfs, then creates a column (remaining_price). However, for KEY 100000009 I need to include only the price_codes 13, 14, and 15 and exclude 16, but 16 is currently included.
result = pd.merge(df1, df2,how='left', on='KEY')
codes = [13,14,15]
result['remaining_price'] = result['original_price'] - result['payment_price'].groupby(result['KEY']).transform('sum')
Finally, I assume if this is the approach I use, that I would need to drop all duplicate rows on KEY and the two merged columns (price_code, payment_price).
result = result.drop_duplicates(subset=['KEY'],keep='first')
Answer 1:
Here is one way. There is no need for an explicit merge or to drop duplicates. This is where you might see a performance improvement.
Solution
# Price_code holds strings in the sample data, so compare against strings
s = df2[df2['Price_code'].isin(['13', '14', '15'])].groupby('KEY')['payment_price'].sum()
df1['calculated_price'] = df1['original_price'] - df1['KEY'].map(s).fillna(0)
Result
         KEY  area  original_price  calculated_price
0  100000555    12         1205.20           1071.16
1  100000009    13         1253.25            354.27
2  100000034    12         1852.15           1575.14
3  100000035    12         1452.36           1452.36
4  100000036    12         1653.21           1653.21
Explanation
- Filter df2 by Price_code as required, group by KEY, and sum payment_price. The result is a series mapping KEY to the sum of payments.
- Use map to map these sums to KEY in df1 and subtract them from original_price. Keys with no qualifying payments map to NaN, which fillna(0) turns into 0.
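The two steps can be traced on the sample data. This is a minimal sketch rather than the answerer's exact code: the `astype(str)` cast and the final `round(2)` are added assumptions, the first to guard against Price_code being stored as integers in the real data (`isin` compares values exactly, so `'13'` does not match `13`), the second to hide floating-point noise. The names `payment_codes`, `is_payment` and `paid` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'KEY': ['100000555', '100000009', '100000034', '100000035', '100000036'],
                    'original_price': [1205.20, 1253.25, 1852.15, 1452.36, 1653.21],
                    'area': [12, 13, 12, 12, 12]})
df2 = pd.DataFrame({'KEY': ['100000555', '100000009', '100000009', '100000009',
                            '100000009', '100000034', '100000034', '100000034'],
                    'payment_price': [134.04, 453.43, 422.32, 23.23, 10.43, 10.47, 243.09, 23.45],
                    'Price_code': ['13', '13', '14', '15', '16', '13', '14', '15']})

# Step 1: keep only rows whose code counts as a payment, then total per KEY.
# astype(str) is an assumption so the filter also works if codes are integers.
payment_codes = ['13', '14', '15']
is_payment = df2['Price_code'].astype(str).isin(payment_codes)
paid = df2.loc[is_payment].groupby('KEY')['payment_price'].sum()

# Step 2: look up each KEY's total in `paid`; keys with no payments become 0.
df1['calculated_price'] = (df1['original_price'] - df1['KEY'].map(paid).fillna(0)).round(2)

print(paid)
print(df1)
```

Because `paid` has at most one entry per KEY, the lookup never duplicates rows in df1, which is why no drop_duplicates step is needed.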
Answer 2:
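from dask import delayed

# Use this function for parallel computing with Dask
# (delayed defers the call until .compute() is invoked)
@delayed
def calc_price(df1, df2):
    """Calculate original_price minus the summed payments (codes 13, 14, 15)."""
    # Keep only payment codes, then total payment_price per KEY
    payments = df2[df2['Price_code'].isin(['13', '14', '15'])]
    df3 = payments.groupby('KEY')['payment_price'].sum().reset_index()
    # Left merge keeps every KEY in df1; keys without payments get 0
    df1 = df1.merge(df3, how='left', on='KEY')
    df1['payment_price'] = df1['payment_price'].fillna(0)
    df1['calculated_price'] = df1['original_price'].sub(df1['payment_price'])
    return df1

df1 = calc_price(df1, df2).compute()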
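For comparison, the merge-then-transform attempt from the question can be made to respect the code filter by zeroing out the payments that should not count before summing. This is only a sketch on the sample data, assuming Price_code is stored as strings; `counted` is an illustrative name. It still builds the full merged frame, so the aggregate-first approaches above are likely lighter on memory.

```python
import pandas as pd

df1 = pd.DataFrame({'KEY': ['100000555', '100000009', '100000034', '100000035', '100000036'],
                    'original_price': [1205.20, 1253.25, 1852.15, 1452.36, 1653.21],
                    'area': [12, 13, 12, 12, 12]})
df2 = pd.DataFrame({'KEY': ['100000555', '100000009', '100000009', '100000009',
                            '100000009', '100000034', '100000034', '100000034'],
                    'payment_price': [134.04, 453.43, 422.32, 23.23, 10.43, 10.47, 243.09, 23.45],
                    'Price_code': ['13', '13', '14', '15', '16', '13', '14', '15']})

result = pd.merge(df1, df2, how='left', on='KEY')
codes = ['13', '14', '15']

# Replace payments with 0 when the code is not in `codes`.
# Rows for keys without any payment have NaN codes, so they become 0 as well.
counted = result['payment_price'].where(result['Price_code'].isin(codes), 0)

# Sum the counted payments per KEY and subtract from the original price.
result['remaining_price'] = result['original_price'] - counted.groupby(result['KEY']).transform('sum')

# One row per KEY, without the per-payment columns brought in by the merge.
result = (result.drop_duplicates(subset=['KEY'], keep='first')
                .drop(columns=['payment_price', 'Price_code']))
print(result)
```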
Source: https://stackoverflow.com/questions/49059296/using-two-dataframes-to-calculate-final-value-pandas