Get weighted average summary data column in new pandas dataframe from existing dataframe based on other column-ID

Question

This is similar to an earlier question of mine: "Get summary data columns in new pandas dataframe from existing dataframe based on other column-ID". This time, instead of only summing the data points, I want a weighted average in an extra column. I'll repeat and rephrase the question:

I want to summarize the data in a dataframe and add the new columns to another dataframe. My data contains apartments with an ID number, and it has a surface and a U-value for each room in the apartment. What I want is a dataframe that summarizes this and gives me the total surface and the surface-weighted average U-value per apartment. There are three conditions for the original dataframe:

1. the dataframe can contain empty cells
2. when the surface or U-value is identical in every row of an ID, the data (surface, volumes) is not summed; instead a single row's value is passed to the new summary column (example: 'ID 4'). This could be a mistake in the original dataframe, where the government employee entered the total surface/volume for every room
3. the average U-value should be the surface-weighted average U-value

Initial dataframe 'data':

print(data)
    ID  Surface  U-value
0    2     10.0      1.0
1    2     12.0      1.0
2    2     24.0      0.5
3    2      8.0      1.0
4    4     84.0      0.8
5    4     84.0      0.8
6    4     84.0      0.8
7   52      NaN      0.2
8   52     96.0      1.0
9   95      8.0      2.0
10  95      6.0      2.0
11  95     12.0      2.0
12  95     30.0      1.0
13  95     12.0      1.5

Desired output from 'df':

print(df)

    ID  Surface  U-value  # -> U-value = surface-weighted U-value; Surface = sum of all surfaces, except when all surfaces of an ID are the same (example: 'ID 4')
0    2     54.0   0.777
1    4     84.0   0.8     # -> the values are the same for every row of this ID in the original data, so no sum is taken and only one row is passed (see the second condition)
2   52     96.0   1.0     # -> one of the two surfaces is empty, so the U-value in that row is ignored; the output is the weighted average over the rows that have both a 'Surface' and a 'U-value' (here 1.0)
3   95     68.0   1.47
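
As a quick check of what "surface-weighted" means here, the value for ID 2 can be reproduced by hand from its four rows in 'data'. This is only an illustrative sketch in plain Python; the variable names are made up for the example.

```python
# Rows of ID 2 from the example data: one entry per room
surfaces = [10.0, 12.0, 24.0, 8.0]
u_values = [1.0, 1.0, 0.5, 1.0]

# Each U-value counts in proportion to the surface of its room
weighted_sum = sum(s * u for s, u in zip(surfaces, u_values))  # 42.0
total_surface = sum(surfaces)                                  # 54.0

weighted_u = weighted_sum / total_surface
print(round(weighted_u, 3))  # 0.778
```

The result is 42 / 54, which the desired output lists (truncated) as 0.777.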

jezrael's code from the referenced question already works brilliantly for the sum(), but how do I add a weighted-average 'U-value' column to it? I really have no idea. A plain average could be obtained with mean() instead of sum(), but what about the weighted average?

import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [2, 4, 52, 95]})

data = pd.DataFrame({"ID": [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
                     "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
                     "U-value": [1.0, 1.0, 0.5, 1.0, 0.8, 0.8, 0.8, 0.2, 1.0, 2.0, 2.0, 2.0, 1.0, 1.5]})
print(data)

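cols = ['Surface']
# m1: True for every row of an ID whose column holds exactly one distinct non-empty value
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
# m2: True for rows that repeat an earlier (value, ID) pair
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())

# blank out the repeats inside all-equal groups, then sum per ID
df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum().reset_index()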
print(df)
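
For readers who find the mask logic hard to follow, the second condition can also be sketched as an explicit per-group function. This is not jezrael's code, only an assumed equivalent for the single 'Surface' column that should give the same totals on the example data; `summarize_surface` is a hypothetical helper name, and it treats empty cells as absent.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "ID": [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
    "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
})

def summarize_surface(s):
    """Hypothetical helper: total surface for the rows of one ID."""
    s = s.dropna()          # ignore empty cells
    if s.nunique() == 1:    # every remaining row lists the same value
        return s.iloc[0]    # pass one value instead of summing
    return s.sum()          # otherwise sum the rooms

surface = data.groupby("ID")["Surface"].agg(summarize_surface)
print(surface)
```

The mask-based version is likely the better choice for many columns or large dataframes; the function form mainly makes the rule easier to read.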

Answer 1:

This should do the trick:

data.groupby('ID').apply(lambda g: (g['U-value']*g['Surface']).sum() / g['Surface'].sum())

To add this to the summary dataframe, don't reset the index first, so that the new column aligns on ID:

df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum()
df['U-value'] = data.groupby('ID').apply(
    lambda g: (g['U-value'] * g['Surface']).sum() / g['Surface'].sum())
df.reset_index(inplace=True)

The result:

   ID  Surface   U-value
0   2     54.0  0.777778
1   4     84.0  0.800000
2  52     96.0  1.000000
3  95     68.0  1.470588
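
If `groupby(...).apply` turns out to be slow on a large dataframe, or if U-values can be missing as well, the same weighted average can be built from two grouped sums. What follows is a minimal sketch, assuming that a row should count only when both 'Surface' and 'U-value' are present; the names `valid`, `weighted_sum` and `total_surface` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "ID": [2, 2, 2, 2, 4, 4, 4, 52, 52, 95, 95, 95, 95, 95],
    "Surface": [10, 12, 24, 8, 84, 84, 84, np.nan, 96, 8, 6, 12, 30, 12],
    "U-value": [1.0, 1.0, 0.5, 1.0, 0.8, 0.8, 0.8, 0.2, 1.0, 2.0, 2.0, 2.0, 1.0, 1.5],
})

# Assumption: keep only rows where both the weight and the value are present
valid = data.dropna(subset=["Surface", "U-value"])

# Numerator: sum of Surface * U-value per ID
weighted_sum = (valid["Surface"] * valid["U-value"]).groupby(valid["ID"]).sum()
# Denominator: sum of Surface per ID, over the same rows
total_surface = valid.groupby("ID")["Surface"].sum()

weighted_u = (weighted_sum / total_surface).rename("U-value")
print(weighted_u.round(6))
```

The resulting Series is indexed by ID, so it can be assigned as `df['U-value']` before `reset_index`, just like the `apply` result. An ID whose rows are all incomplete would be absent from this Series and end up as NaN after the assignment.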

Source: https://stackoverflow.com/questions/61302339/get-weighted-average-summary-data-column-in-new-pandas-dataframe-from-existing-d

Tags

python

pandas

group-by

weighted-average