Question
I have a CSV like this:
client1,client2,client3,client4,client5,client6,amount
,,,Comp1,,,4.475000
,,,Comp2,,,16.305584
,,,Comp3,,,4.050000
Comp2,Comp1,,Comp4,,,21.000000
,,,Comp4,,,30.000000
,Comp1,,Comp2,,,5.137500
,,,Comp3,,,52.650000
,,,Comp1,,,2.650000
Comp3,,,Comp3,,,29.000000
Comp5,,,Comp2,,,20.809000
Comp5,,,Comp2,,,15.100000
Comp5,,,Comp2,,,52.404000
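For anyone who wants to reproduce this, a minimal sketch of loading the sample is below. Embedding the data in a string and the name CSV_TEXT are assumptions made for illustration; with a real file, the path would be passed to pd.read_csv instead.

```python
import io

import pandas as pd

# The sample data from above, embedded as a string (CSV_TEXT is a
# hypothetical name used only for this illustration).
CSV_TEXT = """client1,client2,client3,client4,client5,client6,amount
,,,Comp1,,,4.475000
,,,Comp2,,,16.305584
,,,Comp3,,,4.050000
Comp2,Comp1,,Comp4,,,21.000000
,,,Comp4,,,30.000000
,Comp1,,Comp2,,,5.137500
,,,Comp3,,,52.650000
,,,Comp1,,,2.650000
Comp3,,,Comp3,,,29.000000
Comp5,,,Comp2,,,20.809000
Comp5,,,Comp2,,,15.100000
Comp5,,,Comp2,,,52.404000
"""

# Empty fields are parsed as NaN by default.
df = pd.read_csv(io.StringIO(CSV_TEXT))
print(df.shape)  # (12, 7)
```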
After reading it into a pandas DataFrame, df, I want to aggregate in two steps.
Step 1:
First, I sum the amount:
client1  client2  client3  client4  client5  client6      amount
                           Comp1                        7.125000
                           Comp2                       16.305584
                           Comp3                       56.700000
                           Comp4                       30.000000
         Comp1             Comp2                        5.137500
Comp2    Comp1             Comp4                       21.000000
Comp3                      Comp3                       29.000000
Comp5                      Comp2                       88.313000
Then, I want to aggregate by each client name such that, if multiple clients are involved in a row, the amount is split equally among them. In group 5, for example, 5.1375 must be split equally between Comp1 and Comp2. I tried it this way:
df.groupby(['client1','client2','client3','client4','client5','client6']).apply(lambda x: x['amount'].sum()/len(x) if x.any().nunique()>=1 else x['amount'].sum())
   client1  client2  client3  client4  client5  client6           0
0                             Comp1                        3.562500
1                             Comp2                       16.305584
2                             Comp3                       28.350000
3                             Comp4                       30.000000
4           Comp1             Comp2                        5.137500
5  Comp2    Comp1             Comp4                       21.000000
6  Comp3                      Comp3                       29.000000
7  Comp5                      Comp2                       29.437667
The expected output is:
Client Amount
Comp1 4.475+21/3+5.1375/2+2.65 = 16.69375
Comp2 16.305584+21/3+5.1375/2+20.809/2+15.10/2+52.404/2 = 70.030834
Comp3 4.05+52.65+29 = 85.7
Comp4 21/3+30 = 37
Comp5 20.809/2+15.10/2+52.404/2 = 44.1565
I tried using sum(axis=0), but to no avail.
Answer 1:
We can use a bit of maths here:
import pandas as pd

cols = ['amount']
# Divide each amount by the number of non-null client fields in its row
df['new'] = df['amount'] / df.drop(columns=cols).notnull().sum(axis=1)
# Drop the amount column, set 'new' as the index, unstack and drop the NaNs
x = df.drop(columns=cols).set_index('new').unstack().dropna()
# Create a dataframe from just the split amounts and the clients
ndf = pd.DataFrame({'amount': x.index.droplevel(0).values, 'clients': x.values})
# Group by client and get the sum
ndf.groupby('clients').sum()
Output:
            amount
clients
Comp1    16.360417
Comp2    69.697501
Comp3    85.700000
Comp4    36.666667
Comp5    44.156500
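The same divide-then-reshape idea can also be sketched with melt instead of set_index/unstack. This is only an illustration, not part of the original answer: the four rows are a reduced sample taken from the question's data, and the names client_cols, share and long_df are made up for the example.

```python
import pandas as pd

# Four rows taken from the question's data (a reduced sample, for brevity).
df = pd.DataFrame(
    [
        [None, None, None, "Comp1", None, None, 4.475],
        ["Comp2", "Comp1", None, "Comp4", None, None, 21.0],
        [None, "Comp1", None, "Comp2", None, None, 5.1375],
        ["Comp3", None, None, "Comp3", None, None, 29.0],
    ],
    columns=["client1", "client2", "client3", "client4",
             "client5", "client6", "amount"],
)
client_cols = [c for c in df.columns if c.startswith("client")]

# Each row's amount divided by the number of clients named in that row.
df["share"] = df["amount"] / df[client_cols].notna().sum(axis=1)

# One row per (original row, client slot); empty slots are dropped.
long_df = df.melt(id_vars="share", value_vars=client_cols, value_name="client")
long_df = long_df.dropna(subset=["client"])

# Sum the shares per client.
totals = long_df.groupby("client")["share"].sum()
print(totals)
```

On this sample, Comp3 appears twice in the last row, so it receives two half shares and ends up with the full 29.0, which matches how the question treats that row.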
Answer 2:
I'd organize it like this:
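d = df.drop(columns='amount')  # new df without `amount`
a = df.amount                  # separate series of `amount`
c = d.count(axis=1)            # count of non-null values per row
# dropna() is a no-op where stack() already drops NaNs; newer pandas may keep them
a.div(c).repeat(c).groupby(d.stack().dropna().values).sum()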
Comp1 16.693750
Comp2 70.030834
Comp3 85.700000
Comp4 37.000000
Comp5 44.156500
dtype: float64
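To see why the repeat step lines up with the stacked client names, here is a small sketch on just two rows from the question (21.0 shared by three clients, 5.1375 by two). The hand-written names array is an assumption standing in for what stacking the client columns would produce, in row-by-row order.

```python
import numpy as np
import pandas as pd

# Two rows from the question: 21.0 shared by three clients, 5.1375 by two.
amount = pd.Series([21.0, 5.1375])
count = pd.Series([3, 2])

# Per-client share of each row, repeated once per client in that row.
shares = amount.div(count).repeat(count)

# Client names in the same row-by-row order (written out by hand here).
names = np.array(["Comp2", "Comp1", "Comp4", "Comp1", "Comp2"])

# Grouping by the array matches shares and names by position.
totals = shares.groupby(names).sum()
print(totals)
```

Because repeat emits each row's share once per client in that row, the repeated series and the name array have the same length and order, so grouping by position is enough and no join is needed.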
Source: https://stackoverflow.com/questions/46995997/grouping-pandas-dataframe-across-rows