Question
I have a report that identifies the key drivers of an overall number/trend. I would like to automate the ability to list/identify the underlying records that make up a given percentage of that number. For example, if the net change for sales of widgets in the South (region) is -5,000.00, but there are both positives and negatives, I would like to identify the underlying drivers that make up at least ~90% (-4,500.00) of that -5,000.00 total, from largest to smallest.
Data:
region OfficeLocation sales
South 1 -500
South 2 300
South 3 -1000
South 4 -2000
South 5 300
South 6 -700
South 7 -400
South 8 800
North 11 300
North 22 -400
North 33 1000
North 44 800
North 55 900
North 66 -800
For South, the total sales figure is -3200. I would like to identify/list the drivers that make up at least 90% of this move (in descending order); 90% of -3200 is -2880. The directional moves/sales for South offices 3 & 4 sum to -3000, so the output for this request would be:
region OfficeLocation sales
South 3 -1000
South 4 -2000
For North, the total sales figure is +1800. Again I would like to identify/list the drivers that make up at least 90% of this move (in descending order); 90% of 1800 is 1620. The directional moves/sales for North offices 33 & 44 sum to 1800, so the output for this request would be:
region OfficeLocation sales
North 33 1000
North 44 800
The dataset above has both positive and negative trends for South/North. Any help you can provide would be greatly appreciated!
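For reference, here is a minimal sketch to reproduce the sample data in a pandas DataFrame (df is an assumed name) and confirm the group totals quoted above:

import pandas as pd

df = pd.DataFrame({
    'region': ['South'] * 8 + ['North'] * 6,
    'OfficeLocation': [1, 2, 3, 4, 5, 6, 7, 8, 11, 22, 33, 44, 55, 66],
    'sales': [-500, 300, -1000, -2000, 300, -700, -400, 800,
              300, -400, 1000, 800, 900, -800],
})

df.groupby('region').sales.sum()   # North: 1800, South: -3200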
Answer 1:
As mentioned in the comment, it isn't clear what to do in the 'North' case, where the group sum is positive, but ignoring that, you could do something like the following:
In [200]: df[df.groupby('region').sales.apply(lambda g: g <= g.loc[(g.sort_values().cumsum() > 0.9*g.sum()).idxmin()])]
Out[200]:
region OfficeLocation sales
2 South 3 -1000
3 South 4 -2000
13 North 66 -800
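To unpack that one-liner, here is the same logic written out step by step for a single group (the 'South' rows); the intermediate names are only for illustration:

g = df.loc[df.region == 'South', 'sales']

cum = g.sort_values().cumsum()     # running total, most negative sales first
crossed = cum > 0.9 * g.sum()      # False once the running total has gone past 90% of the (negative) group total
cutoff = crossed.idxmin()          # index of the first such row (False sorts below True)
drivers = g[g <= g.loc[cutoff]]    # keep every sale at least as negative as that row's value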
If, in the positive case, you want to find as few elements as possible that together make up 90% of the sum of the sales, the above solution can be adapted as follows:
def is_driver(group):
    s = group.sum()
    if s > 0:
        # Flip the sign so the "negative total" logic below covers both cases.
        group *= -1
        s *= -1
    a = group.sort_values().cumsum() > 0.9 * s
    return group <= group.loc[a.idxmin()]
In [168]: df[df.groupby('region').sales.apply(is_driver)]
Out[168]:
region OfficeLocation sales
2 South 3 -1000
3 South 4 -2000
10 North 33 1000
12 North 55 900
Note that in the case of a tie, only one element is picked out.
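If you prefer not to negate the group in place, an alternative sketch (not part of the original answer, and assuming no group total is zero) works with each record's share of its group total; drivers_by_share below is a hypothetical helper:

def drivers_by_share(group, pct=0.9):
    # A record's share is positive when it moves in the same direction as the group total.
    share = (group / group.sum()).sort_values(ascending=False)
    # Position of the first record where the cumulative share reaches pct
    # (the final cumulative share is always 1.0, so such a position always exists).
    pos = int((share.cumsum() >= pct).to_numpy().argmax())
    return group.index.isin(share.index[:pos + 1])

mask = df.groupby('region')['sales'].transform(drivers_by_share)
df[mask.astype(bool)]   # South 3 & 4, North 33 & 55, as above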
Source: https://stackoverflow.com/questions/52682394/identify-records-that-make-up-90-of-total