Question
I have a report that identifies the key drivers of an overall number/trend. I would like to automate the ability to list/identify the underlying records that make up a given percentage of that number. For example, if the net change for sales of widgets in the South (region) is -5,000.00, but there are both positives and negatives, I would like to identify the underlying drivers that make up at least ~90% (-4,500.00) of that -5,000.00 total, from largest to smallest.
Data:
region OfficeLocation sales
South 1 -500
South 2 300
South 3 -1000
South 4 -2000
South 5 300
South 6 -700
South 7 -400
South 8 800
North 11 300
North 22 -400
North 33 1000
North 44 800
North 55 900
North 66 -800
For South, the total sales figure is -3200. I would like to identify/list the drivers that make up at least 90% of this move (in descending order); 90% of -3200 is -2880. The directional moves/sales for South offices 3 & 4 sum to -3000, so the output for this request would be:
region OfficeLocation sales
South 3 -1000
South 4 -2000
For North, the total sales figure is +1800. Again I would like to identify/list the drivers that make up at least 90% of this move (in descending order); 90% of 1800 is 1620. The directional moves/sales for North offices 33 & 44 sum to 1800, so the output for this request would be:
region OfficeLocation sales
North 33 1000
North 44 800
The dataset above has both positive and negative trends for South/North. Any help you can provide would be greatly appreciated!
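For reference, here is a minimal sketch to reproduce the sample data in a pandas DataFrame (df is an assumed name) and confirm the group totals quoted above:

import pandas as pd

df = pd.DataFrame({
    'region': ['South'] * 8 + ['North'] * 6,
    'OfficeLocation': [1, 2, 3, 4, 5, 6, 7, 8, 11, 22, 33, 44, 55, 66],
    'sales': [-500, 300, -1000, -2000, 300, -700, -400, 800,
              300, -400, 1000, 800, 900, -800],
})

df.groupby('region').sales.sum()   # North: 1800, South: -3200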
Answer 1:
As mentioned in the comment, it isn't clear what to do in the 'North' case, where the group sum is positive, but ignoring that, you could do something like the following:
In [200]: df[df.groupby('region').sales.apply(lambda g: g <= g.loc[(g.sort_values().cumsum() > 0.9*g.sum()).idxmin()])]
Out[200]:
region OfficeLocation sales
2 South 3 -1000
3 South 4 -2000
13 North 66 -800
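To unpack that one-liner, here is the same logic written out step by step for a single group (the 'South' rows); the intermediate names are only for illustration:

g = df.loc[df.region == 'South', 'sales']

cum = g.sort_values().cumsum()     # running total, most negative sales first
crossed = cum > 0.9 * g.sum()      # False once the running total has gone past 90% of the (negative) group total
cutoff = crossed.idxmin()          # index of the first such row (False sorts below True)
drivers = g[g <= g.loc[cutoff]]    # keep every sale at least as negative as that row's value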
If, in the positive case, you want to find as few elements as possible that together make up 90% of the sum of the sales, the above solution can be adapted as follows:
def is_driver(group):
    s = group.sum()
    if s > 0:
        # Flip the sign so the "negative total" logic below covers both cases.
        group *= -1
        s *= -1
    a = group.sort_values().cumsum() > 0.9 * s
    return group <= group.loc[a.idxmin()]
In [168]: df[df.groupby('region').sales.apply(is_driver)]
Out[168]:
region OfficeLocation sales
2 South 3 -1000
3 South 4 -2000
10 North 33 1000
12 North 55 900
Note that in the case of a tie, only one element is picked out.
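If you prefer not to negate the group in place, an alternative sketch (not part of the original answer, and assuming no group total is zero) works with each record's share of its group total; drivers_by_share below is a hypothetical helper:

def drivers_by_share(group, pct=0.9):
    # A record's share is positive when it moves in the same direction as the group total.
    share = (group / group.sum()).sort_values(ascending=False)
    # Position of the first record where the cumulative share reaches pct
    # (the final cumulative share is always 1.0, so such a position always exists).
    pos = int((share.cumsum() >= pct).to_numpy().argmax())
    return group.index.isin(share.index[:pos + 1])

mask = df.groupby('region')['sales'].transform(drivers_by_share)
df[mask.astype(bool)]   # South 3 & 4, North 33 & 55, as above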
Source: https://stackoverflow.com/questions/52682394/identify-records-that-make-up-90-of-total