Pandas percentage of total with groupby

Asked 2020-11-22 06:41 by 没有蜡笔的小新 · 14 answers · 2505 views

This is obviously simple, but as a numpy newbie I'm getting stuck.

I have a CSV file that contains 3 columns: the State, the Office ID, and the Sales for that office.
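
For concreteness, a minimal sketch of what loading such a file might look like (the file name sales.csv and the lowercase column names are assumptions, not something given in the question):

    import pandas as pd

    # Hypothetical file name -- the question does not give one.
    df = pd.read_csv("sales.csv")        # columns: state, office_id, sales
    print(df.head())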

14 Answers
  •  甜味超标
    2020-11-22 07:05

    I think this needs benchmarking. Using OP's original DataFrame,

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
        'office_id': list(range(1, 7)) * 2,
        'sales': [np.random.randint(100000, 999999) for _ in range(12)]
    })
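
    The timings quoted below are in IPython %timeit format; a sketch of how each candidate can be timed the same way, assuming an IPython/Jupyter session (the original answer shows only the output, not the benchmarking commands):

    # Assumes an IPython/Jupyter session, since %timeit is an IPython magic.
    # Wrapping a candidate in a function lets %timeit measure the whole pipeline.
    def pct_andy(df):
        c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
        return c / c.groupby(level=0).sum()

    %timeit pct_andy(df)   # e.g. "3.42 ms ± 16.7 µs per loop"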
    

    1st Andy Hayden

    As commented on his answer, Andy takes full advantage of vectorisation and pandas indexing.

    c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
    c / c.groupby(level=0).sum()
    

    3.42 ms ± 16.7 µs per loop
    (mean ± std. dev. of 7 runs, 100 loops each)


    2nd Paul H

    state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
    state = df.groupby(['state']).agg({'sales': 'sum'})
    state_office.div(state, level='state') * 100
    

    4.66 ms ± 24.4 µs per loop
    (mean ± std. dev. of 7 runs, 100 loops each)


    3rd exp1orer

    This is the slowest answer as it calculates x.sum() for each x in level 0.

    For me, this is still a useful answer, though not in its current form. For quick EDA on smaller datasets, apply allows you to use method chaining to write this in a single line. We therefore remove the need to decide on a variable's name, which is actually very computationally expensive for your most valuable resource (your brain!!).

    Here is the modification,

    (
        df.groupby(['state', 'office_id'])
        .agg({'sales': 'sum'})
        .groupby(level=0)
        .apply(lambda x: 100 * x / float(x.sum()))
    )
    

    10.6 ms ± 81.5 µs per loop
    (mean ± std. dev. of 7 runs, 100 loops each)
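
    Before comparing speed further, a quick sanity check (not part of the original answer) that the three approaches agree, with Andy's fraction scaled to a percentage for comparison:

    import numpy as np

    # Not in the original answer: verify all three approaches give the same numbers.
    andy = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
    andy = 100 * andy / andy.groupby(level=0).sum()

    state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
    state = df.groupby(['state']).agg({'sales': 'sum'})
    paul = state_office.div(state, level='state') * 100

    exp1orer = (
        df.groupby(['state', 'office_id'])
        .agg({'sales': 'sum'})
        .groupby(level=0)
        .apply(lambda x: 100 * x / float(x.sum()))
    )

    # andy is a Series; the other two are single-column DataFrames,
    # so compare the underlying values.
    assert np.allclose(andy.values, paul['sales'].values)
    assert np.allclose(paul.values, exp1orer.values)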


    So no one is going to care about 6 ms on a small dataset. However, this is a 3x speed-up, and on a larger dataset with high-cardinality groupbys it is going to make a massive difference.

    Adding to the above code, we make a DataFrame with shape (12,000,000, 3) with 14412 state categories and 600 office_ids,

    import string
    
    import numpy as np
    import pandas as pd
    np.random.seed(0)
    
    groups = [
        ''.join(i) for i in zip(
            np.random.choice(np.array(list(string.ascii_lowercase)), 30000),
            np.random.choice(np.array(list(string.ascii_lowercase)), 30000),
            np.random.choice(np.array(list(string.ascii_lowercase)), 30000),
        )
    ]

    df = pd.DataFrame({
        'state': groups * 400,
        'office_id': list(range(1, 601)) * 20000,
        'sales': [np.random.randint(100000, 999999) for _ in range(12)] * 1000000
    })
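
    As a quick check (not in the original answer), the claimed shape and cardinalities can be verified before timing:

    print(df.shape)                   # (12000000, 3)
    print(df['state'].nunique())      # ~14412 three-letter labels with seed 0
    print(df['office_id'].nunique())  # 600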
    

    Using Andy's,

    2 s ± 10.4 ms per loop
    (mean ± std. dev. of 7 runs, 1 loop each)

    and exp1orer's,

    19 s ± 77.1 ms per loop
    (mean ± std. dev. of 7 runs, 1 loop each)

    So now we see a ~10x speed-up on large, high-cardinality datasets.


    Be sure to upvote these three answers if you upvote this one!!
