Pandas percentage of total with groupby

Asked 2020-11-22 06:41 by 没有蜡笔的小新 · 14 answers · 2505 views

This is obviously simple, but as a numpy newbie I'm getting stuck.

I have a CSV file that contains 3 columns: the State, the Office ID, and the Sales for that office.
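
For concreteness, a minimal sketch of what loading such a file might look like (the file name sales.csv and the lowercase column names are assumptions, not something given in the question):

    import pandas as pd

    # Hypothetical file name -- the question does not give one.
    df = pd.read_csv("sales.csv")        # columns: state, office_id, sales
    print(df.head())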

14 Answers
  •  甜味超标
    2020-11-22 07:05

    I think this needs benchmarking. Using OP's original DataFrame,

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
        'office_id': list(range(1, 7)) * 2,
        'sales': [np.random.randint(100000, 999999) for _ in range(12)]
    })
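
    The timings quoted below are in IPython %timeit format; a sketch of how each candidate can be timed the same way, assuming an IPython/Jupyter session (the original answer shows only the output, not the benchmarking commands):

    # Assumes an IPython/Jupyter session, since %timeit is an IPython magic.
    # Wrapping a candidate in a function lets %timeit measure the whole pipeline.
    def pct_andy(df):
        c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
        return c / c.groupby(level=0).sum()

    %timeit pct_andy(df)   # e.g. "3.42 ms ± 16.7 µs per loop"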
    

    1st Andy Hayden

    As commented on his answer, Andy takes full advantage of vectorisation and pandas indexing.

    c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
    c / c.groupby(level=0).sum()
    

    3.42 ms ± 16.7 µs per loop
    (mean ± std. dev. of 7 runs, 100 loops each)


    2nd Paul H

    state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
    state = df.groupby(['state']).agg({'sales': 'sum'})
    state_office.div(state, level='state') * 100
    

    4.66 ms ± 24.4 µs per loop
    (mean ± std. dev. of 7 runs, 100 loops each)


    3rd exp1orer

    This is the slowest answer as it calculates x.sum() for each x in level 0.

    For me, this is still a useful answer, though not in its current form. For quick EDA on smaller datasets, apply allows you to use method chaining to write this in a single line. We therefore remove the need to decide on a variable's name, which is actually very computationally expensive for your most valuable resource (your brain!!).

    Here is the modification,

    (
        df.groupby(['state', 'office_id'])
        .agg({'sales': 'sum'})
        .groupby(level=0)
        .apply(lambda x: 100 * x / float(x.sum()))
    )
    

    10.6 ms ± 81.5 µs per loop
    (mean ± std. dev. of 7 runs, 100 loops each)
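
    Before comparing speed further, a quick sanity check (not part of the original answer) that the three approaches agree, with Andy's fraction scaled to a percentage for comparison:

    import numpy as np

    # Not in the original answer: verify all three approaches give the same numbers.
    andy = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
    andy = 100 * andy / andy.groupby(level=0).sum()

    state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
    state = df.groupby(['state']).agg({'sales': 'sum'})
    paul = state_office.div(state, level='state') * 100

    exp1orer = (
        df.groupby(['state', 'office_id'])
        .agg({'sales': 'sum'})
        .groupby(level=0)
        .apply(lambda x: 100 * x / float(x.sum()))
    )

    # andy is a Series; the other two are single-column DataFrames,
    # so compare the underlying values.
    assert np.allclose(andy.values, paul['sales'].values)
    assert np.allclose(paul.values, exp1orer.values)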


    So no one is going to care about 6 ms on a small dataset. However, this is a 3x speed-up, and on a larger dataset with high-cardinality groupbys it is going to make a massive difference.

    Adding to the above code, we make a DataFrame with shape (12,000,000, 3) with 14412 state categories and 600 office_ids,

    import string
    
    import numpy as np
    import pandas as pd
    np.random.seed(0)
    
    groups = [
        ''.join(i) for i in zip(
            np.random.choice(np.array(list(string.ascii_lowercase)), 30000),
            np.random.choice(np.array(list(string.ascii_lowercase)), 30000),
            np.random.choice(np.array(list(string.ascii_lowercase)), 30000),
        )
    ]

    df = pd.DataFrame({
        'state': groups * 400,
        'office_id': list(range(1, 601)) * 20000,
        'sales': [np.random.randint(100000, 999999) for _ in range(12)] * 1000000
    })
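
    As a quick check (not in the original answer), the claimed shape and cardinalities can be verified before timing:

    print(df.shape)                   # (12000000, 3)
    print(df['state'].nunique())      # ~14412 three-letter labels with seed 0
    print(df['office_id'].nunique())  # 600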
    

    Using Andy's,

    2 s ± 10.4 ms per loop
    (mean ± std. dev. of 7 runs, 1 loop each)

    and exp1orer's,

    19 s ± 77.1 ms per loop
    (mean ± std. dev. of 7 runs, 1 loop each)

    So now we see a ~10x speed-up on large, high-cardinality datasets.


    Be sure to upvote these three answers if you upvote this one!!
