Sort Values in DataFrame using Categorical Key without groupby Split Apply Combine

Question

I have a DataFrame that looks like this, but much larger:

    DATE        ITEM    STORE   STOCK
0   2018-06-06     A    L001    4
1   2018-06-06     A    L002    0
2   2018-06-06     A    L003    4
3   2018-06-06     B    L001    1
4   2018-06-06     B    L002    2

You can reproduce the same DataFrame with the following code:

import pandas as pd
import numpy as np
import itertools as it

lojas = ['L001', 'L002', 'L003']
itens = list("ABC")
dr = pd.date_range(start='2018-06-06', end='2018-06-12')

df = pd.DataFrame(data=list(it.product(dr, itens, lojas)), columns=['DATE', 'ITEM', 'STORE'])
df['STOCK'] = np.random.randint(0,5, size=len(df.ITEM))

I want to calculate the STOCK difference between consecutive days for every ITEM-STORE pair. Iterating over the groups of a groupby object and calling .diff() on each one makes this easy, and gives something like this:

    DATE       ITEM     STORE   STOCK   DELTA
0   2018-06-06    A     L001    4        NaN
9   2018-06-07    A     L001    0       -4.0
18  2018-06-08    A     L001    4        4.0
27  2018-06-09    A     L001    0       -4.0
36  2018-06-10    A     L001    3        3.0
45  2018-06-11    A     L001    2       -1.0
54  2018-06-12    A     L001    2        0.0

I've managed to do so with the following code:

gg = df.groupby([df.ITEM, df.STORE])
lg = []

for (name, group) in gg:
    aux = group.copy()
    aux.reset_index(drop=True, inplace=True)
    # fillna(..., inplace=True) returns None, so assign the returned Series instead
    aux['DELTA'] = aux.STOCK.diff().fillna(0)

    lg.append(aux)

df = pd.concat(lg)

But on a large DataFrame this becomes impractical. Is there a faster, more Pythonic way to do this?

Answer 1:

I've tried to improve your groupby code, so this should be a lot faster.

v = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff()
df['DELTA'] = np.where(np.isnan(v), 0, v)

Some pointers/ideas here:

Don't iterate over groups
Don't pass series as the groupers if the series belong to the same DataFrame. Pass string labels instead.
diff can be called directly on the groupby object, which vectorizes it across all groups
The last line is tantamount to a fillna, but fillna is slower than np.where
Specifying sort=False skips sorting the group keys, improving performance further
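
To see what the groupby diff actually computes, here is a minimal sketch on a toy frame. The data is made up for illustration (it is not from the question); rows of the two ITEM-STORE pairs are deliberately interleaved to show that the result still lines up with the original index.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two ITEM-STORE pairs with interleaved rows.
toy = pd.DataFrame({
    'DATE': pd.to_datetime(['2018-06-06', '2018-06-06', '2018-06-07',
                            '2018-06-07', '2018-06-08', '2018-06-08']),
    'ITEM': ['A', 'B', 'A', 'B', 'A', 'B'],
    'STORE': ['L001'] * 6,
    'STOCK': [4, 1, 0, 2, 4, 2],
})

# diff() is computed within each group, but the returned Series keeps
# toy's index, so it can be assigned straight back as a column.
v = toy.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff()

# The first row of every group has no predecessor (NaN); replace it with 0.
toy['DELTA'] = np.where(np.isnan(v), 0, v)

print(toy['DELTA'].tolist())
# [0.0, 0.0, -4.0, 1.0, 4.0, 0.0]
```

Note that this assumes the rows of each group are already in date order, as they are in the question's frame; if they were not, a df.sort_values('DATE') beforehand would be needed.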

This can also be re-written as

df['DELTA'] = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff().fillna(0)
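
As a sanity check, the sketch below rebuilds the question's frame and compares a per-group loop against both vectorized forms. The fixed seed and the names rng, loop_delta, where_delta and fillna_delta are assumptions added here for repeatability and illustration; they are not part of the original code.

```python
import itertools as it
import numpy as np
import pandas as pd

# Rebuild the question's frame; the seeded generator is an assumption.
rng = np.random.default_rng(0)
lojas = ['L001', 'L002', 'L003']
itens = list("ABC")
dr = pd.date_range(start='2018-06-06', end='2018-06-12')
df = pd.DataFrame(list(it.product(dr, itens, lojas)),
                  columns=['DATE', 'ITEM', 'STORE'])
df['STOCK'] = rng.integers(0, 5, size=len(df))

# Loop version: one diff per group, keeping the original index so the
# pieces can be put back in the original row order afterwards.
parts = [g.STOCK.diff().fillna(0) for _, g in df.groupby(['ITEM', 'STORE'])]
loop_delta = pd.concat(parts).sort_index()

# Vectorized versions from this answer.
v = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff()
where_delta = pd.Series(np.where(np.isnan(v), 0, v), index=df.index)
fillna_delta = v.fillna(0)

# All three approaches should agree row for row.
assert np.array_equal(loop_delta.to_numpy(), fillna_delta.to_numpy())
assert np.array_equal(where_delta.to_numpy(), fillna_delta.to_numpy())
```

Unlike the loop in the question, neither vectorized form resets the index or reorders rows, so the result drops into the existing DataFrame as a new column.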

Source: https://stackoverflow.com/questions/50727342/sort-values-in-dataframe-using-categorical-key-without-groupby-split-apply-combi

Tags

python

pandas

pandas-groupby