Sort Values in DataFrame using Categorical Key without groupby Split Apply Combine

流过昼夜 submitted on 2020-01-06 06:03:31

Question


So... I have a DataFrame that looks like this, but much larger:

    DATE        ITEM    STORE   STOCK
0   2018-06-06     A    L001    4
1   2018-06-06     A    L002    0
2   2018-06-06     A    L003    4
3   2018-06-06     B    L001    1
4   2018-06-06     B    L002    2

You can reproduce the same DataFrame with the following code:

import pandas as pd
import numpy as np
import itertools as it

lojas = ['L001', 'L002', 'L003']
itens = list("ABC")
dr = pd.date_range(start='2018-06-06', end='2018-06-12')

df = pd.DataFrame(data=list(it.product(dr, itens, lojas)), columns=['DATE', 'ITEM', 'STORE'])
df['STOCK'] = np.random.randint(0,5, size=len(df.ITEM))

I want to calculate the STOCK difference between days for every ITEM-STORE pair. Iterating over the groups of a groupby object and calling .diff() on each one makes it easy to get something like this:

    DATE       ITEM     STORE   STOCK   DELTA
0   2018-06-06    A     L001    4        NaN
9   2018-06-07    A     L001    0       -4.0
18  2018-06-08    A     L001    4        4.0
27  2018-06-09    A     L001    0       -4.0
36  2018-06-10    A     L001    3        3.0
45  2018-06-11    A     L001    2       -1.0
54  2018-06-12    A     L001    2        0.0

I've managed to do so with the following code:

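gg = df.groupby([df.ITEM, df.STORE])
lg = []

for (name, group) in gg:
    aux = group.copy()
    aux.reset_index(drop=True, inplace=True)
    # Day-over-day change within this ITEM-STORE pair. The first row has no
    # previous day, so its NaN is replaced with 0. fillna is not called with
    # inplace=True here, because that returns None instead of the filled Series.
    aux['DELTA'] = aux.STOCK.diff().fillna(value=0)

    lg.append(aux)

df = pd.concat(lg)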

But on a large DataFrame this becomes impractical. Is there a faster, more Pythonic way to do this task?


Answer 1:


I've tried to improve your groupby code, so this should be a lot faster.

v = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff()
df['DELTA'] = np.where(np.isnan(v), 0, v)
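To make the two lines above concrete, here is a minimal sketch on a tiny hand-made frame. The data is invented for illustration and is not the asker's. It shows that diff is computed inside each ITEM-STORE group even when the rows of different groups are interleaved, and that np.where and fillna give the same values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows for item A and item B are interleaved on purpose.
df = pd.DataFrame({
    'ITEM':  ['A', 'A', 'B', 'B', 'A'],
    'STORE': ['L001', 'L001', 'L001', 'L001', 'L001'],
    'STOCK': [4, 0, 1, 2, 3],
})

# Difference to the previous row of the same (ITEM, STORE) group.
# The first row of each group has no predecessor and comes back as NaN.
v = df.groupby(['ITEM', 'STORE'], sort=False)['STOCK'].diff()

# Two ways to replace those NaN values with 0.
via_where = np.where(np.isnan(v), 0, v)   # plain NumPy array
via_fillna = v.fillna(0).to_numpy()       # pandas Series converted to an array

print(via_where)
print(np.array_equal(via_where, via_fillna))
```

Note that the last row (item A, stock 3) is compared with the previous item A row (stock 0), not with the item B row directly above it.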

Some pointers/ideas here:

  1. Don't iterate over groups
  2. Don't pass series as the groupers if the series belong to the same DataFrame. Pass string labels instead.
  3. diff can be vectorized
  4. The last line is tantamount to a fillna, but fillna is slower than np.where
  5. Specifying sort=False will prevent the output from being sorted by grouper keys, improving performance further
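As a rough check of the pointers above, the sketch below builds a frame shaped like the one in the question and compares a loop over groups with the single vectorized call. The function names (make_stock_frame, delta_loop, delta_vectorized) and the fixed random seed are assumptions made for this illustration. Unlike the loop in the question, this loop keeps the original index so the two results can be aligned row by row.

```python
import itertools as it

import numpy as np
import pandas as pd


def make_stock_frame(seed=0):
    """Build a frame like the one in the question, with a fixed seed (an assumption)."""
    rng = np.random.default_rng(seed)
    stores = ['L001', 'L002', 'L003']
    items = list('ABC')
    dates = pd.date_range(start='2018-06-06', end='2018-06-12')
    df = pd.DataFrame(list(it.product(dates, items, stores)),
                      columns=['DATE', 'ITEM', 'STORE'])
    df['STOCK'] = rng.integers(0, 5, size=len(df))
    return df


def delta_loop(df):
    """Split-apply-combine by hand: one diff per group, then concatenate."""
    parts = []
    for _, group in df.groupby(['ITEM', 'STORE']):
        aux = group.copy()
        aux['DELTA'] = aux['STOCK'].diff().fillna(0)
        parts.append(aux)
    # Restore the original row order so the result lines up with df.
    return pd.concat(parts).sort_index()


def delta_vectorized(df):
    """One groupby call with string labels and no Python-level loop."""
    out = df.copy()
    out['DELTA'] = out.groupby(['ITEM', 'STORE'], sort=False)['STOCK'].diff().fillna(0)
    return out


df = make_stock_frame()
loop_result = delta_loop(df)
vec_result = delta_vectorized(df)

# Both approaches should produce the same DELTA for every row.
same = np.array_equal(loop_result['DELTA'].to_numpy(),
                      vec_result['DELTA'].to_numpy())
print(same)
```

On a frame this small the speed difference is negligible. The loop's cost grows with the number of ITEM-STORE groups, which is where the single groupby call is expected to help.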

This can also be rewritten as:

df['DELTA'] = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff().fillna(0)
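One caveat worth sketching: diff works on rows in the order they appear inside each group, so the result is a day-over-day difference only if the rows are already in date order. The frame in the question is built in date order, so this does not affect it. The example below uses invented, deliberately shuffled data and sorts with sort_values first. Sorting by ITEM and STORE as well is only a display choice here, not a requirement.

```python
import pandas as pd

# Hypothetical data with dates out of order inside each ITEM-STORE pair.
df = pd.DataFrame({
    'DATE': pd.to_datetime(['2018-06-08', '2018-06-06', '2018-06-07',
                            '2018-06-07', '2018-06-06']),
    'ITEM': ['A', 'A', 'A', 'B', 'B'],
    'STORE': ['L001'] * 5,
    'STOCK': [4, 1, 3, 2, 5],
})

# Put each pair's rows in chronological order before differencing.
ordered = df.sort_values(['ITEM', 'STORE', 'DATE']).reset_index(drop=True)

# Same one-liner as above, now applied to chronologically ordered rows.
ordered['DELTA'] = (
    ordered.groupby(['ITEM', 'STORE'], sort=False)['STOCK'].diff().fillna(0)
)

print(ordered)
```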


Source: https://stackoverflow.com/questions/50727342/sort-values-in-dataframe-using-categorical-key-without-groupby-split-apply-combi
