Dask DataFrame: Resample over groupby object with multiple rows

Submitted by 試著忘記壹切 on 2019-12-04 18:09:15

If we can assume that each user-id group fits in memory, then I recommend using dask.dataframe for the outer groupby, but using pandas for the operations within each group, along the lines of the following.

def per_group(blk):
    # blk arrives as a plain pandas DataFrame holding one user_id group;
    # index it by timestamp and sum 'text' into 3-hour bins.
    # (resample('3H', how='sum') is the deprecated spelling of .resample('3H').sum().)
    return blk.set_index('ts').text.resample('3H').sum()

# meta tells dask the structure of each group's output
# (it replaces the older columns= hint).
df.groupby('user_id').apply(per_group, meta=('text', 'object')).compute()

This decouples the two hard parts of the problem across the two projects:

  1. Shuffling all of the user-ids into the right groups is handled by dask.dataframe.
  2. Doing the complex datetime resampling within each group is handled explicitly by pandas.

Ideally dask.dataframe would write the per-group function for you automatically. At the moment dask.dataframe does not intelligently handle multi-indexes or resampling on top of multi-column groupbys, so the automatic solution isn't yet available. Still, it's quite possible to fall back to pandas for the per-block computation while using dask.dataframe to prepare the groups accordingly.
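To make the per-group step concrete, here is a minimal plain-pandas sketch of what `per_group` does to a single user's rows, using hypothetical toy data (the `ts`/`text` column names follow the snippet above; the values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data: one user's rows, with a timestamp column 'ts'
# and a numeric 'text' column (e.g. a per-row message count).
blk = pd.DataFrame({
    'ts': pd.to_datetime([
        '2016-01-01 00:15:00',
        '2016-01-01 01:45:00',
        '2016-01-01 04:10:00',
    ]),
    'text': [1, 2, 3],
})

def per_group(blk):
    # Index by timestamp, then sum 'text' into 3-hour bins.
    return blk.set_index('ts').text.resample('3H').sum()

result = per_group(blk)
print(result)
# The first two rows fall in the 00:00-03:00 bin, the third in 03:00-06:00.
```

Under dask, this exact function runs once per user-id group; each `blk` it receives is an ordinary in-memory pandas DataFrame.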

Try converting your index to a DatetimeIndex like this:

import pandas as pd
# ...
# pd.to_datetime is vectorized, so it is much faster than mapping
# datetime.strptime over each row; note there is no dd.DatetimeIndex --
# the index type comes from pandas.
df.index = pd.to_datetime(df.index, format='%Y-%m-%d %H:%M:%S')
# ...
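A short end-to-end sketch of why the conversion matters: `.resample()` requires a datetime-like index, so it fails on string timestamps but works after `pd.to_datetime`. The frame below is hypothetical sample data:

```python
import pandas as pd

# Hypothetical frame whose index holds timestamps as strings.
df = pd.DataFrame(
    {'text': [1, 2, 3]},
    index=['2016-01-01 00:15:00', '2016-01-01 01:45:00', '2016-01-01 04:10:00'],
)

# Convert the string index into a proper DatetimeIndex.
df.index = pd.to_datetime(df.index, format='%Y-%m-%d %H:%M:%S')

# Resampling now works: sum 'text' into 3-hour bins.
binned = df.text.resample('3H').sum()
print(binned)
```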