basic groupby operations in Dask

萝らか妹 submitted on 2019-12-10 21:30:12

Question


I am attempting to use Dask to handle a large file (50 GB). Typically, I would load the data into memory and use pandas. I want to group by two columns, "A" and "B", and whenever column "C" has a value, I want to repeat (forward-fill) that value down column "C" within that particular group.

In pandas, I would do the following:

df['C'] = df.groupby(['A','B'])['C'].fillna(method = 'ffill')
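To make the intended behaviour concrete, here is a minimal pandas sketch on made-up data (the values are illustrative and not from the original question). It uses `.ffill()` on the grouped column, which is the shorter spelling of `fillna(method='ffill')`, and shows that the fill never crosses from one group into another.

```python
import numpy as np
import pandas as pd

# Small illustrative frame: two groups, ("x", 1) and ("y", 1).
df = pd.DataFrame({
    "A": ["x", "x", "x", "y", "y"],
    "B": [1, 1, 1, 1, 1],
    "C": [1.0, np.nan, np.nan, np.nan, 2.0],
})

# Forward-fill column C separately inside each (A, B) group.
df["C"] = df.groupby(["A", "B"])["C"].ffill()

print(df)
# Group ("x", 1) becomes 1.0, 1.0, 1.0.
# Group ("y", 1) stays NaN, 2.0: its leading NaN has no earlier value
# in the same group, so it is not filled from group ("x", 1).
```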

What would be the equivalent in Dask? I am also a little lost as to how to structure problems in Dask as opposed to pandas.

Thank you.

My progress so far:

First set index:

df1 = df.set_index(['A','B'])

Then groupby:

df1.groupby(['A','B']).apply(lambda x: x.fillna(method='ffill')).compute()

Answer 1:


It appears Dask does not currently implement the fillna method for GroupBy objects. I tried to submit a PR for it some time ago and gave up quite quickly.

Also, Dask doesn't support the method parameter of fillna (it isn't always trivial to implement with delayed algorithms).

A workaround for this could be using fillna before grouping, like so:

df['C'] = df.fillna(0).groupby(['A','B'])['C']

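Note that I have not tested this, and that it fills missing values with a constant (0) rather than forward-filling within each group.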

You can find my (failed) attempt here: https://github.com/nirizr/dask/tree/groupy_fillna



Source: https://stackoverflow.com/questions/38901845/basic-groupby-operations-in-dask
