Aggregate a Dask dataframe and produce a dataframe of aggregates

Posted by 空扰寡人 on 2019-12-01 18:40:38

The following does indeed work:

gb = df.groupby(['customer', 'url', 'ts'])
gb.apply(lambda d: pd.DataFrame({'views': len(d),
     'visitors': d.session_id.nunique(),
     'referrers': [d.referer.tolist()]})).reset_index()

This assumes visitors should be counted as unique session IDs, as in the SQL above. You may also wish to pass meta to apply, so that Dask does not have to infer the output schema.

The GitHub issue that @j-bennet opened gives an additional option. Based on that issue, we implemented the aggregation as follows:
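import itertools
import dask.dataframe as dd

custom_agg = dd.Aggregation(
    'custom_agg',
    # chunk step: per partition, collect each group's values into a set
    lambda s: s.apply(set),
    # agg step: union the per-partition sets of each group into one list
    lambda s: s.apply(lambda chunks: list(set(itertools.chain.from_iterable(chunks)))),
)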
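To see what the two lambdas are doing, here is a plain-Python sketch of the same two stages without Dask. It assumes each partition is simply a list of (key, value) pairs; the names chunk_stage and agg_stage and the sample data are hypothetical, and this only emulates the idea rather than reproducing Dask's internals.

```python
import itertools
from collections import defaultdict


def chunk_stage(partition):
    """Per-partition step: collect the distinct values seen for each key."""
    sets = defaultdict(set)
    for key, value in partition:
        sets[key].add(value)
    return dict(sets)


def agg_stage(chunk_results):
    """Combine step: union the per-partition sets of each key into one list."""
    grouped = defaultdict(list)
    for result in chunk_results:
        for key, values in result.items():
            grouped[key].append(values)
    return {
        key: list(set(itertools.chain.from_iterable(chunks)))
        for key, chunks in grouped.items()
    }


# Hypothetical data split across two partitions.
partitions = [
    [(("a", 1), "r1"), (("a", 1), "r2"), (("b", 2), "r3")],
    [(("a", 1), "r1"), (("b", 2), "r4")],
]
combined = agg_stage([chunk_stage(p) for p in partitions])
print({key: sorted(values) for key, values in combined.items()})
```

The value "r1" appears in both partitions for the same key but ends up only once in the final list, which is the point of taking a set in both stages.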
To combine it with a count, pass both aggregations to agg:
dfgp = df.groupby(['ID1', 'ID2'])
df2 = dfgp.agg([custom_agg, 'count']).reset_index()
