Avoiding Memory Issues For GroupBy on Large Pandas DataFrame


You could use dask.dataframe for this task:

import dask.dataframe as dd

# convert the pandas DataFrame into a dask DataFrame; npartitions (or chunksize) is required
ddf = dd.from_pandas(df, npartitions=8)
# groupby is lazy; .compute() triggers the actual work
result = ddf.groupby('id').max().reset_index().compute()

All you need to do is convert your pandas.DataFrame into a dask.DataFrame. Dask is a Python out-of-core parallelization framework that offers several parallelized container types, one of which is the dataframe. It lets you perform most common pandas.DataFrame operations in parallel and/or distributed on data that is too large to fit in memory. The core of Dask is a set of schedulers and an API for building computation graphs, which is why we have to call .compute() at the end for any computation to actually take place. The library is easy to install because it is written mostly in pure Python.

If your data contains categorical columns (as opposed to categories stored as object or string columns), make sure you pass the observed=True option to your groupby call. This ensures that rows are only created for combinations that actually appear in the data, e.g. only one row per observed customer_id, order_id combination, rather than n_custs * n_orders rows!

I just did a groupby-sum on a 26M row dataset, never going above 7GB of RAM. Before adding the observed=True option, it was going up to 62GB and then running out.
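As a minimal sketch of what that looks like (the customer_id / order_id / amount column names are just placeholders for this example):

import pandas as pd

# toy data with categorical key columns (hypothetical names)
df = pd.DataFrame({
    'customer_id': pd.Categorical(['a', 'a', 'b']),
    'order_id': pd.Categorical(['x', 'y', 'x']),
    'amount': [10, 20, 30],
})

# observed=True keeps only the combinations that actually occur in the data,
# instead of the full cross product of all category levels
result = df.groupby(['customer_id', 'order_id'], observed=True)['amount'].sum()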

As another idea, you could split the DataFrame column-wise into, say, four subsets, keep the id column in each subset, perform the groupby on each subset separately, and then merge the results back together.
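A rough sketch of that idea, assuming a DataFrame df with an 'id' column; the split count of four and the max aggregation are purely illustrative:

import numpy as np
import pandas as pd

# split the non-id columns into four roughly equal groups
value_cols = [c for c in df.columns if c != 'id']
col_chunks = np.array_split(value_cols, 4)

# run the groupby on each narrow subset, keeping 'id' in every subset
partials = [df[['id'] + list(chunk)].groupby('id').max() for chunk in col_chunks]

# merge the partial results back into a single frame on the id index
result = pd.concat(partials, axis=1).reset_index()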
