Avoiding Memory Issues For GroupBy on Large Pandas DataFrame

Submitted by 寵の児 on 2019-12-01 07:47:05

Question


Update:

The pandas df was created like this:

df = pd.read_sql(query, engine)
encoded = pd.get_dummies(df, columns=['account'])

Creating a dask df from this df looks like this:

df = dd.from_pandas(encoded, 50)

Performing the operation with dask results in no visible progress being made (checking with dask diagnostics):

result = df.groupby('journal_entry').max().reset_index().compute()
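
For reference, one way such a diagnostic check might look is shown below. This is only a sketch: it assumes the default single-machine scheduler (dask.diagnostics.ProgressBar does not report progress for the dask.distributed scheduler) and the dask dataframe `df` built above.

from dask.diagnostics import ProgressBar

# Wrap the computation in a ProgressBar context so the local scheduler
# prints a progress bar while the graph executes.
with ProgressBar():
    result = df.groupby('journal_entry').max().reset_index().compute()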

Original:

I have a large pandas df with 2.7M rows and 4,000 columns. All but four of the columns are of dtype uint8. The uint8 columns only hold values of 1 or 0. I am attempting to perform this operation on the df:

result = df.groupby('id').max().reset_index()

Predictably, this operation immediately returns a memory error. My initial thought is to chunk the df both horizontally and vertically. However, this creates a messy situation, since the .max() needs to be performed across all the uint8 columns, not just a pair of columns. In addition, it is still extremely slow to chunk the df like this. I have 32 GB of RAM on my machine.
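
For context, a minimal sketch of the row-wise part of such chunking, assuming the frame and the 'id' key described above; the chunk size is an arbitrary illustrative value, and as noted this approach remains slow and fiddly:

import pandas as pd

# Illustrative sketch only: aggregate each row chunk, then re-aggregate the
# partial results. Because max is associative, the final answer matches a
# single full groupby even when an id is split across chunks.
chunk_size = 100_000
partials = []
for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size]
    partials.append(chunk.groupby('id').max())

result = pd.concat(partials).groupby(level=0).max().reset_index()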

What strategy could mitigate the memory issue?


Answer 1:


You could use dask.dataframe for this task:

import dask.dataframe as dd

# from_pandas requires a partition count (or chunk size); 50 matches the value used in the update above
ddf = dd.from_pandas(df, npartitions=50)
result = ddf.groupby('id').max().reset_index().compute()

All you need to do is convert your pandas.DataFrame into a dask.dataframe. Dask is a Python out-of-core parallelization framework that offers several parallelized container types, one of which is the dataframe. It lets you perform most common pandas.DataFrame operations in parallel and/or distributed fashion on data that is too large to fit in memory. The core of dask is a set of schedulers and an API for building computation graphs, hence we have to call .compute() at the end for any computation to actually take place. The library is easy to install because it is written mostly in pure Python.
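
As a rough sketch only, the partition count could be derived from the in-memory size of the pandas frame before converting; the ~100 MB-per-partition target used here is an assumption for illustration, not an official dask recommendation:

import dask.dataframe as dd

nbytes = df.memory_usage(deep=True).sum()               # size of the pandas frame in bytes
npartitions = max(1, int(nbytes // (100 * 2**20)) + 1)  # aim for roughly 100 MB per partition (illustrative)

ddf = dd.from_pandas(df, npartitions=npartitions)
print(ddf.npartitions)                                  # sanity-check how the frame was split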




Answer 2:


If you have any categorical columns in your data (rather than categories stored as object columns or strings), make sure you use the observed=True option in your groupby call. This ensures pandas only creates rows for combinations that are actually present, e.g. only one row per customer_id, order_id combination, rather than creating n_custs * n_orders rows!

I just did a groupby-sum on a 26M row dataset, never going above 7GB of RAM. Before adding the observed=True option, it was going up to 62GB and then running out.
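
A minimal sketch of the difference, using a hypothetical frame whose customer_id and order_id columns are stored as category dtype:

import pandas as pd

# Hypothetical frame: the key columns are category dtype, which is what
# observed=True applies to.
df = pd.DataFrame({
    'customer_id': pd.Categorical(['a', 'a', 'b']),
    'order_id': pd.Categorical([1, 2, 2]),
    'amount': [10.0, 5.0, 7.5],
})

# observed=True keeps only category combinations that actually occur,
# instead of materializing the full n_custs * n_orders cartesian product.
result = df.groupby(['customer_id', 'order_id'], observed=True)['amount'].sum().reset_index()
print(result)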




Answer 3:


As an idea, I would suggest splitting the data column-wise into, say, four subsets, keeping the id column in each subset so the operation can be performed on each, and then re-merging the results, as sketched below.
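
A minimal sketch of that idea, assuming the frame and the 'id' key from the question; the split count of four and the column handling are illustrative:

import numpy as np

# Split the non-key columns into four groups, aggregate each group together
# with 'id', then merge the partial results back on 'id'.
value_cols = [c for c in df.columns if c != 'id']
col_chunks = np.array_split(value_cols, 4)

partials = [
    df[['id', *chunk]].groupby('id').max().reset_index()
    for chunk in col_chunks
]

result = partials[0]
for part in partials[1:]:
    result = result.merge(part, on='id')

Each partial groupby only touches a quarter of the columns, so its intermediate result is smaller; the merges at the end still build up the full-width result, so the memory concern is only partly relieved.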



Source: https://stackoverflow.com/questions/50051210/avoiding-memory-issues-for-groupby-on-large-pandas-dataframe
