Python: Unstacked DataFrame is too big, causing int32 overflow

人走茶凉 提交于 2021-02-19 09:44:14

问题


I have a big dataset and when I try to run this code I get a memory error.

user_by_movie = user_items.groupby(['user_id', 'movie_id'])['rating'].max().unstack()

here is the error:

ValueError: Unstacked DataFrame is too big, causing int32 overflow

I have run it on another machine and it worked fine! how can I fix this error?


回答1:


As it turns out this was not an issue on pandas 0.21. I am using a Jupyter notebook and I need the latest version of pandas for the rest of the code. So I did this:

!pip install pandas==0.21
import pandas as pd
user_by_movie = user_items.groupby(['user_id', 'movie_id'])['rating'].max().unstack()
!pip install pandas

This code works on the Jupyter notebook. First, it downgrades pandas to 0.21 and runs the code. After having the required dataset it updates pandas to the latest version. check the issue raised on GitHub here. This post was also helpful to increase memory of Jupyter notebook.




回答2:


According to Google, you can downgrade your pandas version to 0.21 which has no problem with pivot table and too big data.




回答3:


As @Ehsan pointed out, we can pivot the tables in chunks.

Suppose you have a DataFrame with 3,355,205 rows!
Let's build chunks of size 5000:

chunk_size = 5000
chunks = [x for x in range(0, df.shape[0], chunk_size)]

for i in range(0, len(chunks) - 1):
    print(chunks[i], chunks[i + 1] - 1)

0 4999
5000 9999
10000 14999
15000 19999
20000 24999
25000 29999
30000 34999
35000 39999
40000 44999
45000 49999
50000 54999
55000 59999
60000 64999
65000 69999
70000 74999
75000 79999
..continue..

All you have to do now is a list comprehension inside a pd.concat():

df_new = pd.concat([df.iloc[ chunks[i]:chunks[i + 1] - 1 ].pivot(index='user_id', columns='item', values='views') for i in range(0, len(chunks) - 1)])

This answer is good when you have to make a sparse matrix to some recommendation system.
After this you could do:

from scipy import sparse
spr = sparse.coo_matrix(df_new.to_numpy())


来源:https://stackoverflow.com/questions/61757170/python-unstacked-dataframe-is-too-big-causing-int32-overflow

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!