Jupyter notebook kernel dies when creating dummy variables with pandas

Submitted by 心已入冬 on 2019-12-07 06:27:01

Question


I am working on the Walmart Kaggle competition and I'm trying to create dummy columns from the "FinelineNumber" column. For context, df.shape returns (647054, 7). I am trying to make dummy columns for df['FinelineNumber'], which has 5,196 unique values. The result should be a DataFrame of shape (647054, 5196), which I then plan to concat to the original DataFrame.

Nearly every time I run fineline_dummies = pd.get_dummies(df['FinelineNumber'], prefix='fl'), I get the following error message: The kernel appears to have died. It will restart automatically. I am running Python 2.7 in a Jupyter notebook on a MacBook Pro with 16 GB of RAM.

Can someone explain why this is happening (and why it happens most of the time but not every time)? Is this a Jupyter notebook bug or a pandas bug? Also, I thought it might be a lack of RAM, but I get the same error on a Microsoft Azure Machine Learning notebook with more than 100 GB of RAM. On Azure ML, the kernel dies every time, almost immediately.


Answer 1:


It very much could be memory usage: a 647054 × 5196 DataFrame has 3,362,092,584 elements, which at 8 bytes apiece on a 64-bit system comes to roughly 25 GiB for the dense result alone. On AzureML, although the VM itself has a large amount of memory, the memory actually available to your notebook is limited (currently 2 GB, soon to be 4 GB), and when you hit that limit the kernel typically dies. So a memory usage issue seems very likely.
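For reference, here is the back-of-the-envelope arithmetic as a quick sketch (the 8-bytes-per-element figure assumes a float64 dtype, which is what get_dummies returned by default in pandas versions of that era):

    rows, cols = 647054, 5196
    elements = rows * cols        # 3,362,092,584 elements
    dense_bytes = elements * 8    # 8 bytes per float64 element
    print('%.1f GiB' % (dense_bytes / 2.0 ** 30))  # ~25.0 GiB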

You might try calling .to_sparse() on the DataFrame before doing any additional manipulation. Since the dummy columns are almost entirely zeros, a sparse representation lets pandas store them in far less memory than the dense form.
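As a minimal sketch of the sparse approach (the tiny df below is a hypothetical stand-in for the competition data; note also that DataFrame.to_sparse() was later deprecated and removed in pandas 1.0, and get_dummies accepts a sparse=True flag that avoids ever materializing the dense frame):

    import pandas as pd

    # Hypothetical stand-in: only the FinelineNumber column matters here.
    df = pd.DataFrame({'FinelineNumber': [1000, 1000, 2734, 9997, 2734]})

    # Build the dummy columns backed by SparseArrays instead of dense ones;
    # the mostly-zero indicator matrix then costs memory roughly in
    # proportion to its non-zero entries.
    fineline_dummies = pd.get_dummies(df['FinelineNumber'],
                                      prefix='fl', sparse=True)

    print(fineline_dummies.memory_usage(deep=True).sum())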



Source: https://stackoverflow.com/questions/34047782/jupyter-notebook-kernel-dies-when-creating-dummy-variables-with-pandas
