Python Memory Error encountered when replacing NaN values in large Pandas dataframe

Submitted by a 夏天 on 2020-02-04 23:06:32

Question


I have a very large pandas dataframe: ~300,000 columns and ~17,520 rows. The pandas dataframe is called result_full. I am attempting to replace all of the strings "NaN" with numpy.nan:

result_full.replace(["NaN"], np.nan, inplace = True)

This is where I get a MemoryError. Is there a memory-efficient way to drop these strings from my dataframe? I tried result_full.dropna(), but it didn't work because the values are technically the string "NaN" rather than actual NaN.


Answer 1:


One possible cause is running on a 32-bit machine, where a Python process can address only around 2 GB of memory. If possible, move to a 64-bit machine and 64-bit Python to avoid such problems in the future.

Meanwhile, there is a workaround. Write the dataframe to a CSV file with df.to_csv(). Once that's done, if you look at the pandas documentation for pd.read_csv(), you will notice this parameter:

na_values : scalar, str, list-like, or dict, default None

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘nan’.

So, when you read the CSV back in, the string 'NaN' will be recognized as np.nan and your problem should be solved.

Likewise, if you are creating this dataframe directly from a CSV in the first place, you can rely on this parameter and avoid the memory problem entirely. A sketch of the round trip is shown below. Hope it helps. Cheers!
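A minimal sketch of the CSV round trip described above, assuming result_full is the dataframe from the question; the temporary file name is purely illustrative:

import numpy as np
import pandas as pd

# Hypothetical temporary file path, used only for this example.
tmp_path = "result_full_tmp.csv"

# Write the large dataframe to disk; index=False keeps the file smaller.
result_full.to_csv(tmp_path, index=False)

# Read it back. The string "NaN" is in read_csv's default NA list, so it is
# parsed straight to np.nan; na_values is passed explicitly here just to make
# the intent clear (it adds strings on top of the defaults).
result_full = pd.read_csv(tmp_path, na_values=["NaN"])

# Rows containing NaN can now be dropped, which was the original goal.
result_full = result_full.dropna()

If even reading the file back exceeds available memory, read_csv's chunksize parameter lets you process the CSV in smaller pieces instead of loading it all at once.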



Source: https://stackoverflow.com/questions/44299013/python-memory-error-encountered-when-replacing-nan-values-in-large-pandas-datafr
