Python Memory Error encountered when replacing NaN values in large Pandas dataframe

Submitted by a 夏天 on 2020-02-04 23:06:32

Question


I have a very large pandas dataframe: ~300,000 columns and ~17,520 rows. The pandas dataframe is called result_full. I am attempting to replace all of the strings "NaN" with numpy.nan:

result_full.replace(["NaN"], np.nan, inplace = True)

This is where I get a MemoryError. Is there a memory-efficient way to drop these strings from my dataframe? I tried result_full.dropna(), but it didn't work because the values are technically the string "NaN" rather than actual NaN.


Answer 1:


One possible cause is running on a 32-bit machine, where a Python process can address only around 2 GB of memory. If possible, move to a 64-bit machine and 64-bit Python to avoid such problems in the future.

Meanwhile, there is a workaround. Write the dataframe to a CSV file with df.to_csv(). Once that's done, if you look at the pandas documentation for pd.read_csv(), you will notice this parameter:

na_values : scalar, str, list-like, or dict, default None

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘nan’.

So, when you read the CSV back in, the string 'NaN' will be recognized as np.nan and your problem should be solved.

Likewise, if you are creating this dataframe directly from a CSV in the first place, you can rely on this parameter and avoid the memory problem entirely. A sketch of the round trip is shown below. Hope it helps. Cheers!
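A minimal sketch of the CSV round trip described above, assuming result_full is the dataframe from the question; the temporary file name is purely illustrative:

import numpy as np
import pandas as pd

# Hypothetical temporary file path, used only for this example.
tmp_path = "result_full_tmp.csv"

# Write the large dataframe to disk; index=False keeps the file smaller.
result_full.to_csv(tmp_path, index=False)

# Read it back. The string "NaN" is in read_csv's default NA list, so it is
# parsed straight to np.nan; na_values is passed explicitly here just to make
# the intent clear (it adds strings on top of the defaults).
result_full = pd.read_csv(tmp_path, na_values=["NaN"])

# Rows containing NaN can now be dropped, which was the original goal.
result_full = result_full.dropna()

If even reading the file back exceeds available memory, read_csv's chunksize parameter lets you process the CSV in smaller pieces instead of loading it all at once.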



Source: https://stackoverflow.com/questions/44299013/python-memory-error-encountered-when-replacing-nan-values-in-large-pandas-datafr
