Is there a memory efficient way to replace a list of values in a pandas dataframe?

Posted by 青春壹個敷衍的年華 on 2019-12-14 03:56:00

Question


I am trying to replace all of the unique strings in a large pandas DataFrame (1.5 million rows and about 15 columns) with an integer index. My problem is that the DataFrame is 2 GB and my list of unique strings ends up with around 80,000 or more entries.

To produce my list of unique strings I use:

unique_string_list = pd.unique(df.values.ravel()).tolist()
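For comparison, pd.factorize can produce the unique values and an integer code for every cell in a single pass over the same flattened array. The following is only a minimal sketch on a hypothetical three-row frame (the column names "a" and "b" and the variable name encoded are made up for illustration); whether it fits in memory on the real data has not been measured here.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real 1.5-million-row data.
df = pd.DataFrame({"a": ["x", "y", "x"], "b": ["z", "x", "y"]})

# factorize returns one integer code per flattened cell, plus the unique
# values in order of first appearance (row by row, because of ravel()).
codes, uniques = pd.factorize(df.values.ravel())

# Reshape the flat codes back into the original layout.
encoded = pd.DataFrame(codes.reshape(df.shape), index=df.index, columns=df.columns)

unique_string_list = list(uniques)
print(encoded)
print(unique_string_list)
```

Note that pd.factorize assigns the code -1 to missing values by default, so NaN cells would need separate handling.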

Then, if I try to use df.replace() with either a pair of lists or a dictionary, the memory overhead is too much for my 8 GB of RAM. The problem is the size of the replacement list: even if I pass only a chunk of a few thousand rows, it eats all the RAM:

# Map each unique string to its integer position.
mapdict = {s: i for i, s in enumerate(unique_string_list)}
# Apply the same mapping to every column.
replacedict = {column: mapdict for column in df.columns}
df = df.replace(replacedict)
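One alternative that may be worth trying is to apply the same dictionary one column at a time with Series.map, which does a plain lookup per value. This is a sketch under assumptions, not a measured fix: the toy frame is hypothetical, and it assumes every value in the frame is a key of mapdict (values that are not would become NaN).

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data.
df = pd.DataFrame({"a": ["x", "y", "x"], "b": ["z", "x", "y"]})

unique_string_list = pd.unique(df.values.ravel()).tolist()
mapdict = {s: i for i, s in enumerate(unique_string_list)}

# Replace each column in turn so only one column is processed at a time.
for column in df.columns:
    df[column] = df[column].map(mapdict)

print(df)
```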

I have tried looping over the string list instead. This reduces the memory overhead, but it is very slow and takes longer than overnight to run.

Any help here would be very much appreciated.

Source: https://stackoverflow.com/questions/26492270/is-there-a-memory-efficient-way-to-replace-a-list-of-values-in-a-pandas-datafram
