Memory efficient way to store bool and NaN values in pandas

Posted by 徘徊边缘 on 2020-01-04 07:37:07

Question


I am working with a fairly large dataset (over 4 GB) that I imported into pandas. Quite a few columns in this dataset are simple True/False indicators, and naturally the most memory-efficient way to store them would be a bool dtype. However, these columns also contain some NaN values that I want to preserve. Right now, this leads to the columns having dtype float (with values 1.0, 0.0 and np.nan) or object, and both use far too much memory.

As an example:

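import numpy as np
import pandas as pd

# Three identical columns holding True, False and NaN (initially dtype object)
df = pd.DataFrame([[True, True, True], [False, False, False],
                   [np.nan, np.nan, np.nan]])
df[1] = df[1].astype(bool)   # bool is compact, but NaN is silently cast to True
df[2] = df[2].astype(float)  # float keeps NaN, at 8 bytes per value
print(df)
print(df.memory_usage(index=False, deep=True))
print(df.memory_usage(index=False, deep=False))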

results in

       0      1    2
0   True   True  1.0
1  False  False  0.0
2    NaN   True  NaN

0       100
1         3
2        24
dtype: int64

0        24
1         3
2        24
dtype: int64

What would be the most efficient way to store these kinds of values, knowing they can only take on 3 different values: True, False and <undefined>?


Answer 1:


Use dtype: int8

1 = True
0 = False
-1 = NaN

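int8 uses 1 byte per value, so this takes 4 times less memory than float32 and 8 times less than float64. Note that -1 is only a sentinel: pandas will not treat it as missing, so it has to be handled explicitly.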
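
The mapping above could be applied along these lines. This is only a minimal sketch: the helper names `encode_tristate` and `decode_tristate` are made up for illustration, and it assumes each column contains nothing but True/False (or 1.0/0.0) and NaN.

```python
import numpy as np
import pandas as pd


def encode_tristate(col: pd.Series) -> pd.Series:
    """Hypothetical helper: True -> 1, False -> 0, NaN -> -1, stored as int8."""
    # Go through float first so that object columns (True/False/NaN)
    # and float columns (1.0/0.0/NaN) are handled the same way.
    return col.astype(float).fillna(-1).astype(np.int8)


def decode_tristate(col: pd.Series) -> pd.Series:
    """Hypothetical inverse: turn the -1 sentinel back into NaN (float result)."""
    return col.astype(float).where(col != -1)


s = pd.Series([True, False, np.nan], dtype=object)
encoded = encode_tristate(s)
print(encoded.tolist())                    # values after encoding
print(encoded.memory_usage(index=False))   # bytes used by the int8 data
print(decode_tristate(encoded).tolist())   # sentinel restored to NaN
```

Comparisons such as `encoded == 1` then select the True rows without ever touching the -1 entries, but aggregations like `sum()` or `mean()` would count the sentinel, so it is safer to decode or mask first.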



Source: https://stackoverflow.com/questions/50877663/memory-efficient-way-to-store-bool-and-nan-values-in-pandas
