Question
I am working with quite a large dataset (over 4 GB), which I imported into pandas. Quite a few columns in this dataset are simple True/False indicators, and naturally the most memory-efficient way to store these would be a bool dtype. However, these columns also contain some NaN values that I want to preserve. Right now, this leads to the columns having dtype float (with values 1.0, 0.0 and np.nan) or object, and both use far too much memory.
As an example:
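import numpy as np
import pandas as pd

df = pd.DataFrame([[True, True, True], [False, False, False],
                   [np.nan, np.nan, np.nan]])
df[1] = df[1].astype(bool)   # NaN is truthy, so it silently becomes True
df[2] = df[2].astype(float)  # NaN is preserved, at 8 bytes per value
print(df)
print(df.memory_usage(index=False, deep=True))
print(df.memory_usage(index=False, deep=False))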
This results in:
       0      1    2
0   True   True  1.0
1  False  False  0.0
2    NaN   True  NaN
0    100
1      3
2     24
dtype: int64
0    24
1     3
2    24
dtype: int64
What would be the most efficient way to store these kinds of columns, knowing they can only take on 3 different values: True, False and <undefined>?
Answer 1:
Use dtype int8 with the following encoding:
1 = True
0 = False
-1 = NaN
An int8 takes 1 byte per value, so this uses 4 times less memory than float32 and 8 times less than float64.
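A minimal sketch of how that encoding could be applied, assuming the column only ever holds True, False or NaN. The names MISSING, encode_tristate and decode_tristate are made up for this illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical sentinel standing in for NaN; any value other than 0 or 1 would do.
MISSING = -1

def encode_tristate(col: pd.Series) -> pd.Series:
    """Map True/False/NaN to 1/0/-1, stored as int8 (1 byte per value)."""
    # Go through float64 first, then fill NaN so the int8 cast never sees it.
    return col.astype("float64").fillna(MISSING).astype("int8")

def decode_tristate(col: pd.Series) -> pd.Series:
    """Map 1/0/-1 back to 1.0/0.0/NaN as float64."""
    # where() keeps values where the condition holds and puts NaN elsewhere.
    return col.astype("float64").where(col != MISSING)

raw = pd.Series([True, False, np.nan])  # object dtype, like column 0 in the question
packed = encode_tristate(raw)
restored = decode_tristate(packed)

print(packed.dtype)                      # int8
print(packed.memory_usage(index=False))  # bytes used by the packed values
```

The catch is that -1 is only a convention: pandas does not treat it as missing, so isna(), mean() and similar methods will count it as a regular value unless you decode or mask it first.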
Source: https://stackoverflow.com/questions/50877663/memory-efficient-way-to-store-bool-and-nan-values-in-pandas