Question
I am working with quite a large dataset (over 4 GB), which I imported into pandas. Quite a few columns in this dataset are simple True/False indicators, and naturally the most memory-efficient way to store these would be a bool dtype. However, these columns also contain some NaN values that I want to preserve. Right now, this leads to the columns having dtype float (with values 1.0, 0.0 and np.nan) or object, and both use far too much memory.
As an example:
import numpy as np
import pandas as pd

df = pd.DataFrame([[True, True, True], [False, False, False],
                   [np.nan, np.nan, np.nan]])
df[1] = df[1].astype(bool)   # note: the NaN is cast to True here
df[2] = df[2].astype(float)
print(df)
print(df.memory_usage(index=False, deep=True))
print(df.memory_usage(index=False, deep=False))
results in
       0      1    2
0   True   True  1.0
1  False  False  0.0
2    NaN   True  NaN
0    100
1      3
2     24
dtype: int64
0    24
1     3
2    24
dtype: int64
What would be the most efficient way to store these kinds of values, knowing they can only take on 3 different values: True, False and <undefined>?
Answer 1:
Use dtype int8 with the following encoding:
1 = True
0 = False
-1 = NaN
At 1 byte per value, this takes 4 times less memory than float32 and 8 times less than float64.
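The encoding above could be applied roughly as follows. This is only a minimal sketch: the helper names encode_tristate and decode_tristate are hypothetical, and it assumes the column holds nothing but True, False and NaN.

```python
import numpy as np
import pandas as pd


def encode_tristate(s: pd.Series) -> pd.Series:
    """Store True as 1, False as 0 and NaN as -1 in an int8 column."""
    # Go through float64 first so that True/False become 1.0/0.0 and NaN
    # stays NaN, then replace NaN with the -1 sentinel before the int cast.
    return s.astype("float64").fillna(-1).astype("int8")


def decode_tristate(s: pd.Series) -> pd.Series:
    """Turn the int8 encoding back into 1.0 / 0.0 / NaN."""
    # where() keeps values where the condition holds and puts NaN elsewhere.
    return s.astype("float64").where(s != -1)


raw = pd.Series([True, False, np.nan])
encoded = encode_tristate(raw)
print(encoded.dtype)                      # int8
print(encoded.memory_usage(index=False))  # 1 byte per row
print(decode_tristate(encoded))
```

The sentinel is not treated as missing by pandas, so methods such as isna() or mean() will see -1 as an ordinary value; decode first, or mask with encoded != -1, before computing on the column.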
Source: https://stackoverflow.com/questions/50877663/memory-efficient-way-to-store-bool-and-nan-values-in-pandas