Question
I have a dataframe with several string columns that I want to convert to categorical data so that I can run some models and extract important features from them.
However, because of the number of unique values, the one-hot encoded data expands into a large number of columns, which causes performance issues.
To combat this, I'm experimenting with the sparse=True parameter of get_dummies:
test1 = pd.get_dummies(X.loc[:, ['col1', 'col2', 'col3', 'col4']].head(10000))
test2 = pd.get_dummies(X.loc[:, ['col1', 'col2', 'col3', 'col4']].head(10000), sparse=True)
However, when I check info() on the two comparison objects, they take up the same amount of memory; sparse=True doesn't seem to use any less space. Why is that?
test1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 537293 to 752152
Columns: 2253 entries,...
dtypes: uint8(2253)
memory usage: 21.6 MB
test2.info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
Int64Index: 10000 entries, 537293 to 752152
Columns: 2253 entries, ...
dtypes: uint8(2253)
memory usage: 21.9 MB
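One way to double-check this comparison more directly than info() is to sum the byte counts and inspect the frame's density. A minimal sketch, assuming the test1/test2 frames above and the older pandas (pre-1.0) that still exposes SparseDataFrame:
# Compare total bytes directly rather than reading the info() summary.
print(test1.memory_usage(deep=True).sum())   # dense dummies
print(test2.memory_usage(deep=True).sum())   # "sparse" dummies, same size here

# density is the fraction of values actually stored; a value near 1.0
# means nothing was compressed away.
print(test2.density)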
Answer 1:
I looked at the pandas get_dummies source but could not spot an error so far. Here is a small experiment I did below (the first half reproduces your problem with toy data).
In [1]: import numpy as np
...: import pandas as pd
...:
...: a = ['a', 'b'] * 100000
...: A = ['A', 'B'] * 100000
...:
...: df1 = pd.DataFrame({'a': a, 'A': A})
...: df1 = pd.get_dummies(df1)
...: df1.info()
...:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 4 columns):
A_A 200000 non-null uint8
A_B 200000 non-null uint8
a_a 200000 non-null uint8
a_b 200000 non-null uint8
dtypes: uint8(4)
memory usage: 781.3 KB
In [2]: df2 = pd.DataFrame({'a': a, 'A': A})
...: df2 = pd.get_dummies(df2, sparse=True)
...: df2.info()
...:
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 4 columns):
A_A 200000 non-null uint8
A_B 200000 non-null uint8
a_a 200000 non-null uint8
a_b 200000 non-null uint8
dtypes: uint8(4)
memory usage: 781.3 KB
So far the same result as yours (df1 and df2 are the same size), but if I explicitly convert df2 to sparse using to_sparse with fill_value=0:
In [3]: df2 = df2.to_sparse(fill_value=0)
...: df2.info()
...:
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 4 columns):
A_A 200000 non-null uint8
A_B 200000 non-null uint8
a_a 200000 non-null uint8
a_b 200000 non-null uint8
dtypes: uint8(4)
memory usage: 390.7 KB
Now the memory usage is halved, since half of the values are 0.
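The fill value is presumably the key: to_sparse() defaults to treating NaN as the fill value, and the dummy columns contain no NaN at all, so with the default nothing is left out. A small sketch with the same toy data (df3 is a hypothetical name; density is the fraction of values actually stored):
df3 = pd.get_dummies(pd.DataFrame({'a': a, 'A': A}))
df3_nan = df3.to_sparse()               # default fill value: NaN -> density ~1.0
df3_zero = df3.to_sparse(fill_value=0)  # fill value 0 -> density ~0.5
print(df3_nan.density, df3_zero.density)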
In conclusion, I'm not sure why get_dummies(sparse=True) does not compress the dataframe even though it returns a SparseDataFrame, but the explicit to_sparse(fill_value=0) call works as a workaround. A related discussion is ongoing on GitHub ("get_dummies with sparse doesn't convert numeric to sparse"), but the conclusion still seems to be up in the air.
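For what it's worth, on newer pandas (1.0+) SparseDataFrame and to_sparse have been removed entirely; get_dummies(sparse=True) instead returns an ordinary DataFrame whose columns use a sparse extension dtype, and the saving shows up directly in memory_usage. A minimal sketch (the exact column dtype, e.g. Sparse[uint8, 0] or Sparse[bool, False], depends on the pandas version):
import pandas as pd

a = ['a', 'b'] * 100000
df = pd.DataFrame({'a': a, 'A': ['A', 'B'] * 100000})

dense = pd.get_dummies(df)
sparse = pd.get_dummies(df, sparse=True)

print(dense.memory_usage(deep=True).sum())   # full dense byte count
print(sparse.memory_usage(deep=True).sum())  # substantially smaller
print(sparse.dtypes)                         # sparse extension dtype per column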
Source: https://stackoverflow.com/questions/51709377/pd-get-dummies-dataframe-same-size-when-sparse-true-as-when-sparse-false