Question:

I have a dataframe where one column is a list of groups each of my users belongs to. Something like:

    index  groups
    0      ['a','b','c']
    1      ['c']
    2      ['b','c','e']
    3      ['a','c']
    4      ['b','e']

What I would like to do is create a set of dummy columns identifying which groups each user belongs to, so that I can run some analyses:

    index  a   b   c   d   e
    0      1   1   1   0   0
    1      0   0   1   0   0
    2      0   1   1   0   1
    3      1   0   1   0   0
    4      0   1   0   0   1

pd.get_dummies(df['groups']) won't work because that just returns a column for each different list in my column.

The solution needs to be efficient as the dataframe will contain 500,000+ rows. Any advice would be appreciated!

Answer 1:

Using s for your df['groups']:

    In [21]: s = pd.Series({0: ['a', 'b', 'c'], 1: ['c'], 2: ['b', 'c', 'e'], 3: ['a', 'c'], 4: ['b', 'e']})

    In [22]: s
    Out[22]:
    0    [a, b, c]
    1          [c]
    2    [b, c, e]
    3       [a, c]
    4       [b, e]
    dtype: object

This is a possible solution:

    In [23]: pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()
    Out[23]:
       a  b  c  e
    0  1  1  1  0
    1  0  0  1  0
    2  0  1  1  1
    3  1  0  1  0
    4  0  1  0  1

The logic of this is:

.apply(pd.Series) converts the series of lists to a dataframe
.stack() puts everything in one column again (creating a multi-level index)
pd.get_dummies( ) creates the dummies
.groupby(level=0).sum() re-merges the rows that belong to the same original row, by summing over the second level and keeping only the original level (level=0). Older pandas versions spelled this .sum(level=0); that keyword was removed in pandas 2.0.
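The four steps can be traced one at a time. This is only a sketch on the sample data: the intermediate names (wide, stacked, dummies, result) are made up for illustration, the last step is written as groupby(level=0).sum() on the assumption that a recent pandas is in use (where sum(level=0) is unavailable), and the final astype(int) is added only so the output dtype is the same across versions.

```python
import pandas as pd

s = pd.Series({0: ['a', 'b', 'c'], 1: ['c'], 2: ['b', 'c', 'e'],
               3: ['a', 'c'], 4: ['b', 'e']})

# Step 1: one column per list position; shorter lists are padded with NaN.
wide = s.apply(pd.Series)

# Step 2: back to a single column, indexed by (original row, list position).
stacked = wide.stack()

# Step 3: one indicator column per distinct group; NaN entries get no dummy.
dummies = pd.get_dummies(stacked)

# Step 4: collapse the list positions so each original row appears once.
result = dummies.groupby(level=0).sum().astype(int)
print(result)
```

Printing wide, stacked and dummies in turn shows how the index gains and then loses its second level.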

A slightly different equivalent is pd.get_dummies(s.apply(pd.Series), prefix='', prefix_sep='').T.groupby(level=0).sum().T (originally written with .sum(level=0, axis=1), which pandas 2.0 removed).

Whether this will be efficient enough, I don't know. In any case, if performance is important, storing lists in a dataframe is not a very good idea.
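One way to avoid keeping lists around is a long format with one row per (user, group) pair. The sketch below is not part of the answer above: it assumes Series.explode is available (pandas 0.25 or later), and the names long and dummies are illustrative. No timing claim is made here; it would need to be measured on the real 500,000-row data.

```python
import pandas as pd

s = pd.Series({0: ['a', 'b', 'c'], 1: ['c'], 2: ['b', 'c', 'e'],
               3: ['a', 'c'], 4: ['b', 'e']})

# Long format: one row per (user, group) pair; the user index repeats.
long = s.explode()

# Dummies per pair, then sum the pairs back into one row per user.
dummies = pd.get_dummies(long).groupby(level=0).sum().astype(int)
print(dummies)
```

If the data can be stored in the long form from the start, the explode step disappears entirely.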

Answer 2:

Even though this question was already answered, I have a faster solution:

df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')

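And, in case you have empty groups or NaN, you could filter those rows out first (note that the groups column has to be selected before calling apply):

df.loc[df.groups.str.len() > 0, 'groups'].apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')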
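A small sketch of the filtering step on made-up data with one empty list and one NaN. The sample frame, the names mask, dummies and full, the use of fillna(0).astype(int) in place of downcast='infer', and the final reindex (which puts the filtered-out rows back as all zeros) are assumptions added for illustration, not part of the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: row 1 has an empty list, row 2 is missing.
df = pd.DataFrame({'groups': [['a', 'b'], [], np.nan, ['b', 'c']]})

# str.len() gives the list length, and NaN for the missing row,
# so the comparison is False for both the empty and the missing row.
mask = df['groups'].str.len() > 0

# Select the 'groups' column for the kept rows, then build the dummies.
dummies = (
    df.loc[mask, 'groups']
    .apply(lambda x: pd.Series(1, index=x))
    .fillna(0)
    .astype(int)
)

# Optional: restore the dropped rows as all-zero rows.
full = dummies.reindex(df.index, fill_value=0)
print(full)
```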

How it works

Inside the lambda, x is your list, for example ['a', 'b', 'c']. So pd.Series will be as follows:

    In [2]: pd.Series([1, 1, 1], index=['a', 'b', 'c'])
    Out[2]:
    a    1
    b    1
    c    1
    dtype: int64

When all the pd.Series objects are combined, they become a pd.DataFrame and their index labels become columns; a label missing from a given Series shows up as NaN in that row, as you can see next:

    In [4]: a = pd.Series([1, 1, 1], index=['a', 'b', 'c'])

    In [5]: b = pd.Series([1, 1, 1], index=['a', 'b', 'd'])

    In [6]: pd.DataFrame([a, b])
    Out[6]:
         a    b    c    d
    0  1.0  1.0  1.0  NaN
    1  1.0  1.0  NaN  1.0

Now fillna fills those NaN with 0:

    In [7]: pd.DataFrame([a, b]).fillna(0)
    Out[7]:
         a    b    c    d
    0  1.0  1.0  1.0  0.0
    1  1.0  1.0  0.0  1.0

And downcast='infer' is to downcast from float to int:

    In [11]: pd.DataFrame([a, b]).fillna(0, downcast='infer')
    Out[11]:
       a  b  c  d
    0  1  1  1  0
    1  1  1  0  1
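Recent pandas releases report the downcast keyword of fillna as deprecated. An explicit cast should give the same integer result without relying on it. This is a sketch of that alternative, not the answer's original code, and the name result is illustrative.

```python
import pandas as pd

a = pd.Series([1, 1, 1], index=['a', 'b', 'c'])
b = pd.Series([1, 1, 1], index=['a', 'b', 'd'])

# fillna(0) leaves the columns as float; astype(int) casts them explicitly.
result = pd.DataFrame([a, b]).fillna(0).astype(int)
print(result)
```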

P.S.: Using .fillna(0, downcast='infer') is not required.

Source: Pandas convert a column of list to dummies
