Problem
Incoming data is a list of 0+ categories:
#input data frame
import pandas as pd
df = pd.DataFrame({'categories':(list('ABC'), list('BC'), list('A'))})
  categories
0  [A, B, C]
1     [B, C]
2        [A]
I would like to convert this to a DataFrame with one column per category and a 0/1 in each cell:
#desired output
   A  B  C
0  1  1  1
1  0  1  1
2  1  0  0
Attempt
OneHotEncoder and LabelEncoder get stuck because they don't handle lists in cells. The desired result is currently achieved with nested for loops:
#get unique categories ['A','B','C']
import numpy as np
categories = np.unique(np.concatenate(df['categories']))

#make empty data frame
binary_df = pd.DataFrame(columns=[c for c in categories],
                         index=df.index)
print(binary_df)
     A    B    C
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN
#fill data frame
for i in binary_df.index:
    for c in categories:
        binary_df.loc[i, c] = 1 if c in df.loc[i, 'categories'] else 0
My concern is that these loops make this an extremely inefficient way to handle a large data set (tens of categories, tens of thousands of rows or more).
Is there a way to achieve the result with built-in Numpy/Scikit functions?
Answer 1:
Solution:
pd.get_dummies(pd.DataFrame(df['categories'].tolist()).stack()).sum(level=0)
Out[98]:
   A  B  C
0  1  1  1
1  0  1  1
2  1  0  0
How it works:
pd.DataFrame(df['categories'].tolist())
Out[100]:
   0     1     2
0  A     B     C
1  B     C  None
2  A  None  None
gets the series of lists turned into a dataframe.
pd.DataFrame(df['categories'].tolist()).stack()
Out[101]:
0  0    A
   1    B
   2    C
1  0    B
   1    C
2  0    A
dtype: object
prepares for get_dummies while preserving the indices for later.
pd.get_dummies(pd.DataFrame(df['categories'].tolist()).stack())
Out[102]:
     A  B  C
0 0  1  0  0
  1  0  1  0
  2  0  0  1
1 0  0  1  0
  1  0  0  1
2 0  1  0  0
is almost there, but the inner index level is just the position of each value inside the original list, which is not needed.
So the solution above sums over level 0 of the MultiIndex to collapse it away.
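Note: newer pandas versions no longer accept the level keyword in sum (it was removed in pandas 2.0). On such versions the same level-0 reduction can be written with groupby; a minimal equivalent sketch of the one-liner above:

# same reduction on newer pandas: group the stacked dummies by the
# outer index level and sum, instead of sum(level=0)
out = (pd.get_dummies(pd.DataFrame(df['categories'].tolist()).stack())
         .groupby(level=0)
         .sum())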
Edit:
%timeit results:
On the original dataframe:
df = pd.DataFrame({'categories':(list('ABC'), list('BC'), list('A'))})
Solution provided in question:
100 loops, best of 3: 3.24 ms per loop
This solution:
100 loops, best of 3: 2.29 ms per loop
300 rows
df = pd.concat(100*[df]).reset_index(drop=True)
Solution provided in question:
1 loop, best of 3: 252 ms per loop
This solution:
100 loops, best of 3: 2.45 ms per loop
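Since the question also asks about scikit functions: scikit-learn's MultiLabelBinarizer is designed for exactly this list-of-labels input and produces the same 0/1 matrix. A minimal sketch, assuming scikit-learn is installed (not part of the timings above):

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# fit_transform takes an iterable of label lists and returns a 0/1 array;
# mlb.classes_ holds the resulting column order (['A', 'B', 'C'] here)
binary_df = pd.DataFrame(mlb.fit_transform(df['categories']),
                         columns=mlb.classes_,
                         index=df.index)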
Answer 2:
You can try appending rows one at a time: each row starts as a dictionary that maps every category to 0, which is then updated to 1 for the categories present in that row of the input dataframe.
#input data frame
df = pd.DataFrame({'categories':(list('ABC'), list('BC'), list('A'))})
print(df)
Output:
  categories
0  [A, B, C]
1     [B, C]
2        [A]
For the output dataframe:
categories = np.unique(np.concatenate(df['categories']))

#make empty data frame
binary_df = pd.DataFrame(columns=[c for c in categories],
                         index=df.index).dropna()

for index, row in df.iterrows():
    row_elements = row['categories']
    default_row = {item: 0 for item in categories}
    # update corresponding row values by updating the dictionary
    for i in row_elements:
        default_row[i] = 1
    binary_df = binary_df.append(default_row, ignore_index=True)

print(binary_df)
Output:
     A    B    C
0  1.0  1.0  1.0
1  0.0  1.0  1.0
2  1.0  0.0  0.0
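Note that DataFrame.append was removed in pandas 2.0, and appending one row at a time is slow on large frames. A minimal sketch of the same idea that collects the per-row dictionaries first and builds the frame in one call (an adaptation, not the code from the answer above):

# build one {category: 0/1} dict per input row, then construct the frame once
rows = []
for row_elements in df['categories']:
    default_row = {item: 0 for item in categories}
    for i in row_elements:
        default_row[i] = 1
    rows.append(default_row)

binary_df = pd.DataFrame(rows, index=df.index)
print(binary_df)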
Source: https://stackoverflow.com/questions/44657603/pandas-column-of-lists-to-separate-columns