Create dummies from column with multiple values in pandas

说谎 2020-12-04 10:35

I am looking for a Pythonic way to handle the following problem.

The pandas.get_dummies() function is great for creating dummies from a categorical column ...

4 answers

轻奢々 (OP)

2020-12-04 11:17

I believe this question needs an updated answer: I recently came across MultiLabelBinarizer from sklearn.

The usage of this is as simple as...

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Instantiate the binarizer
mlb = MultiLabelBinarizer()

# Using OP's original data frame
df = pd.DataFrame(data=['A', 'B', 'C', 'D', 'A*C', 'C*D'], columns=["label"])

print(df)
  label
0     A
1     B
2     C
3     D
4   A*C
5   C*D

# Split each entry on "*" into a list of labels (df becomes a Series of lists)
df = df.apply(lambda x: x["label"].split("*"), axis=1)

print(df)
0       [A]
1       [B]
2       [C]
3       [D]
4    [A, C]
5    [C, D]
dtype: object

# Transform to a binary array
array_out = mlb.fit_transform(df)

print(array_out)
[[1 0 0 0]
 [0 1 0 0]
 [0 0 1 0]
 [0 0 0 1]
 [1 0 1 0]
 [0 0 1 1]]

# Convert back to a dataframe (unnecessary step in many cases)
df_out = pd.DataFrame(data=array_out, columns=mlb.classes_)

print(df_out)
   A  B  C  D
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  1  0
5  0  0  1  1
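
The steps above can be collected into one small helper. This is only a sketch: the function name multi_label_dummies and its parameters are hypothetical and not part of pandas or sklearn. It assumes every cell in the column is a string, and it keeps the original index so the indicator columns can be joined back onto the input frame.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


def multi_label_dummies(frame, column, sep="*"):
    """Return one 0/1 indicator column per distinct label in frame[column].

    Hypothetical helper for illustration; assumes every cell is a string.
    """
    # Split each cell on the literal separator, e.g. "A*C" -> ["A", "C"]
    label_lists = frame[column].map(lambda cell: cell.split(sep))
    mlb = MultiLabelBinarizer()
    indicators = mlb.fit_transform(label_lists)
    # Reuse the original index so the result lines up with the input rows
    return pd.DataFrame(indicators, columns=mlb.classes_, index=frame.index)


df = pd.DataFrame({"label": ["A", "B", "C", "D", "A*C", "C*D"]})
dummies = multi_label_dummies(df, "label")

# Attach the indicator columns next to the original label column
print(df.join(dummies))
```

Returning a DataFrame with the input's index is a design choice for convenience; if only the array is needed, mlb.fit_transform can be used directly as in the answer.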

This is also very fast: it took virtually no time (0.03 seconds) across 1,000 rows and 50K classes.
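
With that many classes, a dense 0/1 array is almost entirely zeros. The sketch below shows the sparse_output option of MultiLabelBinarizer as one possible way to keep memory use down. It is an assumption that this matters for your data, and the timing quoted above was not necessarily measured this way. The variable names are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Same label lists as in the example above
label_lists = [["A"], ["B"], ["C"], ["D"], ["A", "C"], ["C", "D"]]

# sparse_output=True makes fit_transform return a SciPy sparse matrix
mlb_sparse = MultiLabelBinarizer(sparse_output=True)
sparse_out = mlb_sparse.fit_transform(label_lists)

# Wrap the matrix in a DataFrame without converting it to a dense array;
# the resulting columns use pandas' sparse dtype
df_sparse = pd.DataFrame.sparse.from_spmatrix(
    sparse_out, columns=mlb_sparse.classes_
)

# Fraction of stored (non-zero) entries in the frame
print(df_sparse.sparse.density)
```

Calling sparse_out.toarray() gives back the same dense array as the default setting, so the two forms are interchangeable for small inputs.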
