I am looking for a Pythonic way to handle the following problem.
The pandas.get_dummies() method is great for creating dummies from a categorical column
After coming across the MultiLabelBinarizer from sklearn, I believe this question needs an updated answer.
Usage is as simple as:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Instantiate the binarizer
mlb = MultiLabelBinarizer()
# Using OP's original data frame
df = pd.DataFrame(data=['A', 'B', 'C', 'D', 'A*C', 'C*D'], columns=["label"])
print(df)
  label
0     A
1     B
2     C
3     D
4   A*C
5   C*D
# Split each entry on "*" into a list of labels (df becomes a Series of lists)
df = df.apply(lambda x: x["label"].split("*"), axis=1)
print(df)
0       [A]
1       [B]
2       [C]
3       [D]
4    [A, C]
5    [C, D]
dtype: object
# Transform to a binary array
array_out = mlb.fit_transform(df)
print(array_out)
[[1 0 0 0]
 [0 1 0 0]
 [0 0 1 0]
 [0 0 0 1]
 [1 0 1 0]
 [0 0 1 1]]
# Convert back to a dataframe (unnecessary step in many cases)
df_out = pd.DataFrame(data=array_out, columns=mlb.classes_)
print(df_out)
   A  B  C  D
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  1  0
5  0  0  1  1
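For reference, the steps above can be condensed into one self-contained sketch that keeps the original column and attaches the indicator columns next to it. The names label_lists, indicators and result are my own, and the use of Series.str.split and DataFrame.join is just one way to wire it together, not the only one.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({"label": ["A", "B", "C", "D", "A*C", "C*D"]})

# Split each string on "*" to get one list of labels per row
label_lists = df["label"].str.split("*")

mlb = MultiLabelBinarizer()
indicators = pd.DataFrame(
    mlb.fit_transform(label_lists),
    columns=mlb.classes_,
    index=df.index,  # keep row alignment with the original frame
)

# Attach the indicator columns next to the original "label" column
result = df.join(indicators)
print(result)
```

Passing index=df.index matters if the original frame does not have a default 0..n-1 index; otherwise the join would misalign rows.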
This is also very fast: it took virtually no time (0.03 seconds) on 1,000 rows and 50K classes.
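If you want to check the timing on your own machine, a rough sketch along these lines should do. The data here is synthetic (the class names "c0", "c1", ... and the 1 to 5 labels per row are arbitrary assumptions, not my original test data), and I pass sparse_output=True so the 1000 x 50000 result is not materialized as a dense array. Your numbers will vary with hardware and library versions.

```python
import random
import time

from sklearn.preprocessing import MultiLabelBinarizer

random.seed(0)  # fixed seed so the synthetic data is reproducible

n_rows, n_classes = 1_000, 50_000
# Hypothetical label names, purely for illustration
all_classes = [f"c{i}" for i in range(n_classes)]
# Each row gets between 1 and 5 distinct labels (an arbitrary choice)
rows = [random.sample(all_classes, random.randint(1, 5)) for _ in range(n_rows)]

# classes= fixes the column set up front; sparse_output=True returns a
# SciPy sparse matrix instead of a dense NumPy array
mlb = MultiLabelBinarizer(classes=all_classes, sparse_output=True)

start = time.perf_counter()
indicator = mlb.fit_transform(rows)
elapsed = time.perf_counter() - start

print(indicator.shape)
print(f"{elapsed:.3f} s")
```

Call indicator.toarray() if you need a dense array, keeping in mind that it will hold 50 million entries at this size.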