I am looking for a Pythonic way to handle the following problem.
The pandas.get_dummies() function is great for creating dummies from a categorical column
You can generate the dummies dataframe from your raw data, isolate the dummy columns whose names contain a given atom, and then store their row-wise sum in that atom's column of the original dataframe.
df
Out[28]:
  label
0     A
1     B
2     C
3     D
4   A*C
5   C*D
dummies = pd.get_dummies(df['label'])
atom_col = [c for c in dummies.columns if '*' not in c]
for col in atom_col:
    df[col] = dummies[[c for c in dummies.columns if col in c]].sum(axis=1)
df
Out[32]:
  label  A  B  C  D
0     A  1  0  0  0
1     B  0  1  0  0
2     C  0  0  1  0
3     D  0  0  0  1
4   A*C  1  0  1  0
5   C*D  0  0  1  1
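
As a possible alternative to the loop above (a sketch, not part of the original answer), pandas also provides Series.str.get_dummies(), which splits each string on a separator and builds one indicator column per piece. The example below assumes the same `label` column and that `*` is the only separator; the names `atoms` and `result` are just illustrative.

```python
import pandas as pd

# Same example data as in the session above.
df = pd.DataFrame({"label": ["A", "B", "C", "D", "A*C", "C*D"]})

# Split each label on '*' and build one 0/1 indicator column per atom.
atoms = df["label"].str.get_dummies(sep="*")

# Attach the indicator columns next to the original label column.
result = df.join(atoms)
print(result)
```

Because this splits on the separator instead of testing `col in c`, an atom whose name is a substring of another (say `A` and `AB`) would not be counted twice, which the substring check in the loop could do.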