Create dummies from column with multiple values in pandas

说谎 2020-12-04 10:35

I am looking for a pythonic way to handle the following problem.

The pandas.get_dummies() method is great to create dummies from a categorical column

4 answers
  •  轻奢々 (original poster)
    2020-12-04 11:17

    I believe this question needs an updated answer after coming across the MultiLabelBinarizer from sklearn.

    The usage of this is as simple as...

    import pandas as pd
    from sklearn.preprocessing import MultiLabelBinarizer
    
    # Instantiate the binarizer
    mlb = MultiLabelBinarizer()
    
    # Using OP's original data frame
    df = pd.DataFrame(data=['A', 'B', 'C', 'D', 'A*C', 'C*D'], columns=["label"])
    
    print(df)
      label
    0     A
    1     B
    2     C
    3     D
    4   A*C
    5   C*D
    
    # Convert to a list of labels
    df = df.apply(lambda x: x["label"].split("*"), axis=1)
    
    print(df)
    0       [A]
    1       [B]
    2       [C]
    3       [D]
    4    [A, C]
    5    [C, D]
    dtype: object
    
    # Transform to a binary array
    array_out = mlb.fit_transform(df)
    
    print(array_out)
    [[1 0 0 0]
     [0 1 0 0]
     [0 0 1 0]
     [0 0 0 1]
     [1 0 1 0]
     [0 0 1 1]]
    
    # Convert back to a dataframe (unnecessary step in many cases)
    df_out = pd.DataFrame(data=array_out, columns=mlb.classes_)
    
    print(df_out)
       A  B  C  D
    0  1  0  0  0
    1  0  1  0  0
    2  0  0  1  0
    3  0  0  0  1
    4  1  0  1  0
    5  0  0  1  1
    

    This is also very fast: it took virtually no time (0.03 seconds) across 1,000 rows and 50K classes.
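
To make the transformation itself concrete, here is a dependency-free sketch of the multi-hot encoding that the steps above produce. The helper name `multi_hot` is hypothetical and is not part of sklearn or pandas; sorting the labels to fix the column order is an assumption chosen to mirror the output shown above, and the real MultiLabelBinarizer may differ in other details.

```python
def multi_hot(label_lists):
    """Return (classes, rows): sorted unique labels and one 0/1 row per input list."""
    # Collect every distinct label and sort them to get a stable column order.
    classes = sorted({label for labels in label_lists for label in labels})
    # Map each label to its column position.
    index = {label: i for i, label in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1  # mark the label as present in this row
        rows.append(row)
    return classes, rows


# Same values as the "label" column in the answer, split on "*".
raw = ["A", "B", "C", "D", "A*C", "C*D"]
classes, rows = multi_hot([value.split("*") for value in raw])

print(classes)
for row in rows:
    print(row)
```

Each row has a 1 in the column of every label it contains, so "A*C" maps to a row with 1s under both A and C. For real data the sklearn route in the answer is likely the better choice; this sketch is only meant to show the logic.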
