How to one-hot encode variable-length features?

时光取名叫无心 2020-11-27 07:42

Given a list of variable-length features:

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2']
]
2 answers
  •  死守一世寂寞
    2020-11-27 08:41

    You can use scikit-learn's MultiLabelBinarizer, which is designed for exactly this kind of input: each sample is a collection of labels, and the collections may differ in length.

    Code for your example:

    from sklearn.preprocessing import MultiLabelBinarizer

    features = [
        ['f1', 'f2', 'f3'],
        ['f2', 'f4', 'f5', 'f6'],
        ['f1', 'f2'],
    ]

    # Learn the set of distinct features and encode each sample as a 0/1 row
    mlb = MultiLabelBinarizer()
    new_features = mlb.fit_transform(features)
    

    Output:

    array([[1, 1, 1, 0, 0, 0],
           [0, 1, 0, 1, 1, 1],
           [1, 1, 0, 0, 0, 0]])
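
    To see which column corresponds to which feature, the fitted binarizer's classes_ attribute can be inspected, and inverse_transform maps rows back to feature sets. A minimal sketch using the same data (the names encoded, column_names and decoded are only illustrative):

```python
from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(features)

# classes_ holds the feature name behind each output column, in sorted order
column_names = list(mlb.classes_)

# inverse_transform turns each 0/1 row back into a tuple of feature names
decoded = mlb.inverse_transform(encoded)

print(column_names)
print(decoded)
```

    Because the columns follow the sorted order of the feature names, the first column of the output above is f1 and the last is f6.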
    

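    The encoded matrix can then be passed on to other utilities, such as those in sklearn.feature_selection. Note that MultiLabelBinarizer is meant for target labels and its fit and transform methods take a single argument, so using it as a step inside a Pipeline generally needs a thin wrapper transformer.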
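
    One caveat for pipeline use: as far as I know, MultiLabelBinarizer's fit and fit_transform accept only one argument, while Pipeline calls its steps with both X and y, so dropping it in directly tends to fail. A minimal sketch of a workaround, assuming a small wrapper class (MultiHotEncoder is a hypothetical name, not part of scikit-learn) and using VarianceThreshold only as a stand-in for a feature_selection step:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer


class MultiHotEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper exposing MultiLabelBinarizer with an (X, y) signature."""

    def fit(self, X, y=None):
        # Delegate to MultiLabelBinarizer, ignoring y
        self.mlb_ = MultiLabelBinarizer().fit(X)
        return self

    def transform(self, X):
        return self.mlb_.transform(X)


features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

pipe = Pipeline([
    ("encode", MultiHotEncoder()),
    # Drops columns that are constant across all samples
    ("select", VarianceThreshold()),
])

selected = pipe.fit_transform(features)
print(selected.shape)
```

    In this toy data f2 appears in every sample, so its column is constant and the selection step removes it; the remaining columns are the other five features.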
