Label encoding across multiple columns in scikit-learn

礼貌的吻别 2020-11-22 09:02

I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to a

22 answers
  •  爱一瞬间的悲伤
    2020-11-22 09:44

    You can do this easily, though:

    df.apply(LabelEncoder().fit_transform)
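
    To make the one-liner concrete, here is a minimal sketch on a toy DataFrame. The column names and values are hypothetical, not taken from the question. Because fit_transform refits on every call, each column is encoded independently with its own 0..n-1 codes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, not from the original question.
df = pd.DataFrame({
    "color": ["red", "green", "red", "blue"],
    "size": ["S", "M", "L", "M"],
})

# apply() calls fit_transform once per column; each call refits the encoder,
# so the codes are assigned per column (classes are sorted alphabetically).
encoded = df.apply(LabelEncoder().fit_transform)
print(encoded)
```

    Note that the single encoder instance only remembers the last column it was fitted on, so this form cannot be used later for transform or inverse_transform on the other columns.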
    

    EDIT2:

    As of scikit-learn 0.20, the recommended way is

    OneHotEncoder().fit_transform(df)
    

    as the OneHotEncoder now supports string input. Applying OneHotEncoder only to certain columns is possible with the ColumnTransformer.
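
    As a rough illustration of the ColumnTransformer route, the sketch below one-hot encodes two string columns and passes a numeric column through unchanged. The DataFrame, its column names, and the transformer name "onehot" are assumptions made up for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two categorical columns and one numeric column.
df = pd.DataFrame({
    "color": ["red", "green", "red", "blue"],
    "size": ["S", "M", "L", "M"],
    "price": [1.0, 2.5, 3.0, 4.2],
})

ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), ["color", "size"])],
    remainder="passthrough",  # keep the columns not listed above as they are
    sparse_threshold=0,       # always return a dense NumPy array
)

# 3 indicator columns for color + 3 for size + the untouched price column.
result = ct.fit_transform(df)
print(result.shape)
```

    One-hot encoding produces indicator columns rather than integer codes. If integer codes are what you need, OrdinalEncoder (also added in 0.20) can be used in the same position and accepts several columns at once.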

    EDIT:

    Since this answer was written over a year ago and has generated many upvotes (including a bounty), I should probably extend it further.

    For inverse_transform and transform, you need a small hack: keep a separate fitted LabelEncoder for each column.

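    from collections import defaultdict
    from sklearn.preprocessing import LabelEncoder
    # One LabelEncoder per column, created on first access
    d = defaultdict(LabelEncoder)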
    

    With this, each column's LabelEncoder is retained in a dictionary keyed by column name.

    # Fit one encoder per column and encode the data
    fit = df.apply(lambda x: d[x.name].fit_transform(x))
    
    # Invert the encoding to recover the original labels
    fit.apply(lambda x: d[x.name].inverse_transform(x))
    
    # Use the fitted encoders in the dictionary to encode future data
    df.apply(lambda x: d[x.name].transform(x))
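
    Put together, the three steps above could look like the following round trip. This is only a sketch: the train and new DataFrames and the variable names are hypothetical, and it assumes the new data has the same columns as the data the encoders were fitted on.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training data and later-arriving data with the same columns.
train = pd.DataFrame({
    "color": ["red", "green", "red", "blue"],
    "size": ["S", "M", "L", "M"],
})
new = pd.DataFrame({
    "color": ["blue", "red"],
    "size": ["L", "S"],
})

# One LabelEncoder per column name, created lazily on first lookup.
encoders = defaultdict(LabelEncoder)

# Fit each column's encoder and encode the training data.
encoded = train.apply(lambda col: encoders[col.name].fit_transform(col))

# Map the integer codes back to the original string labels.
decoded = encoded.apply(lambda col: encoders[col.name].inverse_transform(col))

# Reuse the already fitted encoders on new data (no refitting).
new_encoded = new.apply(lambda col: encoders[col.name].transform(col))
print(new_encoded)
```

    Keep in mind that transform raises a ValueError if the new data contains a label the encoder did not see during fitting.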
    
