sklearn.LabelEncoder with never seen before values

执笔经年 2020-11-27 10:37

If a sklearn.LabelEncoder has been fitted on a training set, it might break if it encounters new values when used on a test set.
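
A minimal sketch of the failure described above, assuming toy labels chosen only for illustration: the encoder is fitted on four values, then asked to transform a value it has never seen.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["a", "b", "c", "d"])      # stand-in for the training labels

try:
    le.transform(["a", "e"])      # "e" was never seen during fit
    failed = False
except ValueError as exc:
    # scikit-learn reports unseen labels by raising ValueError
    failed = True
    print(f"transform failed: {exc}")
```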

The only solution I c

12 Answers
  •  隐瞒了意图╮
    2020-11-27 11:13

    I ran into this problem and found two handy ways to encode categorical data from train and test sets, one using LabelEncoder and one without. New categories are replaced with some known category "c" (such as "other" or "missing"). The first method seems to be faster. Hope this helps.

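    import time

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    # "a" plays the role of the training column, "b" of the test column;
    # "e" in "b" is a category that never appears in "a".
    df = pd.DataFrame()
    df["a"] = ['a', 'b', 'c', 'd']
    df["b"] = ['a', 'b', 'e', 'd']

    # Method 1: LabelEncoder + map
    t = time.perf_counter()  # time.clock() was removed in Python 3.8
    le = LabelEncoder()
    suf = "_le"
    col = "a"
    df[col + suf] = le.fit_transform(df[col])
    # Mapping from each known class to its integer code
    dic = dict(zip(le.classes_, le.transform(le.classes_)))
    col = 'b'
    # Unseen values map to NaN, then fall back to the code of the known category "c"
    df[col + suf] = df[col].map(dic).fillna(dic["c"]).astype(int)
    print(time.perf_counter() - t)

    # ---
    # Method 2: pandas category

    t = time.perf_counter()
    df["d"] = df["a"].astype('category').cat.codes
    dic = df["a"].astype('category').cat.categories.tolist()
    # astype('category', categories=...) is no longer supported; use CategoricalDtype
    df['f'] = df['b'].astype(pd.CategoricalDtype(categories=dic)).fillna("c").cat.codes
    print(df.dtypes)
    print(time.perf_counter() - t)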
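
The LabelEncoder + map approach could also be wrapped in a small helper. This is only a sketch: the name `encode_with_fallback` and its parameters are hypothetical, not part of scikit-learn or pandas, and it assumes the fallback category is present in the training data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_with_fallback(train, test, fallback):
    """Fit on train and encode both series; unseen test values get the fallback's code.

    Hypothetical helper for illustration, not a library function.
    """
    le = LabelEncoder()
    train_codes = pd.Series(le.fit_transform(train), index=train.index)
    # classes_ is sorted, so each class's code is its position in classes_
    mapping = dict(zip(le.classes_, range(len(le.classes_))))
    if fallback not in mapping:
        raise ValueError(f"fallback {fallback!r} is not a category in the training data")
    # Values missing from the mapping become NaN, then take the fallback's code
    test_codes = test.map(mapping).fillna(mapping[fallback]).astype(int)
    return train_codes, test_codes


train = pd.Series(["a", "b", "c", "d"])
test = pd.Series(["a", "b", "e", "d"])   # "e" is unseen
train_codes, test_codes = encode_with_fallback(train, test, fallback="c")
print(test_codes.tolist())  # [0, 1, 2, 3]: "e" is encoded as "c" (code 2)
```

Raising when the fallback is absent from the training data is a design choice here; a reserved extra code for "unknown" would be another option.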
    
