If a sklearn LabelEncoder has been fitted on a training set, it breaks when the test set contains new values: transform raises a ValueError for any label it did not see during fit.
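For example, something like this (a minimal sketch with a toy list of labels) fails on the unseen value 'e':

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['a', 'b', 'c', 'd'])   # "train" labels
le.transform(['a', 'd'])       # works, both labels were seen during fit
le.transform(['a', 'e'])       # raises ValueError: y contains previously unseen labels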
I was trying to deal with this problem and found two handy ways to encode categorical data from train and test sets, with and without using LabelEncoder. New categories are filled with some known category "c" (like "other" or "missing"). The first method seems to work faster. Hope this helps.
import pandas as pd
import time
df = pd.DataFrame()
df["a"] = ['a', 'b', 'c', 'd']   # "train" column
df["b"] = ['a', 'b', 'e', 'd']   # "test" column; 'e' never appears in "a"
#--- Method 1: LabelEncoder + map
from sklearn.preprocessing import LabelEncoder
t = time.perf_counter()
le = LabelEncoder()
suf="_le"
col="a"
df[col+suf] = le.fit_transform(df[col])
dic = dict(zip(le.classes_, le.transform(le.classes_)))
col='b'
df[col+suf]=df[col].map(dic).fillna(dic["c"]).astype(int)
print(time.perf_counter() - t)
#--- Method 2: pandas category
t = time.perf_counter()
df["d"] = df["a"].astype('category').cat.codes
dic =df["a"].astype('category').cat.categories.tolist()
df['f']=df['b'].astype('category',categories=dic).fillna("c").cat.codes
df.dtypes
print(time.perf_counter() - t)
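With the toy data above, both methods map the unseen 'e' to the code of 'c' (2), so b_le and f both come out as [0, 1, 2, 3]. If you need the second approach in several places, it can be wrapped in a small helper; this is just a sketch, and the function name encode_with_fallback is my own:

import pandas as pd

def encode_with_fallback(train_col, test_col, fallback):
    # Build the category list from the "train" column, encode the "test" column
    # with it, and map any unseen value to the code of `fallback`
    # (which must be one of the training categories).
    cats = train_col.astype('category').cat.categories
    dtype = pd.CategoricalDtype(categories=cats)
    return test_col.astype(dtype).fillna(fallback).cat.codes

# e.g. df['f'] = encode_with_fallback(df['a'], df['b'], 'c')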