Why does sklearn preprocessing LabelEncoder inverse_transform apply from only one column?

Submitted by 大憨熊 on 2019-12-22 01:30:16

Question


I have a random forest model built with sklearn. The model is built in one file, and I have a second file where I use joblib to load the model and apply it to new data. The data has categorical fields that are converted via sklearn's preprocessing LabelEncoder.fit_transform. Once the prediction is made, I am attempting to reverse this conversion with LabelEncoder.inverse_transform.

Here is the code:

 #transform the categorical rf inputs
 df["method"] = le.fit_transform(df["method"])
 df["vendor"] = le.fit_transform(df["vendor"])
 df["type"] = le.fit_transform(df["type"])
 df["name"] = le.fit_transform(df["name"])
 dups["address"] = le.fit_transform(df["address"])

 #designate inputs for rf model
 inputs = ["amt","vendor","type","name","address","method"]

 #load rf model and run it on new data
 from sklearn.externals import joblib
 rf = joblib.load('rf.pkl')
 predict = rf.predict(df[inputs])

 #reverse LabelEncoder fit_transform
 df["method"] = le.inverse_transform(df["method"])
 df["vendor"] = le.inverse_transform(df["vendor"])
 df["type"] = le.inverse_transform(df["type"])
 df["name"] = le.inverse_transform(df["name"])
 df["address"] = le.inverse_transform(df["address"])

 #convert target to numeric to make it play nice with SQL Server
 predict = pd.to_numeric(predict)

 #add target field to df
 df["prediction"] = predict

 #write results to SQL Server table
 import sqlalchemy
 engine = sqlalchemy.create_engine("mssql+pyodbc://<username>:<password>@UserDSN")
 df.to_sql('TABLE_NAME', engine, schema='SCHEMANAME', if_exists='replace', index=False)

Without the inverse_transform piece, the results are as expected: numeric codes in place of categorical values. With the inverse_transform piece, the results are odd: the categorical values corresponding to the "address" field are returned for all categorical fields.

So if 1600 Pennsylvania Avenue is encoded as the number 1, all categorical values encoded as the number 1 (regardless of field) now return 1600 Pennsylvania Avenue. Why is inverse_transform picking one column from which to reverse all fit_transform codes?


Answer 1:


This is the expected behaviour.

When you call le.fit_transform(), the internal state of the LabelEncoder (the learned classes_) is re-initialised: the le object is re-fitted on the values of whichever column you pass it.

In the code above, you use the same object to transform every column, and the last column you fit it on is address. The encoder therefore forgets everything from the earlier fit() (or fit_transform()) calls and learns only the most recent data. So when you call inverse_transform(), it can only map codes back to address values. Hope that is clear.
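You can see the overwriting directly. The snippet below is a minimal, standalone sketch (not from the original question) that refits one encoder on two different columns' worth of values:

 from sklearn.preprocessing import LabelEncoder

 le = LabelEncoder()
 le.fit_transform(["cash", "credit"])  #classes_ is now ['cash', 'credit']
 le.fit_transform(["1600 Pennsylvania Avenue", "10 Downing Street"])  #previous classes are discarded
 print(le.classes_)  #only the address values remain
 print(le.inverse_transform([0, 1]))  #codes now map back to addresses, not payment methods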

To encode all columns, you need to initialize different objects, one for each column. Something like below:

 df["method"] = le_method.fit_transform(df["method"])
 df["vendor"] = le_vendor.fit_transform(df["vendor"])
 df["type"] = le_type.fit_transform(df["type"])
 df["name"] = le_name.fit_transform(df["name"])
 df["address"] = le_address.fit_transform(df["address"])

and then call inverse_transform() on the appropriate encoder.
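For completeness, the reverse step from the question would then use the matching encoder for each column; here is a sketch using the per-column encoder names assumed above:

 #decode each column with the encoder that was fitted on it
 df["method"] = le_method.inverse_transform(df["method"])
 df["vendor"] = le_vendor.inverse_transform(df["vendor"])
 df["type"] = le_type.inverse_transform(df["type"])
 df["name"] = le_name.inverse_transform(df["name"])
 df["address"] = le_address.inverse_transform(df["address"])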



Source: https://stackoverflow.com/questions/43128020/why-does-sklearn-preprocessing-labelencoder-inverse-transform-apply-from-only-on
