Can we make the ML model (pickle file) more robust, by accepting (or ignoring) new features?

妖精的绣舞 posted on 2020-12-25 10:20:27

Question


  • I have trained an ML model and stored it in a pickle file.
  • In my new script, I am reading new 'real world data' on which I want to make a prediction.

However, I am stuck. One column contains string values, for example:

Sex       
Male       
Female
# This is just an example; the real column has many more unique values

Here is the issue: the new data contains a new (unique) value (e.g. 'Neutral' was added), and now I cannot make predictions anymore.

Since I am transforming the 'Sex' column into dummy variables, the model no longer accepts the input and raises:

Number of features of the model must match the input. Model n_features is 2 and input n_features is 3
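The mismatch can be reproduced with a minimal sketch, assuming the dummies are created with pd.get_dummies (the question does not show that step). The two small DataFrames are hypothetical stand-ins for the training data and the new data:

```python
import pandas as pd

# Hypothetical stand-ins for the training data and the new real-world data
train = pd.DataFrame({"Sex": ["Male", "Female", "Male"]})
new = pd.DataFrame({"Sex": ["Male", "Neutral", "Female"]})

# Each frame is dummy-encoded independently, so each one only
# produces columns for the values it actually contains
train_dummies = pd.get_dummies(train)
new_dummies = pd.get_dummies(new)

print(list(train_dummies.columns))  # two dummy columns
print(list(new_dummies.columns))    # three dummy columns, including Sex_Neutral
```

A model fitted on the two training columns then receives three columns at prediction time, which is the feature-count mismatch reported in the error.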

Therefore my question: is there a way to make my model robust, so that it simply ignores this unseen class and still makes a prediction without that specific information?

What I have tried:

import pickle

import pandas as pd

df = pd.read_csv('dataset_that_i_want_to_predict.csv')
with open("model_trained.sav", 'rb') as f:
    model = pickle.load(f)

# I have an 'example_df' containing just 1 row of training data (this is exactly what the model needs)
example_df = pd.read_csv('reading_one_row_of_trainings_data.csv')

# Check for columns that are missing from the new dataset and add them
missing_cols = set(example_df.columns) - set(df.columns)
for column in missing_cols:
    df[column] = 0  # fill missing columns with 0 (which is fine, since everything is a dummy)

# make sure that we have the same order 
df = df[example_df.columns] 

# The prediction will lead to an error!
results = model.predict(df)

# ValueError: Number of features of the model must match the input. Model n_features is X and n_features is Y
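For comparison, here is a rough sketch of the same alignment idea using DataFrame.reindex, assuming the new data is dummy-encoded with pd.get_dummies before it is aligned (a step the snippet above does not show). The column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical column layout the model was trained on
train_columns = ["Age", "Sex_Female", "Sex_Male"]

# Hypothetical new data containing the unseen value 'Neutral'
new = pd.DataFrame({"Age": [31, 45], "Sex": ["Neutral", "Male"]})

# Dummy-encode first, then force the training layout:
# unseen dummy columns (Sex_Neutral) are dropped, and
# training columns absent from the new data are filled with 0
aligned = pd.get_dummies(new).reindex(columns=train_columns, fill_value=0)

print(aligned)
```

With this layout, the 'Neutral' row ends up with 0 in every 'Sex' dummy column, so the model sees the same number of features as during training.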

Note: I searched, but could not find any helpful solution (not here, here or here).

UPDATE

I also found this article, but it has the same issue: we can build the test set with the same columns as the training set, but what about new real-world data (e.g. the new value 'Neutral')?


Answer 1:


You can't add a new category or feature to a model after training is done. However, OneHotEncoder can handle new categories that appear in a feature of the test data. With handle_unknown='ignore', it encodes an unseen category as all zeros, which keeps the columns consistent between your training and test data for categorical variables.
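To see the encoder's behaviour in isolation, here is a minimal sketch with hypothetical category values. It only illustrates how handle_unknown='ignore' treats a value that was not present when the encoder was fitted:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on the categories seen during training (hypothetical values)
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(np.array([["male"], ["female"]], dtype=object))

# Transform data containing an unseen category; the result is sparse
# by default, so convert it to a dense array for inspection
encoded = enc.transform(np.array([["male"], ["neutral"]], dtype=object)).toarray()

print(enc.categories_[0])  # learned categories, in sorted order
print(encoded)             # the row for 'neutral' contains only zeros
```

The output width stays at two columns, so downstream estimators keep receiving the number of features they were fitted with.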

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
from sklearn import set_config
set_config(print_changed_only=True)
df = pd.DataFrame({'feature_1': np.random.rand(20),
                   'feature_2': np.random.choice(['male', 'female'], (20,))})
target = pd.Series(np.random.choice(['yes', 'no'], (20,)))

model = Pipeline([('preprocess',
                   ColumnTransformer([('ohe',
                                       OneHotEncoder(handle_unknown='ignore'), [1])],
                                       remainder='passthrough')),
                  ('lr', LogisticRegression())])

model.fit(df, target)

# let us introduce new categories in feature_2 in test data
test_df = pd.DataFrame({'feature_1': np.random.rand(20),
                        'feature_2': np.random.choice(['male', 'female', 'neutral', 'unknown'], (20,))})
model.predict(test_df)
# array(['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
#       'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
#       'yes', 'yes'], dtype=object)
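Since the question is about a pickled model, one option is to pickle the whole pipeline rather than only the estimator, so the fitted encoder travels with it. The following is a sketch under that assumption, with hypothetical data and an in-memory round trip via pickle.dumps and pickle.loads instead of a file:

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data with two known categories
rng = np.random.default_rng(0)
train_df = pd.DataFrame({"feature_1": rng.random(20),
                         "feature_2": ["male", "female"] * 10})
target = pd.Series(["yes", "no"] * 10)

pipeline = Pipeline([
    ("preprocess", ColumnTransformer(
        [("ohe", OneHotEncoder(handle_unknown="ignore"), ["feature_2"])],
        remainder="passthrough")),
    ("lr", LogisticRegression()),
])
pipeline.fit(train_df, target)

# Serialize encoder and model together, then restore them
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

# New data with an unseen category still yields one prediction per row
new_df = pd.DataFrame({"feature_1": [0.2, 0.8],
                       "feature_2": ["neutral", "male"]})
predictions = restored.predict(new_df)
print(predictions)
```

Writing to a file works the same way with pickle.dump and pickle.load. The prediction script can then pass raw columns to the restored pipeline without building dummy columns by hand.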


Source: https://stackoverflow.com/questions/64910582/can-we-make-the-ml-model-pickle-file-more-robust-by-accepting-or-ignoring-n
