Prediction After One-hot encoding

扶醉桌前 submitted on 2019-12-24 03:00:44

Question


I am experimenting with a sample DataFrame:

import pandas as pd

data = [['Alex','USA',0],['Bob','India',1],['Clarke','SriLanka',0]]

df = pd.DataFrame(data,columns=['Name','Country','Target'])

From here, I used get_dummies to one-hot encode the string columns:

column_names=['Name','Country']

one_hot = pd.get_dummies(df[column_names])

After conversion the columns are: Name_Alex, Name_Bob, Name_Clarke, Country_India, Country_SriLanka, Country_USA

Slicing the data:

x=one_hot[["Name_Alex","Name_Bob","Name_Clarke","Country_India","Country_SriLanka","Country_USA"]].values

y=df['Target'].values

Splitting the dataset into train and test sets:

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(x,y,train_size=0.5,random_state=0)

Logistic Regression

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()

logreg.fit(x_train, y_train)

Now the model is trained.

For prediction, let's say I want to predict the "Target" by giving a "Name" and a "Country", for example ["Alex","USA"].

Prediction:

If I use this:

logreg.predict([["Alex","USA"]])

it obviously will not work, because the model was trained on the one-hot encoded columns, not on raw strings.

Question 1) How do I test the prediction after applying one-hot encoding during training?

Question 2) How do I run predictions on a sample CSV file that contains only "Name" and "Country"?


Answer 1:


I suggest using scikit-learn's LabelEncoder and OneHotEncoder instead of pd.get_dummies.

Fit one label encoder and one one-hot encoder per feature, then save them somewhere. When you want to run predictions on new data, load the saved encoders and use them to encode the features.

This way the new features are encoded in exactly the same way as the training set was.

Below is the code which I use for saving encoders:

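import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# X: 2-D NumPy array of raw (string) features, shape (n_samples, n_features)
labelencoder_dict = {}
onehotencoder_dict = {}
X_train = None
for i in range(0, X.shape[1]):
    # Map each category in column i to an integer and keep the fitted encoder
    label_encoder = LabelEncoder()
    labelencoder_dict[i] = label_encoder
    feature = label_encoder.fit_transform(X[:,i])
    feature = feature.reshape(X.shape[0], 1)
    # One-hot encode the integer codes (use sparse=False on scikit-learn < 1.2)
    onehot_encoder = OneHotEncoder(sparse_output=False)
    feature = onehot_encoder.fit_transform(feature)
    onehotencoder_dict[i] = onehot_encoder
    # Stack the encoded columns side by side
    if X_train is None:
        X_train = feature
    else:
        X_train = np.concatenate((X_train, feature), axis=1)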

Now I save onehotencoder_dict and labelencoder_dict and use them later for encoding.
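The saving step itself is not shown in the answer. A minimal sketch of one way to do it, assuming the standard-library pickle module is acceptable; the file name encoders.pkl and the toy array X are hypothetical stand-ins, not part of the original answer:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the fitted encoder dictionary (one LabelEncoder per column).
X = np.array([["Alex", "USA"], ["Bob", "India"], ["Clarke", "SriLanka"]])
labelencoder_dict = {i: LabelEncoder().fit(X[:, i]) for i in range(X.shape[1])}

# Hypothetical location for the saved encoders.
path = os.path.join(tempfile.gettempdir(), "encoders.pkl")

# Save the fitted encoders after training.
with open(path, "wb") as f:
    pickle.dump(labelencoder_dict, f)

# Later, at prediction time, load them back.
with open(path, "rb") as f:
    loaded_labelencoder_dict = pickle.load(f)

# The reloaded encoder maps a label to the same integer as the original one.
print(loaded_labelencoder_dict[1].transform(["USA"]))
print(labelencoder_dict[1].transform(["USA"]))
```

The same pattern applies to onehotencoder_dict. Pickled estimators are generally only safe to load with the same scikit-learn version that saved them.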

def getEncoded(test_data,labelencoder_dict,onehotencoder_dict):
    # Encode new data column by column with the encoders fitted on the training set
    test_encoded_x = None
    for i in range(0,test_data.shape[1]):
        label_encoder = labelencoder_dict[i]
        feature = label_encoder.transform(test_data[:,i])
        feature = feature.reshape(test_data.shape[0], 1)
        onehot_encoder = onehotencoder_dict[i]
        feature = onehot_encoder.transform(feature)
        if test_encoded_x is None:
            test_encoded_x = feature
        else:
            test_encoded_x = np.concatenate((test_encoded_x, feature), axis=1)
    return test_encoded_x
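
To tie this back to the two questions, here is a hedged end-to-end sketch. It is a simplification rather than the code above: in scikit-learn 0.20 and later, OneHotEncoder accepts string columns directly, so this sketch skips the LabelEncoder step and uses a single encoder for both columns. The in-memory CSV text and the name "Dave" are made up for illustration.

```python
import io

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Training data from the question.
df = pd.DataFrame(
    [["Alex", "USA", 0], ["Bob", "India", 1], ["Clarke", "SriLanka", 0]],
    columns=["Name", "Country", "Target"],
)
feature_cols = ["Name", "Country"]

# Fit the encoder once on the training features. handle_unknown="ignore"
# encodes categories not seen during fit as all zeros instead of raising.
encoder = OneHotEncoder(handle_unknown="ignore")
X_train = encoder.fit_transform(df[feature_cols])

logreg = LogisticRegression()
logreg.fit(X_train, df["Target"])

# Question 1: encode one raw sample with the SAME fitted encoder, then predict.
sample = pd.DataFrame([["Alex", "USA"]], columns=feature_cols)
sample_encoded = encoder.transform(sample)
print(logreg.predict(sample_encoded))

# Question 2: do the same for a CSV holding only Name and Country.
# io.StringIO stands in for a real file path here.
csv_text = "Name,Country\nBob,India\nDave,USA\n"
new_df = pd.read_csv(io.StringIO(csv_text))
preds = logreg.predict(encoder.transform(new_df[feature_cols]))
print(preds)
```

The key point in both cases is that the encoder is only fitted on the training data and only transform is called on new data, so the encoded columns line up with what the model was trained on.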


Source: https://stackoverflow.com/questions/54786266/prediction-after-one-hot-encoding
