I'm getting this error when trying to predict using a model I built in scikit-learn. I know that there are a bunch of questions about this but mine seems different from the
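You can use the pandas CategoricalDtype to pin the test column to the categories seen in the training data. Any value that did not appear in training (here "E" and "T") is then converted to NaN, so get_dummies produces the same columns for the test set as for the training set.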
Input:
import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype
# Create example data
train = pd.DataFrame({"text": ["A", "B", "C", "D", "F", np.nan]})
test = pd.DataFrame({"text": ["D", "D", np.nan, "B", "E", "T"]})
# Convert columns to category dtype and specify categories for test set
train['text'] = train['text'].astype('category')
test['text'] = test['text'].astype(CategoricalDtype(categories=train['text'].cat.categories))
# Create dummies; dtype=int gives the 0/1 values shown in the output (pandas 2.0+ defaults to booleans)
pd.get_dummies(test['text'], dummy_na=True, dtype=int)
Output:
| A | B | C | D | F | NaN |
|---|---|---|---|---|-----|
| 0 | 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 1 |
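
To check that this actually gives the model the same features at predict time, here is a minimal sketch that encodes both frames with one shared dtype and compares the resulting columns. It assumes the error in the question comes from the train and test dummy columns not matching, which the truncated question does not confirm. The names text_dtype, train_dummies, test_dummies, same_columns and unseen_as_nan are only illustrative.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"text": ["A", "B", "C", "D", "F", np.nan]})
test = pd.DataFrame({"text": ["D", "D", np.nan, "B", "E", "T"]})

# Learn the categories from the training column only.
train["text"] = train["text"].astype("category")
text_dtype = train["text"].dtype  # illustrative name, holds categories A, B, C, D, F

# Reuse the same dtype for the test column; "E" and "T" become NaN.
test["text"] = test["text"].astype(text_dtype)

# Encode both frames the same way. dtype=int is passed explicitly because
# recent pandas versions return booleans by default.
train_dummies = pd.get_dummies(train["text"], dummy_na=True, dtype=int)
test_dummies = pd.get_dummies(test["text"], dummy_na=True, dtype=int)

# True when both frames have identical columns in identical order.
same_columns = train_dummies.columns.equals(test_dummies.columns)

# The last column is the NaN indicator; unseen test values land here.
unseen_as_nan = test_dummies.iloc[:, -1].tolist()

print(same_columns)
print(unseen_as_nan)
```

Note that unseen values and genuinely missing values end up in the same NaN column, so the model cannot tell them apart. If that distinction matters, you would need a separate marker for unseen values before encoding.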