When using XGBoost, we need to convert categorical variables into numeric values.
Would there be any difference in performance/evaluation metrics between the method
Here is a code example that adds one-hot encoded columns to a pandas DataFrame with categorical columns:
import pandas as pd

# df is assumed to be an existing DataFrame that contains these columns
ONE_HOT_COLS = ["categorical_col1", "categorical_col2", "categorical_col3"]
print("Starting DF shape: %d, %d" % df.shape)

for col in ONE_HOT_COLS:
    s = df[col].unique()

    # Create a one-hot DataFrame with 1 row for each unique value
    one_hot_df = pd.get_dummies(s, prefix='%s_' % col)
    # Add the original value as a key column to merge on
    one_hot_df[col] = s

    print("Adding One Hot values for %s (the column has %d unique values)" % (col, len(s)))
    pre_len = len(df)

    # Merge the one-hot columns; the left join keeps every original row
    df = df.merge(one_hot_df, on=[col], how="left")
    # The merge must not add or drop rows
    assert len(df) == pre_len

print(df.shape)
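
For comparison, the loop-and-merge approach above can usually be shortened by passing the column list straight to `pd.get_dummies`. The following is a minimal sketch, not a drop-in replacement: the toy DataFrame and its `numeric_col` column are hypothetical, and unlike the merge version this call drops the original categorical columns from the result.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror ONE_HOT_COLS above.
df = pd.DataFrame({
    "categorical_col1": ["a", "b", "a", "c"],
    "categorical_col2": ["x", "x", "y", "y"],
    "numeric_col": [1.0, 2.0, 3.0, 4.0],
})
ONE_HOT_COLS = ["categorical_col1", "categorical_col2"]

# Encode all listed columns in one call. The original categorical
# columns are replaced by their indicator columns; other columns are kept.
encoded = pd.get_dummies(df, columns=ONE_HOT_COLS, dtype=int)

print(encoded.shape)
print(list(encoded.columns))
```

If the original categorical columns should stay in the frame, as they do with the merge approach, one option is to concatenate instead: `pd.concat([df, pd.get_dummies(df[ONE_HOT_COLS])], axis=1)`.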