XGBoost Categorical Variables: Dummification vs encoding


面向向阳花 2021-01-30 00:19

When using XGBoost, we need to convert categorical variables into numeric ones.

Would there be any difference in performance/evaluation metrics between the method

3 Answers

旧时难觅i (OP)

2021-01-30 01:04

Here is a code example that adds one-hot encoded columns to a pandas DataFrame with categorical columns:

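import pandas as pd

# Assumes `df` is an existing DataFrame that contains the columns listed below.
ONE_HOT_COLS = ["categorical_col1", "categorical_col2", "categorical_col3"]
print("Starting DF shape: %d, %d" % df.shape)


for col in ONE_HOT_COLS:
    # Distinct values that appear in this column
    s = df[col].unique()

    # Create a one-hot DataFrame with one row for each unique value.
    # prefix_sep defaults to "_", so the new columns are named "<col>__<value>".
    one_hot_df = pd.get_dummies(s, prefix='%s_' % col)
    one_hot_df[col] = s

    print("Adding One Hot values for %s (the column has %d unique values)" % (col, len(s)))
    pre_len = len(df)

    # Merge the one-hot columns back onto the original rows, keyed on the value
    df = df.merge(one_hot_df, on=[col], how="left")
    # A left join on unique keys must not change the row count
    assert len(df) == pre_len
    print(df.shape)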
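
For comparison, here is a minimal sketch of the two options the question title names, dummification and integer encoding, on a toy DataFrame. The data and the column names "color" and "price" are made up for illustration and do not come from the thread. It only shows what each representation looks like. It says nothing about which one gives better XGBoost metrics.

```python
import pandas as pd

# Toy data; the column names and values are illustrative only.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "price": [1.0, 2.5, 3.0, 2.0],
})

# Dummification: one indicator column per distinct value of "color".
# Passing columns=[...] replaces the original column with its dummies.
one_hot = pd.get_dummies(df, columns=["color"], dtype=int)

# Integer encoding: each distinct value is mapped to an integer code.
# The codes follow the sorted category order, which is arbitrary
# with respect to the target.
encoded = df.copy()
encoded["color"] = encoded["color"].astype("category").cat.codes

print(one_hot.columns.tolist())
print(encoded["color"].tolist())
```

Calling pd.get_dummies directly on the DataFrame with columns=... should produce the same kind of indicator columns as the merge loop above, without the intermediate per-column DataFrames.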
