XGBoost Categorical Variables: Dummification vs encoding

面向向阳花 2021-01-30 00:19

When using XGBoost, we need to convert categorical variables into numeric ones.

Would there be any difference in performance or evaluation metrics between the two methods, dummifying the categorical variables and encoding them?
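
For concreteness, the two options named in the title can be sketched on a toy column as follows. This is only an illustration: the column name `color` and its values are made up, and the integer encoding shown (`cat.codes`) is one common way to do it, not necessarily the one the asker has in mind.

```python
import pandas as pd

# Toy data: the column name "color" and its values are hypothetical.
df = pd.DataFrame({"color": ["a", "b", "c", "a"]})

# Option 1: dummification (one-hot), one 0/1 indicator column per category.
dummies = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Option 2: integer encoding, a single column holding one code per category.
codes = df["color"].astype("category").cat.codes

print(dummies)
print(codes.tolist())
```

Integer codes put the categories on a single numeric axis, so a tree split such as `code < 2` groups `a` and `b` together, while one-hot columns let each category be split off on its own. Whether that changes the evaluation metrics depends on the data.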

3 answers
  •  旧时难觅i
    2021-01-30 01:04

    Here is a code example that adds one-hot encoded columns to a pandas DataFrame with categorical columns:

    import pandas as pd

    # df is assumed to be an existing DataFrame that contains the columns below.
    ONE_HOT_COLS = ["categorical_col1", "categorical_col2", "categorical_col3"]
    print("Starting DF shape: %d, %d" % df.shape)

    for col in ONE_HOT_COLS:
        s = df[col].unique()

        # Create a one-hot DataFrame with 1 row for each unique value.
        # get_dummies adds "_" after the prefix itself, so pass the bare column name.
        one_hot_df = pd.get_dummies(s, prefix=col)
        one_hot_df[col] = s

        print("Adding One Hot values for %s (the column has %d unique values)" % (col, len(s)))
        pre_len = len(df)

        # Left-merge the one-hot columns back onto df; the row count must not change.
        df = df.merge(one_hot_df, on=[col], how="left")
        assert len(df) == pre_len
        print(df.shape)
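
A shorter route to a similar result is to let `pd.get_dummies` handle the whole DataFrame through its `columns` argument. The sketch below uses a small made-up DataFrame as a stand-in for `df` (the `value` column and all the data are hypothetical). Unlike the merge loop above, `get_dummies` drops the original categorical columns, so they are concatenated back here only to mirror that behaviour.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame used in the answer.
df = pd.DataFrame({
    "categorical_col1": ["x", "y", "x"],
    "categorical_col2": ["p", "p", "q"],
    "value": [1.0, 2.0, 3.0],
})
ONE_HOT_COLS = ["categorical_col1", "categorical_col2"]

# One call encodes every listed column and names the results "<col>_<value>".
encoded = pd.get_dummies(df, columns=ONE_HOT_COLS, dtype=int)

# Optional: keep the original categorical columns next to the indicators.
encoded_with_originals = pd.concat([df[ONE_HOT_COLS], encoded], axis=1)

print(encoded.shape)
print(list(encoded.columns))
```

For a model that only accepts numeric input, `encoded` (without the original string columns) is the frame to pass on.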
    
