How to one hot encode several categorical variables in R

再見小時候 2020-12-01 06:55

I'm working on a prediction problem and building a decision tree in R. I have several categorical variables and I'd like to one-hot encode them consistently in my tra

5 answers
  •  我在风中等你
    2020-12-01 07:19

    Here's a simple solution to one-hot-encode your category using no packages.

    Solution

    model.matrix(~0+category)

    Your categorical variable needs to be a factor, and the factor levels must be the same in your training and test data. Check this with levels(train$category) and levels(test$category). It doesn't matter if some levels don't occur in your test set, as long as they are still declared as levels of the factor.
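
    If the levels differ, one way to align them is to rebuild the test factor from the training levels. The snippet below is only a minimal sketch on made-up data: the data frames train and test and their column category are hypothetical stand-ins, not objects from the question. Note that any test value that is not a training level would become NA.

```r
# Hypothetical toy data: the test set never contains the level "c".
train <- data.frame(category = factor(c("a", "b", "c", "a")))
test  <- data.frame(category = c("b", "a"), stringsAsFactors = FALSE)

# Rebuild the test factor with the training levels so both sets share them.
test$category <- factor(test$category, levels = levels(train$category))

# Both calls now produce the same columns in the same order.
oh_train <- model.matrix(~ 0 + category, data = train)
oh_test  <- model.matrix(~ 0 + category, data = test)

identical(levels(train$category), levels(test$category))
identical(colnames(oh_train), colnames(oh_test))
```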

    Example

    Here's an example using the iris dataset.

    data(iris)
    #Split into train and test sets.
    train <- sample(1:nrow(iris),100)
    test <- -1*train
    
    iris[test,]
    
        Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
    34           5.5         4.2          1.4         0.2    setosa
    106          7.6         3.0          6.6         2.1 virginica
    112          6.4         2.7          5.3         1.9 virginica
    127          6.2         2.8          4.8         1.8 virginica
    132          7.9         3.8          6.4         2.0 virginica
    

    model.matrix() creates a column for each level of the factor, even if that level does not occur in the data. A zero means the row is not that level; a one means it is. The 0 in the formula says that you do not want an intercept or reference level; writing -1 instead is equivalent.
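
    Both points can be checked on a small factor. This is just an illustrative sketch; the factor x and its levels are made up for the purpose.

```r
# Hypothetical factor with three declared levels; "b" never occurs in the data.
x <- factor(c("a", "c", "a"), levels = c("a", "b", "c"))

m0 <- model.matrix(~ 0 + x)   # no intercept, written with 0
m1 <- model.matrix(~ x - 1)   # no intercept, written with -1

colnames(m0)    # includes a column for "b" although it never occurs
colSums(m0)     # that column contains only zeros
all(m0 == m1)   # both spellings give the same encoding
```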

    oh_train <- model.matrix(~0+iris[train,'Species'])
    oh_test <- model.matrix(~0+iris[test,'Species'])
    
    # Rename the columns to be more concise, then print the encoded test set.
    colnames(oh_test) <- levels(iris$Species)
    oh_test
    
    
      setosa versicolor virginica
    1      1          0         0
    2      0          0         1
    3      0          0         1
    4      0          0         1
    5      0          0         1
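
    The question mentions several categorical variables. With more than one factor in the formula, model.matrix() by default drops a reference level for every factor after the first, so one possible extension is to pass identity contrasts through its contrasts.arg argument. This is a sketch under assumptions, not part of the original answer: the data frame df and its columns are hypothetical.

```r
# Hypothetical data frame with two categorical columns and one numeric column.
df <- data.frame(
  colour = factor(c("red", "blue", "red")),
  size   = factor(c("S", "M", "L")),
  weight = c(1.2, 3.4, 5.6)
)

# Names of the factor columns.
fac <- names(df)[sapply(df, is.factor)]

# Identity coding (contrasts = FALSE) keeps every level of each factor.
full <- lapply(df[fac], contrasts, contrasts = FALSE)

# Encode all factors at once; numeric columns pass through unchanged.
oh <- model.matrix(~ 0 + ., data = df, contrasts.arg = full)
colnames(oh)
```

    To keep training and test encodings consistent, the same call would be applied to both sets after making sure each factor has identical levels in both.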
    

    P.S. It's generally preferable to include all categories in training and test data. But that's none of my business.
