How to one hot encode several categorical variables in R

再見小時候 2020-12-01 06:55

I'm working on a prediction problem and building a decision tree in R. I have several categorical variables and I'd like to one-hot encode them consistently in my tra

5 Answers
  • 2020-12-01 07:07

    Code

    library(data.table)
    library(mltools)
    customers_1h <- one_hot(as.data.table(customers))
    

    Result

    > customers_1h
       id gender_female gender_male mood_happy mood_sad outcome
    1: 10             0           1          1        0       1
    2: 20             1           0          0        1       1
    3: 30             1           0          1        0       0
    4: 40             0           1          0        1       0
    5: 50             1           0          1        0       0
    

    Data

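    customers <- data.frame(
      id=c(10, 20, 30, 40, 50),
      gender=c('male', 'female', 'female', 'male', 'female'),
      mood=c('happy', 'sad', 'happy', 'sad','happy'),
      outcome=c(1, 1, 0, 0, 0),
      # one_hot() only encodes factor columns, which R >= 4.0 no longer creates by default
      stringsAsFactors=TRUE)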
    
  • 2020-12-01 07:14

    Here is my version. This function encodes every column of class factor, drops one dummy column per factor to avoid the dummy variable trap, and returns a new data frame with the encoded columns:

    onehotencoder <- function(df_orig) {
      df <- df_orig
      # Names of all columns stored as factors
      factor_cols <- names(df)[sapply(df, is.factor)]
      for (col in factor_cols) {
        # One indicator column per level (no intercept)
        dummy_matx <- data.frame(model.matrix(~ . - 1, data = df[col]))
        # Drop the first level's column to avoid the dummy variable trap;
        # drop = FALSE keeps a data frame even for two-level factors
        dummy_matx <- dummy_matx[, -1, drop = FALSE]
        # Replace the original factor column with its dummy columns
        df[col] <- NULL
        df <- cbind(df, dummy_matx)
      }
      return(df)
    }
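
    The idea of dropping one dummy column can be seen on a single factor with base R alone. This is a minimal sketch on made-up data (the names df and mood are illustrative and not taken from the function above): the full encoding has one column per level, while the default treatment contrasts leave out the first level as a reference.

```r
# Hypothetical toy data, not from the answer above
df <- data.frame(mood = factor(c("happy", "sad", "calm", "sad")))

# Full one-hot encoding: one indicator column per level, no intercept
full <- model.matrix(~ 0 + mood, data = df)

# Reference-level encoding: the default treatment contrasts omit the first
# level ("calm"); the intercept column is then dropped by position
reduced <- model.matrix(~ mood, data = df)[, -1, drop = FALSE]

colnames(full)     # "moodcalm" "moodhappy" "moodsad"
colnames(reduced)  # "moodhappy" "moodsad"
```

    In the full encoding every row sums to 1, which is the redundancy that matters for models with an intercept, such as linear regression.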
    
  • 2020-12-01 07:15

    I have a tidy solution that gives the user more control over the whole process. It uses a JavaScript component that splits each cell and stores the resulting column names as JSON. The tidyjson::spread_all function then spreads that JSON into separate columns.

    Save the JavaScript component as encoder.js:

    function oneHotSplitEncoder(inputStrArray, prefix, spliterRegExStr, spliterRegExStrOptions){
      if (Array.isArray(inputStrArray)) {
        return inputStrArray.map(function(str) {
          try{
            if(typeof(str) === 'string' && typeof(spliterRegExStr)==='string' && typeof(spliterRegExStrOptions)==='string' && typeof(prefix) === 'string'){
              return JSON.stringify(
                str.split(RegExp(spliterRegExStr, spliterRegExStrOptions))
                   .reduce(function(p, component){
                     p[prefix + component] = 1;
                     return p;
                   }, {})
              )
            } else {
              return NaN;
            }
          } catch (e) {
            console.warn("\n"+e+"\n"+str+"\n"+spliterRegExStr+' string expected')
            return NaN;
          }
        });
      } else {    
        console.warn("Error: oneHotSplitEncoder function needs array type inputs");
        return NaN;
      }
    };
    

    R components:

    library(dplyr)

    # Create a V8 JavaScript context and load the encoder into it
    js <- V8::v8()
    js$source("encoder.js")

    # Thin R wrapper around the JavaScript function
    oneHotSplitEncoder <- function(inputStrArray, prefix, spliterRegExStr, spliterRegExStrOptions) {
      js$call("oneHotSplitEncoder", inputStrArray, prefix, spliterRegExStr, spliterRegExStrOptions)
    }

    # df is your own data frame; fooColumn holds the delimiter-separated categories
    df_one_hot <- df %>%
      mutate(
        fooColumn = oneHotSplitEncoder(fooColumn, 'prefix.', ' *[,;] *', 'g')
      ) %>%
      bind_cols(tidyjson::spread_all(.$fooColumn) %>% select(-document.id) %>% replace(is.na(.), 0))
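
    The split-then-encode idea can also be sketched in base R, without the JavaScript and JSON steps. This is not the method above but a simplified stand-in using strsplit(); the input vector foo and its values are made up for illustration, and the regular expression and prefix are the ones used in the call above.

```r
# Hypothetical input: each cell lists one or more categories
foo <- c("red, blue", "blue;green", "red")

# Split each cell on commas or semicolons, ignoring surrounding spaces
parts <- strsplit(foo, " *[,;] *")

# Every category seen anywhere becomes one indicator column
categories <- sort(unique(unlist(parts)))
encoded <- do.call(rbind, lapply(parts, function(p) as.integer(categories %in% p)))
colnames(encoded) <- paste0("prefix.", categories)

encoded
```

    Each row of encoded has a 1 in every column whose category appears in the corresponding cell, so a cell can switch on several columns at once.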
    
  • 2020-12-01 07:19

    Here's a simple solution to one-hot-encode your category using no packages.

    Solution

    model.matrix(~0+category)

    Your categorical variable needs to be a factor, and the factor levels must be the same in your training and test data. Check this with levels(train$category) and levels(test$category). It doesn't matter if some levels don't occur in your test set.
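
    One way to guarantee matching levels is to take the level set from the training data and reuse it when building the test factor. The sketch below assumes two small made-up data frames named train and test with a character column category; none of these values come from the question.

```r
# Hypothetical toy data: the test set happens to contain only one category
train <- data.frame(category = c("a", "b", "c", "a"))
test  <- data.frame(category = c("b", "b"))

# Fix the level set from the training data and reuse it for the test data
lv <- sort(unique(train$category))
train$category <- factor(train$category, levels = lv)
test$category  <- factor(test$category,  levels = lv)

oh_train <- model.matrix(~ 0 + category, data = train)
oh_test  <- model.matrix(~ 0 + category, data = test)

# Check that both matrices share the same columns, in the same order
identical(colnames(oh_train), colnames(oh_test))
```

    A test value that never appeared in training becomes NA under this approach, and model.matrix() drops rows containing NA by default, so such rows need separate handling.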

    Example

    Here's an example using the iris dataset.

    data(iris)
    #Split into train and test sets.
    train <- sample(1:nrow(iris),100)
    test <- -1*train
    
    iris[test,]
    
        Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
    34           5.5         4.2          1.4         0.2    setosa
    106          7.6         3.0          6.6         2.1 virginica
    112          6.4         2.7          5.3         1.9 virginica
    127          6.2         2.8          4.8         1.8 virginica
    132          7.9         3.8          6.4         2.0 virginica
    

    model.matrix() creates a column for each level of the factor, even if it is not present in the data. Zero indicates it is not that level, one indicates it is. Adding the zero specifies that you do not want an intercept or reference level and is equivalent to -1.

    oh_train <- model.matrix(~0+iris[train,'Species'])
    oh_test <- model.matrix(~0+iris[test,'Species'])
    
    #Renaming the columns to be more concise.
    colnames(oh_train) <- levels(iris$Species)
    colnames(oh_test) <- levels(iris$Species)
    
    oh_test
      setosa versicolor virginica
    1      1          0         0
    2      0          0         1
    3      0          0         1
    4      0          0         1
    5      0          0         1
    

    P.S. It's generally preferable to include all categories in training and test data. But that's none of my business.

  • 2020-12-01 07:26

    I recommend using the dummyVars function in the caret package:

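    library(caret)

    customers <- data.frame(
      id=c(10, 20, 30, 40, 50),
      gender=c('male', 'female', 'female', 'male', 'female'),
      mood=c('happy', 'sad', 'happy', 'sad','happy'),
      outcome=c(1, 1, 0, 0, 0),
      # keep gender and mood as factors (no longer the default in R >= 4.0)
      stringsAsFactors=TRUE)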
    customers
      id gender  mood outcome
    1 10   male happy       1
    2 20 female   sad       1
    3 30 female happy       0
    4 40   male   sad       0
    5 50 female happy       0
    
    
    # dummify the data
    dmy <- dummyVars(" ~ .", data = customers)
    trsf <- data.frame(predict(dmy, newdata = customers))
    trsf
      id gender.female gender.male mood.happy mood.sad outcome
    1 10             0           1          1        0       1
    2 20             1           0          0        1       1
    3 30             1           0          1        0       0
    4 40             0           1          0        1       0
    5 50             1           0          1        0       0
    

    example source

    Apply the same fitted dummyVars object (dmy) with predict() to both the training and validation sets, so that both end up with identical columns.
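
    As a rough sketch of that workflow, the encoder can be fitted on the training rows only and then reused for new data. The train and test data frames below are a made-up split in the spirit of the customers example, and the claim that a level missing from the test set still gets its column is my understanding of how predict() on a dummyVars object behaves, so it is worth verifying on your own data.

```r
library(caret)

# Hypothetical split of the customers example; the test set has no "sad" rows
train <- data.frame(
  gender = factor(c("male", "female", "female", "male")),
  mood   = factor(c("happy", "sad", "happy", "sad")))
test <- data.frame(
  gender = factor(c("female", "male")),
  mood   = factor(c("happy", "happy")))

# Learn the encoding (variables and their levels) from the training set only
dmy <- dummyVars(~ ., data = train)

# Apply that one fitted encoder to both sets
train_1h <- data.frame(predict(dmy, newdata = train))
test_1h  <- data.frame(predict(dmy, newdata = test))

# Check that both sets ended up with the same columns
identical(names(train_1h), names(test_1h))
```

    Fitting a separate dummyVars object on the test set instead would only produce columns for the levels that happen to occur there, which is what breaks consistency between the two sets.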
