How to one hot encode several categorical variables in R

I'm working on a prediction problem and building a decision tree in R. I have several categorical variables and I'd like to one-hot encode them consistently in my tra

Code

library(data.table)
library(mltools)
# one_hot() only encodes factor columns, so the categorical columns must be factors
customers_1h <- one_hot(as.data.table(customers))

Result

> customers_1h
   id gender_female gender_male mood_happy mood_sad outcome
1: 10             0           1          1        0       1
2: 20             1           0          0        1       1
3: 30             1           0          1        0       0
4: 40             0           1          0        1       0
5: 50             1           0          1        0       0

Data

customers <- data.frame(
  id=c(10, 20, 30, 40, 50),
  gender=c('male', 'female', 'female', 'male', 'female'),
  mood=c('happy', 'sad', 'happy', 'sad','happy'),
  outcome=c(1, 1, 0, 0, 0),
  # Since R 4.0 strings stay character by default; the encoder needs factors
  stringsAsFactors=TRUE)

长发绾君心

2020-12-01 07:14

Hi, here is my version of the same. This function encodes every categorical variable stored as a factor, drops one dummy column per variable to avoid the dummy variable trap, and returns a new data frame with the encoding:

onehotencoder <- function(df_orig) {
  df <- df_orig
  # Names of all factor columns in the input
  factor_cols <- names(df)[sapply(df, is.factor)]
  for (col in factor_cols) {
    # One indicator column per level (no intercept)
    dummy_matx <- data.frame(model.matrix(~ . - 1, data = df[col]))
    # Drop the first level to avoid the dummy variable trap;
    # drop = FALSE keeps a data frame even for two-level factors
    dummy_matx <- dummy_matx[, -1, drop = FALSE]
    # Replace the original factor column with its dummy columns
    df[col] <- NULL
    df <- cbind(df, dummy_matx)
  }
  return(df)
}

情歌与酒

2020-12-01 07:15

I have a tidy solution that gives the user more control over the entire process. It has a JavaScript component that splits each cell and stores the resulting column names as JSON. I then use the tidyjson::spread_all function to spread the JSON into separate columns.

JavaScript component that you need to save as encoder.js:

function oneHotSplitEncoder(inputStrArray, prefix, spliterRegExStr, spliterRegExStrOptions){
  if (Array.isArray(inputStrArray)) {
    return inputStrArray.map(function(str) {
      try{
        if(typeof(str) === 'string' && typeof(spliterRegExStr)==='string' && typeof(spliterRegExStrOptions)==='string' && typeof(prefix) === 'string'){
          return JSON.stringify(
            str.split(RegExp(spliterRegExStr, spliterRegExStrOptions))
               .reduce(function(p, component){
                 p[prefix + component] = 1;
                 return p;
               }, {})
          )
        } else {
          return NaN;
        }
      } catch (e) {
        console.warn("\n"+e+"\n"+str+"\n"+spliterRegExStr+' string expected')
        return NaN;
      }
    });
  } else {    
    console.warn("Error: oneHotSplitEncoder function needs array type inputs");
    return NaN;
  }
};

R components:

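library('dplyr')
# Create a V8 JavaScript context and load the encoder defined above
js <- V8::v8()
js$source("encoder.js")
# R wrapper that forwards its arguments to the JavaScript function
oneHotSplitEncoder <- function(inputStrArray, prefix, spliterRegExStr, spliterRegExStrOptions)
  js$call("oneHotSplitEncoder", inputStrArray, prefix, spliterRegExStr, spliterRegExStrOptions)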

df_one_hot <- df %>%
  mutate(
    fooColumn = oneHotSplitEncoder(fooColumn, 'prefix.', ' *[,;] *', 'g')
  ) %>%
  bind_cols(tidyjson::spread_all(.$fooColumn) %>% select(-document.id) %>% replace(is.na(.), 0))

我在风中等你

2020-12-01 07:19

Here's a simple solution to one-hot-encode your category using no packages.

Solution

model.matrix(~0+category)

It needs your categorical variable to be a factor. The factor levels must be the same in your training and test data; check with levels(train$category) and levels(test$category). It doesn't matter if some levels don't occur in your test set.
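
As a rough sketch of the level check described above: one way to keep the levels identical is to take the level set from the training data and reuse it when building the test factor. The data frames and the `category` values here are made up for illustration and are not part of the original answer.

```r
# Hypothetical train/test data; 'purple' never occurs in the test set
train <- data.frame(category = c('red', 'green', 'purple', 'red'),
                    stringsAsFactors = FALSE)
test  <- data.frame(category = c('green', 'red'),
                    stringsAsFactors = FALSE)

# Take the level set from the training data and reuse it for the test data
lv <- sort(unique(train$category))
train$category <- factor(train$category, levels = lv)
test$category  <- factor(test$category,  levels = lv)

# Both matrices get one column per level, in the same order
oh_train <- model.matrix(~ 0 + category, data = train)
oh_test  <- model.matrix(~ 0 + category, data = test)

identical(colnames(oh_train), colnames(oh_test))
```

The column for the level that is absent from the test set is still present in oh_test and simply contains zeros.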

Example

Here's an example using the iris dataset.

data(iris)
# Split into train and test sets (145 training rows leave 5 rows for testing).
train <- sample(1:nrow(iris), 145)
test <- -1*train

iris[test,]

    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
34           5.5         4.2          1.4         0.2    setosa
106          7.6         3.0          6.6         2.1 virginica
112          6.4         2.7          5.3         1.9 virginica
127          6.2         2.8          4.8         1.8 virginica
132          7.9         3.8          6.4         2.0 virginica

model.matrix() creates a column for each level of the factor, even if it is not present in the data. Zero indicates it is not that level, one indicates it is. Adding the zero specifies that you do not want an intercept or reference level and is equivalent to -1.

oh_train <- model.matrix(~0+iris[train,'Species'])
oh_test <- model.matrix(~0+iris[test,'Species'])

# Rename the columns to be more concise.
colnames(oh_test) <- levels(iris$Species)
oh_test

  setosa versicolor virginica
1      1          0         0
2      0          0         1
3      0          0         1
4      0          0         1
5      0          0         1
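
To illustrate the remark about the intercept and the reference level, here is a small sketch on a made-up factor `f` (not from the original answer) comparing the formula without an intercept to the default formula:

```r
# Hypothetical three-level factor, for illustration only
f <- factor(c('a', 'b', 'c', 'a'))

# No intercept: one indicator column per level
full <- model.matrix(~ 0 + f)

# Default formula: an intercept plus k-1 columns; the first level is the reference
ref <- model.matrix(~ f)

colnames(full)
colnames(ref)

# ~ f - 1 is another way to write ~ 0 + f
identical(colnames(model.matrix(~ f - 1)), colnames(full))
```

For a tree model the full set of indicator columns is usually fine; the version with a reference level matters mostly for linear models, where the full set is collinear with the intercept.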

P.S. It's generally preferable to include all categories in training and test data. But that's none of my business.

长情又很酷

2020-12-01 07:26

I recommend using the dummyVars function in the caret package:

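library(caret)
customers <- data.frame(
  id=c(10, 20, 30, 40, 50),
  gender=c('male', 'female', 'female', 'male', 'female'),
  mood=c('happy', 'sad', 'happy', 'sad','happy'),
  outcome=c(1, 1, 0, 0, 0),
  # Keep the categorical columns as factors so the dummy names match the output below
  stringsAsFactors=TRUE)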
customers
  id gender  mood outcome
1 10   male happy       1
2 20 female   sad       1
3 30 female happy       0
4 40   male   sad       0
5 50 female happy       0

# dummify the data
dmy <- dummyVars(" ~ .", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
trsf
  id gender.female gender.male mood.happy mood.sad outcome
1 10             0           1          1        0       1
2 20             1           0          0        1       1
3 30             1           0          1        0       0
4 40             0           1          0        1       0
5 50             1           0          1        0       0

You apply the same procedure to both the training and validation sets.
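
One possible way to do that, sketched under the assumption that the split below is acceptable (the row indices and the names train_set and valid_set are hypothetical): fit dummyVars once on the training rows and reuse the fitted object for the validation rows, so both get the same columns.

```r
library(caret)

customers <- data.frame(
  id = c(10, 20, 30, 40, 50),
  gender = c('male', 'female', 'female', 'male', 'female'),
  mood = c('happy', 'sad', 'happy', 'sad', 'happy'),
  outcome = c(1, 1, 0, 0, 0),
  stringsAsFactors = TRUE)

# Hypothetical split, for illustration only
train_set <- customers[1:3, ]
valid_set <- customers[4:5, ]

# Fit the encoder on the training rows only
dmy <- dummyVars(~ ., data = train_set)

# Apply the same fitted encoder to both sets
train_enc <- data.frame(predict(dmy, newdata = train_set))
valid_enc <- data.frame(predict(dmy, newdata = valid_set))

identical(colnames(train_enc), colnames(valid_enc))
```

Because the factor levels are recorded when dummyVars is fitted, the validation set gets a column for every training level even if a level does not occur in it.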
