I'm working on a prediction problem and I'm building a decision tree in R. I have several categorical variables and I'd like to one-hot encode them consistently in my training and test sets.
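The one_hot function from the mltools package one-hot encodes the unordered factor columns of a data.table.
Data:
customers <- data.frame(
  id = c(10, 20, 30, 40, 50),
  gender = c('male', 'female', 'female', 'male', 'female'),
  mood = c('happy', 'sad', 'happy', 'sad', 'happy'),
  outcome = c(1, 1, 0, 0, 0),
  # one_hot() only encodes factor columns; since R 4.0 strings stay character by default
  stringsAsFactors = TRUE)
Code:
library(data.table)
library(mltools)
customers_1h <- one_hot(as.data.table(customers))
Result:
> customers_1h
   id gender_female gender_male mood_happy mood_sad outcome
1: 10             0           1          1        0       1
2: 20             1           0          0        1       1
3: 30             1           0          1        0       0
4: 40             0           1          0        1       0
5: 50             1           0          1        0       0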
Here is my version of the same. This function encodes all categorical variables that are stored as factors, drops one dummy variable per factor to avoid the dummy variable trap, and returns a new data frame with the encoding:
onehotencoder <- function(df_orig) {
  df <- df_orig
  # Names of all columns stored as factors
  factor_cols <- names(df)[vapply(df, is.factor, logical(1))]
  for (col in factor_cols) {
    # Full set of dummy columns for this factor (no intercept)
    dummy_matx <- data.frame(model.matrix(~ . - 1, data = df[col]))
    # Drop one dummy column (the second) to avoid the dummy variable trap.
    # Negative indexing also works for two-level factors, where 3:ncol() would fail.
    dummy_matx <- dummy_matx[, -2, drop = FALSE]
    # Replace the original factor column with its dummy columns
    df[col] <- NULL
    df <- cbind(df, dummy_matx)
  }
  return(df)
}
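For comparison, base R can drop a reference level on its own: when the formula keeps its intercept, model.matrix() codes each factor with treatment contrasts, so every factor contributes one column fewer than it has levels. A minimal sketch, assuming a small made-up data frame (df_demo and its columns are hypothetical, not from the question):

```r
# Hypothetical demo data; the column names are made up for illustration.
df_demo <- data.frame(
  id = c(1, 2, 3, 4),
  colour = factor(c("red", "green", "blue", "green")),
  size = factor(c("small", "large", "small", "large"))
)

# With an intercept, each factor is coded with treatment contrasts:
# the first level becomes the reference and gets no column.
mm <- model.matrix(~ ., data = df_demo)

# Drop the constant intercept column, keeping id and the dummy columns.
encoded <- as.data.frame(mm[, -1, drop = FALSE])
print(encoded)
```

Note that this drops the first level of each factor, whereas the function above drops the second dummy column.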
I have a tidy solution that gives the user more control over the entire process. It has a JavaScript component that splits each cell and stores the resulting column names as JSON. I then use the tidyjson::spread_all function to spread the JSON into separate columns.
JavaScript component, which you need to save as encoder.js:
function oneHotSplitEncoder(inputStrArray, prefix, spliterRegExStr, spliterRegExStrOptions) {
  if (Array.isArray(inputStrArray)) {
    return inputStrArray.map(function(str) {
      try {
        if (typeof(str) === 'string' && typeof(spliterRegExStr) === 'string' && typeof(spliterRegExStrOptions) === 'string' && typeof(prefix) === 'string') {
          return JSON.stringify(
            str.split(RegExp(spliterRegExStr, spliterRegExStrOptions))
              .reduce(function(p, component) {
                p[prefix + component] = 1;
                return p;
              }, {})
          );
        } else {
          return NaN;
        }
      } catch (e) {
        console.warn("\n" + e + "\n" + str + "\n" + spliterRegExStr + ' string expected');
        return NaN;
      }
    });
  } else {
    console.warn("Error: oneHotSplitEncoder function needs array type inputs");
    return NaN;
  }
};
R components:
library('dplyr')
# Start a V8 JavaScript context and load the encoder
js <- V8::v8()
js$source("encoder.js")
# R wrapper around the JavaScript function
oneHotSplitEncoder <- function(inputStrArray, prefix, spliterRegExStr, spliterRegExStrOptions) {
  js$call("oneHotSplitEncoder", inputStrArray, prefix, spliterRegExStr, spliterRegExStrOptions)
}
# df is your data frame; fooColumn holds the delimited values to encode
df_one_hot <- df %>%
  mutate(
    fooColumn = oneHotSplitEncoder(fooColumn, 'prefix.', ' *[,;] *', 'g')
  ) %>%
  bind_cols(tidyjson::spread_all(.$fooColumn) %>% select(-document.id) %>% replace(is.na(.), 0))
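The same split-then-encode idea can also be sketched in base R, without the JavaScript step. This is not the approach above, only an illustration of the technique; split_one_hot and the sample tags vector are hypothetical names introduced here.

```r
# Hypothetical helper: one-hot encode cells that hold several delimited values.
split_one_hot <- function(x, prefix, pattern = " *[,;] *") {
  # Split every cell on the delimiter pattern.
  parts <- strsplit(x, pattern)
  # All distinct components define the output columns.
  all_levels <- sort(unique(unlist(parts)))
  # One row per cell, one column per component; 1 if the component occurs.
  out <- do.call(rbind, lapply(parts, function(p) as.numeric(all_levels %in% p)))
  colnames(out) <- paste0(prefix, all_levels)
  as.data.frame(out)
}

# Made-up input: cells separated by commas or semicolons.
tags <- c("a, b", "b;c", "a")
encoded <- split_one_hot(tags, "prefix.")
print(encoded)
```

Because the columns are derived from the values present in x, a test set encoded separately could end up with different columns; passing a fixed set of levels would be one way around that.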
Here's a simple solution to one-hot-encode your category using no packages.
model.matrix(~0+category)
It needs your categorical variable to be a factor. The factor levels must be the same in your training and test data; check with levels(train$category) and levels(test$category). It doesn't matter if some levels don't occur in your test set.
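One way to guarantee matching levels is to rebuild the test factor from the training levels before encoding. A minimal sketch, assuming made-up train and test data frames with a category column (the data here is hypothetical):

```r
# Hypothetical train/test data; the test set lacks level "c".
train <- data.frame(category = factor(c("a", "b", "c", "a")))
test  <- data.frame(category = c("b", "a"))

# Re-create the test factor with exactly the training levels.
# Values not seen in training would become NA here.
test$category <- factor(test$category, levels = levels(train$category))

oh_train <- model.matrix(~ 0 + category, data = train)
oh_test  <- model.matrix(~ 0 + category, data = test)

# Both matrices now have the same columns in the same order.
print(colnames(oh_train))
print(colnames(oh_test))
```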
Here's an example using the iris dataset.
data(iris)
# Split into train and test sets: 145 training rows leave the 5 test rows shown below.
train <- sample(1:nrow(iris), 145)
test <- -1 * train
iris[test, ]
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
34           5.5         4.2          1.4         0.2    setosa
106          7.6         3.0          6.6         2.1 virginica
112          6.4         2.7          5.3         1.9 virginica
127          6.2         2.8          4.8         1.8 virginica
132          7.9         3.8          6.4         2.0 virginica
model.matrix() creates a column for each level of the factor, even if that level is not present in the data. A zero indicates that the row is not that level, a one indicates that it is. Adding the zero to the formula specifies that you do not want an intercept or reference level and is equivalent to -1.
oh_train <- model.matrix(~0+iris[train,'Species'])
oh_test <- model.matrix(~0+iris[test,'Species'])
# Renaming the columns to be more concise.
colnames(oh_test) <- levels(iris$Species)
oh_test
  setosa versicolor virginica
1      1          0         0
2      0          0         1
3      0          0         1
4      0          0         1
5      0          0         1
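As a quick check of the two claims about model.matrix(), the sketch below compares the ~0+ and -1 forms and shows that unused levels still get a column. It assumes a subset of iris that contains only setosa rows; the variable names are made up for illustration.

```r
data(iris)
# Keep only setosa rows; the factor still carries all three levels.
setosa_only <- iris[iris$Species == "setosa", ]

mm_zero  <- model.matrix(~ 0 + Species, data = setosa_only)
mm_minus <- model.matrix(~ Species - 1, data = setosa_only)

# Both formulas give the same columns and the same values.
print(identical(colnames(mm_zero), colnames(mm_minus)))
print(all(mm_zero == mm_minus))

# Levels that never occur still get a column, filled with zeros.
print(colSums(mm_zero))
```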
P.S. It's generally preferable to include all categories in training and test data. But that's none of my business.
I recommend using the dummyVars function in the caret package:
customers <- data.frame(
id=c(10, 20, 30, 40, 50),
gender=c('male', 'female', 'female', 'male', 'female'),
mood=c('happy', 'sad', 'happy', 'sad','happy'),
outcome=c(1, 1, 0, 0, 0))
customers
  id gender  mood outcome
1 10   male happy       1
2 20 female   sad       1
3 30 female happy       0
4 40   male   sad       0
5 50 female happy       0
library(caret)
# dummify the data
dmy <- dummyVars(" ~ .", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
trsf
  id gender.female gender.male mood.happy mood.sad outcome
1 10             0           1          1        0       1
2 20             1           0          0        1       1
3 30             1           0          1        0       0
4 40             0           1          0        1       0
5 50             1           0          1        0       0
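Apply the same procedure to both the training and validation sets: reuse the same dmy object and call predict(dmy, newdata = ...) on each set, so that both end up with the same columns.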
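A minimal sketch of that workflow, assuming the customers data is stored with factor columns and split by row position (the split, train_set and valid_set are hypothetical choices made for illustration):

```r
library(caret)

# The customers data with the categorical columns stored as factors
# (an assumption of this sketch, so that each subset keeps every level).
customers <- data.frame(
  id = c(10, 20, 30, 40, 50),
  gender = factor(c("male", "female", "female", "male", "female")),
  mood = factor(c("happy", "sad", "happy", "sad", "happy")),
  outcome = c(1, 1, 0, 0, 0)
)
# Hypothetical split by row position.
train_set <- customers[1:3, ]
valid_set <- customers[4:5, ]

# Learn the encoding from the training rows only.
dmy <- dummyVars(~ ., data = train_set)

# Apply that one encoder to both sets.
train_1h <- data.frame(predict(dmy, newdata = train_set))
valid_1h <- data.frame(predict(dmy, newdata = valid_set))

# The two encoded sets share the same columns.
print(identical(colnames(train_1h), colnames(valid_1h)))
```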