I have a variable that is a factor:
$ year : Factor w/ 8 levels \"2003\",\"2004\",..: 4 6 4 2 4 1 3 3 7 2 ...
I would like to c
The caret package provides a very simple function, dummyVars, for creating dummy variables, which is especially handy when you have more than one factor variable. You do have to make sure the target variables are factors: for example, if Sales$year is numeric, convert it first with Sales$year <- as.factor(Sales$year).
Suppose we have the original dataset 'Sales' as follows:
   year    Sales Region
1  2010 3695.543  North
2  2010 9873.037   West
3  2008 3579.458   West
4  2005 2788.857  North
5  2005 2952.183  North
6  2008 7255.337   West
7  2005 5237.081   West
8  2010 8987.096  North
9  2008 5545.343  North
10 2008 1809.446   West
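To follow along, the table above can be rebuilt as a data frame. This is only a sketch: the values are copied from the table, and both year and Region are converted to factors explicitly, since it is an assumption that the original Sales stored them that way.

```r
# Rebuild the example data shown above (values copied from the table)
Sales <- data.frame(
  year   = c(2010, 2010, 2008, 2005, 2005, 2008, 2005, 2010, 2008, 2008),
  Sales  = c(3695.543, 9873.037, 3579.458, 2788.857, 2952.183,
             7255.337, 5237.081, 8987.096, 5545.343, 1809.446),
  Region = c("North", "West", "West", "North", "North",
             "West", "West", "North", "North", "West")
)

# year is numeric here (and Region may be character, depending on the
# R version), so convert both to factors before creating dummies
Sales$year   <- as.factor(Sales$year)
Sales$Region <- as.factor(Sales$Region)

str(Sales)
```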
Now we can create the dummy variables for both factors (year and Region) at once:
> library(lattice)
> library(ggplot2)
> library(caret)
> Salesdummy <- dummyVars(~ ., data = Sales, levelsOnly = TRUE)
> Sdummy <- predict(Salesdummy, Sales)
The outcome will be:
   2005 2008 2010    Sales RegionNorth RegionWest
1     0    0    1 3695.543           1          0
2     0    0    1 9873.037           0          1
3     0    1    0 3579.458           0          1
4     1    0    0 2788.857           1          0
5     1    0    0 2952.183           1          0
6     0    1    0 7255.337           0          1
7     1    0    0 5237.081           0          1
8     0    0    1 8987.096           1          0
9     0    1    0 5545.343           1          0
10    0    1    0 1809.446           0          1
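For comparison, roughly the same matrix can be built without caret. The sketch below swaps in a different technique, base R's model.matrix with full (identity) contrasts for every factor; it is not a description of what dummyVars does internally. Sdummy_base and full_contrasts are names made up for this example, and the year columns keep their variable-name prefix (year2005 rather than 2005).

```r
# Same example data as above, with year and Region stored as factors
Sales <- data.frame(
  year   = factor(c(2010, 2010, 2008, 2005, 2005, 2008, 2005, 2010, 2008, 2008)),
  Sales  = c(3695.543, 9873.037, 3579.458, 2788.857, 2952.183,
             7255.337, 5237.081, 8987.096, 5545.343, 1809.446),
  Region = factor(c("North", "West", "West", "North", "North",
                    "West", "West", "North", "North", "West"))
)

# Identity ("full") contrasts for every factor, so no level is dropped
is_fac <- vapply(Sales, is.factor, logical(1))
full_contrasts <- lapply(Sales[is_fac], contrasts, contrasts = FALSE)

# "- 1" removes the intercept column; numeric columns pass through unchanged
Sdummy_base <- model.matrix(~ . - 1, data = Sales,
                            contrasts.arg = full_contrasts)
colnames(Sdummy_base)
```

Because every factor gets one column per level, the columns within each factor sum to 1 in every row, so drop one level per factor before using the matrix in a regression that has an intercept.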
This is as concise as I could get. The na.action option takes care of the NA values (I would rather do this with an argument than with a global options setting, but I can't see how). The naming of columns is pretty deeply hard-coded, and I don't see any way to override it within model.matrix, so the columns are renamed afterwards with setNames:
options(na.action=na.pass)
dt1 <- data.frame(year=factor(c(NA,2003:2005)))
dt2 <- setNames(cbind(dt1,model.matrix(~year-1,data=dt1)),
c("year",levels(dt1$year)))
As pointed out above, column names that are not legal R variable names (here 2003, 2004 and 2005) may get you into trouble in some contexts. The resulting dt2:
  year 2003 2004 2005
1 <NA>   NA   NA   NA
2 2003    1    0    0
3 2004    0    1    0
4 2005    0    0    1
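One possible way around the global options setting, sketched here rather than offered as the definitive fix, is to build the model frame yourself with na.action = na.pass as an argument and hand that frame to model.matrix. Keeping model.matrix's default column names (year2003, ...) also sidesteps the illegal-name problem. The names f, mf, mm and dt2_alt are made up for this example.

```r
dt1 <- data.frame(year = factor(c(NA, 2003:2005)))

f <- ~ year - 1

# Keep NA rows via an argument instead of options(na.action = na.pass)
mf <- model.frame(f, data = dt1, na.action = na.pass)

# model.matrix() uses the supplied model frame as is, so no row is dropped
mm <- model.matrix(f, data = mf)

# The default column names (year2003, ...) are valid R names, so keep them
dt2_alt <- cbind(dt1, mm)
dt2_alt
```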
You could use ifelse(), which won't omit NA rows (but I guess you might not count it as being "as concise as possible"):
dt1 <- data.frame(year=factor(rep(2003:2010, 10))) # example data
dt1 <- within(dt1, yr2003<-ifelse(year=="2003", 1, 0))
dt1 <- within(dt1, yr2004<-ifelse(year=="2004", 1, 0))
dt1 <- within(dt1, yr2005<-ifelse(year=="2005", 1, 0))
# ...
head(dt1)
# year yr2003 yr2004 yr2005
# 1 2003 1 0 0
# 2 2004 0 1 0
# 3 2005 0 0 1
# 4 2006 0 0 0
# 5 2007 0 0 0
# 6 2008 0 0 0
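The repeated within() lines can be collapsed into a loop over the factor levels. This is a small sketch of the same ifelse() idea; the yr prefix follows the naming used above, and lev is just a loop variable chosen for the example.

```r
dt1 <- data.frame(year = factor(rep(2003:2010, 10)))  # same example data

# One 0/1 column per level; the "yr" prefix keeps the names legal in R
for (lev in levels(dt1$year)) {
  dt1[[paste0("yr", lev)]] <- ifelse(dt1$year == lev, 1, 0)
}

head(dt1, 3)
```

As with the hand-written version, a row whose year is NA gets NA in every dummy column instead of being dropped.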