If I have the following data.frame, how would I go about creating a dummy variable for each year and attaching it to DF, so that there would be additional columns year2010 and year2011? I have a fairly large dataset with many different years and I don't want to use ifelse 50 times. Would ddply help?
Thanks
    DF <- read.table(text="
    year id var ans
    2010  1   1   1
    2010  2   0   0
    2010  1   0   1
    2010  1   0   1
    2011  2   1   1
    2011  2   0   1
    2011  1   0   0
    2011  1   0   0", header=TRUE)
Desired output:

      year id var ans year_2010 year_2011
    1 2010  1   1   1         1         0
    2 2010  2   0   0         1         0
    3 2010  1   0   1         1         0
    4 2010  1   0   1         1         0
    5 2011  2   1   1         0         1
    6 2011  2   0   1         0         1
    7 2011  1   0   0         0         1
    8 2011  1   0   0         0         1
Just use table, like this:

    cbind(DF, as.data.frame.matrix(table(sequence(nrow(DF)), DF$year)))
    #   year id var ans 2010 2011
    # 1 2010  1   1   1    1    0
    # 2 2010  2   0   0    1    0
    # 3 2010  1   0   1    1    0
    # 4 2010  1   0   1    1    0
    # 5 2011  2   1   1    0    1
    # 6 2011  2   0   1    0    1
    # 7 2011  1   0   0    0    1
    # 8 2011  1   0   0    0    1
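If you want the year_ prefix from the question with this approach, one option is to rename the columns before binding. This is only a minimal sketch on a cut-down toy version of DF; the names dummies and out are just placeholders for illustration.

```r
# Toy data: a shortened stand-in for the question's DF.
DF <- data.frame(year = c(2010, 2010, 2011, 2011),
                 id   = c(1, 2, 2, 1))

# One row per observation, one column per distinct year.
dummies <- as.data.frame.matrix(table(seq_len(nrow(DF)), DF$year))

# Prefix the column names so they read year_2010, year_2011, ...
names(dummies) <- paste0("year_", names(dummies))

out <- cbind(DF, dummies)
```

Here seq_len(nrow(DF)) plays the same role as sequence(nrow(DF)) above: it gives each row its own level, so every cell of the table is either 0 or 1.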
You should also be able to do something like this:
    library(data.table)
    cbind(DF, dcast.data.table(as.data.table(DF, keep.rownames = TRUE),
                               rn ~ year, value.var = "id",
                               fun.aggregate = length))
    #   year id var ans rn 2010 2011
    # 1 2010  1   1   1  1    1    0
    # 2 2010  2   0   0  2    1    0
    # 3 2010  1   0   1  3    1    0
    # 4 2010  1   0   1  4    1    0
    # 5 2011  2   1   1  5    0    1
    # 6 2011  2   0   1  6    0    1
    # 7 2011  1   0   0  7    0    1
    # 8 2011  1   0   0  8    0    1
If you want the names to be "year_2010" and so on, I guess a workaround would be to do something like this:
    dcast.data.table(as.data.table(DF, keep.rownames = TRUE)[, yr := "year"],
                     rn ~ yr + year, value.var = "id",
                     fun.aggregate = length)
You can also always write your own function. Here's one I've whipped together that should be reasonably efficient:
    dummyCreator <- function(invec, prefix = NULL) {
      L <- length(invec)
      ColNames <- sort(unique(invec))
      # Start from an all-zero integer matrix with one column per distinct value
      M <- matrix(0L, ncol = length(ColNames),
                  nrow = L, dimnames = list(NULL, ColNames))
      # Matrix indexing: set cell (row i, column of invec[i]) to 1
      M[cbind(seq_len(L), match(invec, ColNames))] <- 1L
      if (!is.null(prefix)) colnames(M) <- paste(prefix, colnames(M), sep = "_")
      M
    }

    dummyCreator(DF$year, prefix = "year")
    #      year_2010 year_2011
    # [1,]         1         0
    # [2,]         1         0
    # [3,]         1         0
    # [4,]         1         0
    # [5,]         0         1
    # [6,]         0         1
    # [7,]         0         1
    # [8,]         0         1

Just use cbind as above to get the output you expect.
Here is my favorite code for creating dummy variables from a categorical variable. The only difference is that this code produces K-1 dummy variables to avoid collinearity:
    x = as.factor(rep(1:6, each = 4))
    model.matrix(~x)[, -1]

Substitute x with the year from your data set.
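As a rough sketch of that substitution (on a toy DF with only the year column, and with yr, k_minus_1, full and out being names made up for this example), you can get either the K-1 version or, by dropping the intercept, all K columns as in the question:

```r
# Toy data mirroring the year column of the question's DF.
DF <- data.frame(year = c(2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011))

yr <- factor(DF$year)

# K-1 dummies: the first level (2010) is the baseline and gets no column.
k_minus_1 <- model.matrix(~ yr)[, -1, drop = FALSE]

# All K dummies: "0 +" removes the intercept, so every level keeps a column.
full <- model.matrix(~ 0 + yr)
colnames(full) <- paste0("year_", levels(yr))

out <- cbind(DF, full)
```

The drop = FALSE matters when there are only two levels; without it the single remaining column collapses to a plain vector.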
maybe this?
    library(tidyr)
    DF$row <- 1:nrow(DF)  # to make each row unique
    DF$dummy <- 1
    newdf <- spread(DF, year, dummy, fill = 0)
    for (i in unique(DF$year)) {
      DF[paste('year', i, sep = "")] = DF$year == i
    }
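Note that the loop stores TRUE/FALSE rather than 1/0. If you need integers, a loop-free variant along the same lines might look like the sketch below. It uses a small toy DF, and yrs and dummies are names invented for the example.

```r
# Toy data: a shortened stand-in for the question's DF.
DF <- data.frame(year = c(2010, 2010, 2011, 2011),
                 id   = c(1, 2, 2, 1))

yrs <- sort(unique(DF$year))

# outer() compares every row's year with every distinct year in one call;
# adding 0L converts the resulting logical matrix to integer 0/1.
dummies <- outer(DF$year, yrs, "==") + 0L
colnames(dummies) <- paste0("year_", yrs)

DF <- cbind(DF, dummies)
```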
As Andrey Shabalin mentioned, you want model.matrix. First you need to convert the year column to be a factor. To get exactly what you want, you need to use contr.ltfr, a modified version of contr.treatment in the caret package.
In the formula below, 0 means don't use an intercept and . represents all the columns in the data frame.
    library(caret)  # provides contr.ltfr
    DF$year <- factor(DF$year)
    model.matrix(
      ~ 0 + .,
      DF,
      contrasts.arg = list(year = "contr.ltfr")
    )