Creating multiple dummies from an existing data frame or data table

Submitted by China☆狼群 on 2019-12-04 14:04:50

If you only need the four items in that list, you should just tabulate:

indcols <- paste0('index', 1:3)
lapply(new[, indcols, with=FALSE], table) # counts per level
lapply(new[, indcols, with=FALSE], function(x) prop.table(table(x))) # proportions, i.e. the means of the dummies

# or...

lapply(
  new[,indcols,with=FALSE],
  function(x){
    z<-table(x)
    rbind(count=z,mean=prop.table(z))
  })

This gives

$index1
          a     b     c     d     e
count 200.0 200.0 200.0 200.0 200.0
mean    0.2   0.2   0.2   0.2   0.2

$index2
          f     g     h     i     j     k     l     m     n     o
count 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
mean    0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1

$index3
           p      q      r      s
count 250.00 250.00 250.00 250.00
mean    0.25   0.25   0.25   0.25
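
The data.table `new` comes from the question and is not reproduced here. As a minimal sketch, assuming 1000 rows with perfectly balanced levels (an assumption chosen only to be consistent with the output shown above), a stand-in can be built and tabulated like this:

```r
library(data.table)

# Hypothetical stand-in for the question's data: 1000 rows, balanced levels.
new <- data.table(
  id     = 1:1000,
  index1 = rep(letters[1:5],   each  = 200),  # a..e
  index2 = rep(letters[6:15],  times = 100),  # f..o
  index3 = rep(letters[16:19], each  = 250)   # p..s
)

indcols <- paste0("index", 1:3)

# One matrix per variable: first row counts, second row proportions.
res <- lapply(new[, indcols, with = FALSE], function(x) {
  z <- table(x)
  rbind(count = z, mean = prop.table(z))
})

res$index3
```

Each element of `res` is a two-row matrix, so a single count can be pulled out with, for example, `res$index1["count", "a"]`.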


The previous approach would work on a data.frame or a data.table (for a plain data.frame, drop the with=FALSE argument), but is rather complicated. With a data.table, one can instead melt to long format and count by group:

melt(new, id="id")[,.(
  N=.N, 
  mean=.N/nrow(new)
), by=.(variable,value)]

which gives

    variable value   N mean
 1:   index1     a 200 0.20
 2:   index1     b 200 0.20
 3:   index1     c 200 0.20
 4:   index1     d 200 0.20
 5:   index1     e 200 0.20
 6:   index2     f 100 0.10
 7:   index2     g 100 0.10
 8:   index2     h 100 0.10
 9:   index2     i 100 0.10
10:   index2     j 100 0.10
11:   index2     k 100 0.10
12:   index2     l 100 0.10
13:   index2     m 100 0.10
14:   index2     n 100 0.10
15:   index2     o 100 0.10
16:   index3     p 250 0.25
17:   index3     q 250 0.25
18:   index3     r 250 0.25
19:   index3     s 250 0.25

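This approach was suggested by @Arun in a comment (and, I believe, implemented by him as well). To see how it works, first look at the output of melt(new, id="id"), which reshapes the original data.table into long format: one row per combination of id and variable, with the level stored in a value column.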
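
To make the reshaping step concrete, here is a small sketch on a hypothetical four-row table (`toy` and its contents are made up for illustration; `id.vars` is the full name of the argument abbreviated as `id` above):

```r
library(data.table)

# Hypothetical toy data: two categorical variables, four rows.
toy <- data.table(
  id     = 1:4,
  index1 = c("a", "a", "b", "b"),
  index2 = c("f", "g", "f", "g")
)

# Long format: columns id, variable, value; one row per id and variable.
long <- melt(toy, id.vars = "id")
long

# Counting rows per (variable, value) pair gives the tabulation.
tab <- long[, .(N = .N, mean = .N / nrow(toy)), by = .(variable, value)]
tab
```

Because every original row contributes exactly one long row per variable, dividing `.N` by the number of original rows gives the share of each level within its variable.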

As mentioned in the comments, some versions of the data.table package require installing and loading reshape2 before a data.table can be melted.



If you also need the dummies, they can be made in a loop as in the linked question:

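newcols <- list()
for (i in indcols){
    vals <- unique(new[[i]])                  # levels of this variable
    newcols[[i]] <- paste(vals, i, sep='_')   # dummy names, e.g. "a_index1"
    # add one logical column per level, by reference
    new[, (newcols[[i]]) := lapply(vals, function(x) get(i) == x)]
}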

This stores the group of dummy columns associated with each variable in newcols for convenience. If you wanted to do the tabulation using only these dummies (instead of the underlying variables, as in the solution above), you could do

lapply(
  indcols,
  function(i) new[,lapply(.SD,function(x){
    z <- sum(x)
    list(z,z/.N)
  }),.SDcols=newcols[[i]] ])

which gives a similar result. I just wrote it this way to illustrate how data.table syntax can be used. You could again avoid square brackets and .SD here:

lapply(
  indcols,
  function(i) sapply(
    new[, newcols[[i]], with=FALSE],
    function(x){
      z<-sum(x)
      rbind(z,z/length(x))
    }))
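
As a sanity check that the dummy-based counts agree with table, here is a minimal sketch on a hypothetical six-row table. The names `toy`, `dummy_counts` and `table_counts` are made up for illustration, and the loop uses `toy[[i]]` in place of `get(i)` purely to keep the example easy to follow:

```r
library(data.table)

# Hypothetical toy data with unbalanced levels.
toy <- data.table(id = 1:6, index1 = c("a", "a", "a", "b", "b", "c"))
indcols <- "index1"

# Build one logical dummy column per level, as in the loop above.
newcols <- list()
for (i in indcols) {
  vals <- unique(toy[[i]])
  newcols[[i]] <- paste(vals, i, sep = "_")
  toy[, (newcols[[i]]) := lapply(vals, function(x) toy[[i]] == x)]
}

# Column sums of the dummies versus a direct tabulation.
dummy_counts <- sapply(toy[, newcols[["index1"]], with = FALSE], sum)
table_counts <- table(toy$index1)

dummy_counts
table_counts
```

Both objects hold the same counts; only the names differ (`a_index1` versus `a`, and so on).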

But anyway: just use table if you can hold onto the underlying variables.
