Multiple functions on multiple columns by group, and create informative column names

心在旅途 2020-12-16 19:17

How can I adjust a data.table manipulation so that, besides the sum per category of several columns, it also calculates other functions at the same time?

5 answers
  •  粉色の甜心
    2020-12-16 20:00

    If the summary statistics you need are things like mean, .N, and (perhaps) median, which data.table optimizes into C code across the by groups, you may get faster performance by converting the table to long form, so that the computations are expressed in a way data.table can optimize:

    > library(data.table)
    > n = 100000
    > dt  = data.table(index=1:n,
                       category = sample(letters[1:25], n, replace = TRUE),
                       c1=rnorm(n,10000),
                       c2=rnorm(n,1000),
                       c3=rnorm(n,100),
                       c4 = rnorm(n,10)
      )
    > # add filler columns c5..c100 by reference
    > dt[, paste0('c', 5:100) := replicate(96, rnorm(n,1000), simplify=FALSE)]
    
    > Colchoice <- c("c1", "c4")
    
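    > dt[, .SD
         ][, c('index', 'category', Colchoice), with=FALSE    # keep the id columns and the chosen columns
         ][, melt(.SD, id.vars=c('index', 'category'))        # long form: columns variable and value
         ][, mean := mean(value), .(category, variable)
         ][, median := median(value), .(category, variable)
         ][, N := .N, .(category, variable)
         ][, value := NULL                                    # drop the per-row columns ...
         ][, index := NULL
         ][, unique(.SD)                                      # ... so one row per (category, variable) remains
         ][, dcast(.SD, category ~ variable, value.var=c('mean', 'median', 'N'))  # wide again: mean_c1, mean_c4, ...
         ]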
    
        category mean_c1 mean_c4 median_c1 median_c4 N_c1 N_c4
     1:        a   10000  10.021     10000    10.041 4128 4128
     2:        b   10000  10.012     10000    10.003 3942 3942
     3:        c   10000  10.005     10000     9.999 3926 3926
     4:        d   10000  10.002     10000    10.007 4046 4046
     5:        e   10000   9.974     10000     9.993 4037 4037
     6:        f   10000  10.025     10000    10.015 4009 4009
     7:        g   10000   9.994     10000     9.998 4012 4012
     8:        h   10000  10.007     10000     9.986 3950 3950
    ...
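
The same melt, summarise, dcast idea can also be written more compactly by aggregating in j with by, instead of adding columns with := and then calling unique(). What follows is only a sketch under assumptions, not the answerer's code: the table is smaller (n = 1000, five categories), the seed is fixed for reproducibility, and the intermediate names long, stats and res are made up for illustration. Whether it is faster than the chain above on your data is something to benchmark rather than assume.

```r
library(data.table)

set.seed(42)   # assumption: fixed seed, only so the sketch is reproducible
n  <- 1000     # assumption: smaller than the answer's table, for a quick run
dt <- data.table(index    = 1:n,
                 category = sample(letters[1:5], n, replace = TRUE),
                 c1       = rnorm(n, 10000),
                 c4       = rnorm(n, 10))
Colchoice <- c("c1", "c4")

# Long form: one row per (index, category, variable), values in column "value"
long <- melt(dt, id.vars = c("index", "category"), measure.vars = Colchoice)

# One aggregated row per (category, variable); no := or unique() needed
stats <- long[, .(mean = mean(value), median = median(value), N = .N),
              by = .(category, variable)]

# Back to wide form; with several value.var entries dcast names the
# columns <statistic>_<variable>, e.g. mean_c1, median_c4, N_c1
res <- dcast(stats, category ~ variable, value.var = c("mean", "median", "N"))
print(res)
```

To add another statistic, add one more named entry to the .( ) call and list its name in value.var; the resulting column names follow the same pattern.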
    
