Multiple functions on multiple columns by group, and create informative column names


心在旅途 2020-12-16 19:17

How can I adjust a data.table manipulation so that, besides the sum per category of several columns, it also calculates other functions at the same time, such as <

5 answers

粉色の甜心 (original poster)

2020-12-16 20:00

If the summary statistics you need are things like mean, .N, and (perhaps) median, which data.table optimizes into C code when grouping with by, you may get faster performance by first converting the table to long form, so that the computations are written in a way data.table can optimize:

> library(data.table)
> n = 100000
> dt  = data.table(index=1:n,
                   category = sample(letters[1:25], n, replace = TRUE),
                   c1=rnorm(n,10000),
                   c2=rnorm(n,1000),
                   c3=rnorm(n,100),
                   c4 = rnorm(n,10)
  )
> # add columns c5..c100 by reference
> for (addcol in paste0('c', 5:100)) set(dt, j = addcol, value = rnorm(n, 1000))

> Colchoice <- c("c1", "c4")

> dt[, .SD
     ][, c('index', 'category', Colchoice), with=FALSE
     ][, melt(.SD, id.vars=c('index', 'category'))
     ][, mean := mean(value), .(category, variable)
     ][, median := median(value), .(category, variable)
     ][, N := .N, .(category, variable)
     ][, value := NULL
     ][, index := NULL
     ][, unique(.SD)
     ][, dcast(.SD, category ~ variable, value.var=c('mean', 'median', 'N'))
     ]

    category mean_c1 mean_c4 median_c1 median_c4 N_c1 N_c4
 1:        a   10000  10.021     10000    10.041 4128 4128
 2:        b   10000  10.012     10000    10.003 3942 3942
 3:        c   10000  10.005     10000     9.999 3926 3926
 4:        d   10000  10.002     10000    10.007 4046 4046
 5:        e   10000   9.974     10000     9.993 4037 4037
 6:        f   10000  10.025     10000    10.015 4009 4009
 7:        g   10000   9.994     10000     9.998 4012 4012
 8:        h   10000  10.007     10000     9.986 3950 3950
...
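
The same long-form idea can probably be written more compactly by computing all statistics in one grouped call instead of three `:=` steps followed by `unique()`. The following is only a minimal sketch under stated assumptions, not the answer's own code: the smaller table, the seed, and the names `long`, `stats` and `res` are hypothetical and chosen for illustration.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
n  <- 1000   # assumption: smaller than the answer's 100000 rows, for speed
dt <- data.table(index    = 1:n,
                 category = sample(letters[1:5], n, replace = TRUE),
                 c1       = rnorm(n, 10000),
                 c4       = rnorm(n, 10))
Colchoice <- c("c1", "c4")

# Reshape to long form: one row per (index, category, variable)
long <- melt(dt, id.vars = c("index", "category"), measure.vars = Colchoice)

# A single grouped call returns one row per (category, variable)
stats <- long[, .(mean = mean(value), median = median(value), N = .N),
              by = .(category, variable)]

# Back to wide form; dcast builds names of the form <statistic>_<variable>
res <- dcast(stats, category ~ variable, value.var = c("mean", "median", "N"))
print(res)
```

Because the grouped call already yields one row per group, the `value := NULL`, `index := NULL` and `unique(.SD)` steps are not needed in this variant.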
