How to adjust a data table manipulation so that, besides sum
per category of several colums,
it would also calculate other functions at the same time such as <
If the summary statistics you need to compute are things like mean
, .N
, and (perhaps) median
, which data.table
optimizes into c code across the by, you may have faster performance if you convert the table into long form so that you can do the computations in a way that data table can optimize them:
> library(data.table)
> n = 100000
> dt = data.table(index=1:100000,
category = sample(letters[1:25], n, replace = T),
c1=rnorm(n,10000),
c2=rnorm(n,1000),
c3=rnorm(n,100),
c4 = rnorm(n,10)
)
> {lapply(c(paste('c', 5:100, sep ='')), function(addcol) dt[[addcol]] <<- rnorm(n,1000) ); dt}
> Colchoice <- c("c1", "c4")
> dt[, .SD
][, c('index', 'category', Colchoice), with=F
][, melt(.SD, id.vars=c('index', 'category'))
][, mean := mean(value), .(category, variable)
][, median := median(value), .(category, variable)
][, N := .N, .(category, variable)
][, value := NULL
][, index := NULL
][, unique(.SD)
][, dcast(.SD, category ~ variable, value.var=c('mean', 'median', 'N')
]
category mean_c1 mean_c4 median_c1 median_c4 N_c1 N_c4
1: a 10000 10.021 10000 10.041 4128 4128
2: b 10000 10.012 10000 10.003 3942 3942
3: c 10000 10.005 10000 9.999 3926 3926
4: d 10000 10.002 10000 10.007 4046 4046
5: e 10000 9.974 10000 9.993 4037 4037
6: f 10000 10.025 10000 10.015 4009 4009
7: g 10000 9.994 10000 9.998 4012 4012
8: h 10000 10.007 10000 9.986 3950 3950
...