using mean with .SD and .SDcols in data.table

血红的双手。 提交于 2019-12-21 02:52:27

问题


I am writing a very simple function to summarize columns of data.tables. I am passing one column at a time to the function, and then doing some diagnostics to figure out the options for summarization, and then doing the summarization. I am doing this in data.table to allow for some very large datasets.

So, I am using .SDcols to pass in the column to summarize, and using functions on .SD in the j part of a data.table expression. Since I am passing in one column at a time, I am not using lapply. And what I am finding is that some functions work and others do not. Below is a test dataset I am working with and the results I see:

dt <- data.table(
  a=1:10, 
  b=as.factor(letters[1:10]), 
  c=c(TRUE, FALSE), 
  d=runif(10, 0.5, 100), 
  e=c(0,1), 
  f=as.integer(c(0,1)), 
  g=as.numeric(1:10), 
  h=c("cat1", "cat2", "cat3", "cat4", "cat5"))

mean(dt$a)
[1] 5.5

dt[, mean(.SD), .SDcols = "a"]

[1] NA
Warning message:
In mean.default(.SD) : argument is not numeric or logical: returning NA

dt[, sum(.SD), .SDcols = "a"]
[1] 55

dt[, max(.SD), .SDcols = "a"]
[1] 10

dt[, colMeans(.SD), .SDcols = "a"]
  a 
5.5 

dt[, lapply(.SD, mean), .SDcols = "a"]
     a
1: 5.5

Interestingly, weighted.mean gives the wrong answer (55, the sum) when I use weighted.mean(.SD) in j. But when I use lapply(.SD, weighted.mean) in j, it gives the right answer (5.5, the mean).

I tried turning off data.table optimizations to see if it was the internal data.table mean function, but that didn't change things.

Maybe this is just a problem with using mean() on a list (which seems to be what .SD returns)? I guess there is never a reason to NOT use the lapply paradigm with .SD? It seems that only the lapply option returns a data.table. The others seem to return vectors, except for colMeans which is returning something else (list?).

My main question is why mean(.SD) does not work. And the corollary is whether .SD can be used in the absence of one of the apply functions.

Thanks.

来源:https://stackoverflow.com/questions/29568732/using-mean-with-sd-and-sdcols-in-data-table

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!