How to do row wise operations on .SD columns in data.table

早过忘川 提交于 2020-01-04 05:28:10

问题


Although I've figured this out before, I still find myself searching (and unable to find) this syntax on stackoverflow, so...

I want to do row wise operations on a subset of the data.table's columns, using .SD and .SDcols. I can never remember if the operations need an sapply, lapply, or if the belong inside the brackets of .SD.

As an example, say you have data for 10 students over two quarters. In both quarters they have two exams and a final exam. How would you take a straight average of the columns starting with q1?

Since overly trivial examples are annoying, I'd also like to calculate a weighted average for columns starting with q2? (weights = 25% 25% and 50% for q2)

library(data.table)

set.seed(10)
dt <- data.table(id = paste0("student_", sprintf("%02.f" , 1:10)),
                 q1_exam1 = round(rnorm(10, .78, .05), 2),
                 q1_exam2 = round(rnorm(10, .68, .02), 2),
                 q1_final = round(rnorm(10, .88, .08), 2),
                 q2_exam1 = round(rnorm(10, .78, .05), 2),
                 q2_exam2 = round(rnorm(10, .68, .10), 2),
                 q2_final = round(rnorm(10, .88, .04), 2))

dt
# > dt
#             id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final
#  1: student_01     0.78     0.70     0.83     0.69     0.79     0.86
#  2: student_02     0.77     0.70     0.71     0.78     0.60     0.87
#  3: student_03     0.71     0.68     0.83     0.83     0.60     0.93
#  4: student_04     0.75     0.70     0.71     0.79     0.76     0.97
#  5: student_05     0.79     0.69     0.78     0.71     0.58     0.90
#  6: student_06     0.80     0.68     0.85     0.71     0.68     0.91
#  7: student_07     0.72     0.66     0.82     0.80     0.70     0.84
#  8: student_08     0.76     0.68     0.81     0.69     0.65     0.90
#  9: student_09     0.70     0.70     0.87     0.76     0.61     0.85
# 10: student_10     0.77     0.69     0.86     0.75     0.75     0.89

回答1:


Here are a few thoughts on your options, largely gathered from the comments:

apply along rows

The OP's approach uses apply(.,1,.) for the by-row operation, but this is discouraged because it unnecessarily coerces the data.table into a matrix. lapply/sapply also are not suitable, since they are designed to work on each columns separately, not to combine them.

rowMeans and similarly-named functions also coerce to a matrix.

Split by rows

As @Jaap said, you can use by=1:nrow(dt) for any rowwise operation, but it may be slow.

Efficiently create new columns

This approach taken from eddi is probably the most efficient if you must keep your data in wide format:

jwts = list( 
  q1_AVG  = c(q1_exam1 = 1  , q1_exam2 = 1  , q1_final =   1)/3, 
  q2_WAVG = c(q1_exam1 = 1/4, q2_exam2 = 1/4, q2_final = 1/2)
)


for (newj in names(jwts)){
  w = jwts[[newj]]
  dt[, (newj) := Reduce("+", lapply(names(w), function(x) dt[[x]] * w[x]))]
}

This avoids coercion to a matrix and allows for different weighting rules (unlike rowMeans).

Go long

As @alexis_laz suggested, you might gain clarity and efficiency with a different structure, like

# reshape
m = melt(dt, id.vars="id", value.name="score")[,
  c("quarter","exam") := tstrsplit(variable, "_")][, variable := NULL]

# input your weighting rules
w = unique(m[,c("quarter","exam"), with=FALSE])
w[quarter=="q1"                , wt := 1/.N]
w[quarter=="q2" & exam=="final", wt := .5]
w[quarter=="q2" & exam!="final", wt := (1-.5)/.N]

# merge and compute
m[w, on=c("quarter","exam")][, sum(score*wt), by=.(id,quarter)]

This is what I would do.


In any case, you should have your weighting rules stored somewhere explicitly rather than entered on the fly if you want to scale up the number of quarters.




回答2:


In this case it is possible to use the apply function in base R, but that's not taking advantage of the data.table framework. Also, it doesn't generalize because there are cases which will require more conditional checking.

apply(dt[ , .SD, .SDcols = grep("^q1", colnames(dt))], 1, mean)

# > apply(dt[ , .SD, .SDcols = grep("^q1", colnames(dt))], 1, mean)
#  [1] 0.7700000 0.7266667 0.7400000 0.7200000 0.7533333 0.7766667 0.7333333 0.7500000 0.7566667 0.7733333

In this case, again it's possible to put apply into the j argument of the data.table, and use it on the .SD columns:

dt[i = TRUE,
   q1_AVG := round(apply(.SD, 1, mean), 2), 
   .SDcols = grep("^q1", colnames(dt))]
dt
# > dt
#             id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final q1_AVG
#  1: student_01     0.78     0.70     0.83     0.69     0.79     0.86   0.77
#  2: student_02     0.77     0.70     0.71     0.78     0.60     0.87   0.73
#  3: student_03     0.71     0.68     0.83     0.83     0.60     0.93   0.74
#  4: student_04     0.75     0.70     0.71     0.79     0.76     0.97   0.72
#  5: student_05     0.79     0.69     0.78     0.71     0.58     0.90   0.75
#  6: student_06     0.80     0.68     0.85     0.71     0.68     0.91   0.78
#  7: student_07     0.72     0.66     0.82     0.80     0.70     0.84   0.73
#  8: student_08     0.76     0.68     0.81     0.69     0.65     0.90   0.75
#  9: student_09     0.70     0.70     0.87     0.76     0.61     0.85   0.76
# 10: student_10     0.77     0.69     0.86     0.75     0.75     0.89   0.77

The case with the weighted average can be calculated using matrix multiplication;

dt[i = TRUE,
   q2_WAVG := round(as.matrix(.SD) %*% c(.25, .25, .50), 2), 
   .SDcols = grep("^q2", colnames(dt))]
dt
# > dt
#             id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final q1_AVG q2_WAVG
#  1: student_01     0.78     0.70     0.83     0.69     0.79     0.86   0.77    0.80
#  2: student_02     0.77     0.70     0.71     0.78     0.60     0.87   0.73    0.78
#  3: student_03     0.71     0.68     0.83     0.83     0.60     0.93   0.74    0.82
#  4: student_04     0.75     0.70     0.71     0.79     0.76     0.97   0.72    0.87
#  5: student_05     0.79     0.69     0.78     0.71     0.58     0.90   0.75    0.77
#  6: student_06     0.80     0.68     0.85     0.71     0.68     0.91   0.78    0.80
#  7: student_07     0.72     0.66     0.82     0.80     0.70     0.84   0.73    0.80
#  8: student_08     0.76     0.68     0.81     0.69     0.65     0.90   0.75    0.78
#  9: student_09     0.70     0.70     0.87     0.76     0.61     0.85   0.76    0.77
# 10: student_10     0.77     0.69     0.86     0.75     0.75     0.89   0.77    0.82


来源:https://stackoverflow.com/questions/33353036/how-to-do-row-wise-operations-on-sd-columns-in-data-table

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!