问题
Although I've figured this out before, I still find myself searching (and unable to find) this syntax on stackoverflow, so...
I want to do row wise operations on a subset of the data.table's columns, using .SD
and .SDcols
. I can never remember if the operations need an sapply
, lapply
, or if the belong inside the brackets of .SD
.
As an example, say you have data for 10 students over two quarters. In both quarters they have two exams and a final exam. How would you take a straight average of the columns starting with q1?
Since overly trivial examples are annoying, I'd also like to calculate a weighted average for columns starting with q2? (weights = 25% 25% and 50% for q2)
library(data.table)
set.seed(10)
dt <- data.table(id = paste0("student_", sprintf("%02.f" , 1:10)),
q1_exam1 = round(rnorm(10, .78, .05), 2),
q1_exam2 = round(rnorm(10, .68, .02), 2),
q1_final = round(rnorm(10, .88, .08), 2),
q2_exam1 = round(rnorm(10, .78, .05), 2),
q2_exam2 = round(rnorm(10, .68, .10), 2),
q2_final = round(rnorm(10, .88, .04), 2))
dt
# > dt
# id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final
# 1: student_01 0.78 0.70 0.83 0.69 0.79 0.86
# 2: student_02 0.77 0.70 0.71 0.78 0.60 0.87
# 3: student_03 0.71 0.68 0.83 0.83 0.60 0.93
# 4: student_04 0.75 0.70 0.71 0.79 0.76 0.97
# 5: student_05 0.79 0.69 0.78 0.71 0.58 0.90
# 6: student_06 0.80 0.68 0.85 0.71 0.68 0.91
# 7: student_07 0.72 0.66 0.82 0.80 0.70 0.84
# 8: student_08 0.76 0.68 0.81 0.69 0.65 0.90
# 9: student_09 0.70 0.70 0.87 0.76 0.61 0.85
# 10: student_10 0.77 0.69 0.86 0.75 0.75 0.89
回答1:
Here are a few thoughts on your options, largely gathered from the comments:
apply
along rows
The OP's approach uses apply(.,1,.)
for the by-row operation, but this is discouraged because it unnecessarily coerces the data.table into a matrix. lapply
/sapply
also are not suitable, since they are designed to work on each columns separately, not to combine them.
rowMeans
and similarly-named functions also coerce to a matrix.
Split by rows
As @Jaap said, you can use by=1:nrow(dt)
for any rowwise operation, but it may be slow.
Efficiently create new columns
This approach taken from eddi is probably the most efficient if you must keep your data in wide format:
jwts = list(
q1_AVG = c(q1_exam1 = 1 , q1_exam2 = 1 , q1_final = 1)/3,
q2_WAVG = c(q1_exam1 = 1/4, q2_exam2 = 1/4, q2_final = 1/2)
)
for (newj in names(jwts)){
w = jwts[[newj]]
dt[, (newj) := Reduce("+", lapply(names(w), function(x) dt[[x]] * w[x]))]
}
This avoids coercion to a matrix and allows for different weighting rules (unlike rowMeans
).
Go long
As @alexis_laz suggested, you might gain clarity and efficiency with a different structure, like
# reshape
m = melt(dt, id.vars="id", value.name="score")[,
c("quarter","exam") := tstrsplit(variable, "_")][, variable := NULL]
# input your weighting rules
w = unique(m[,c("quarter","exam"), with=FALSE])
w[quarter=="q1" , wt := 1/.N]
w[quarter=="q2" & exam=="final", wt := .5]
w[quarter=="q2" & exam!="final", wt := (1-.5)/.N]
# merge and compute
m[w, on=c("quarter","exam")][, sum(score*wt), by=.(id,quarter)]
This is what I would do.
In any case, you should have your weighting rules stored somewhere explicitly rather than entered on the fly if you want to scale up the number of quarters.
回答2:
In this case it is possible to use the apply
function in base R, but that's not taking advantage of the data.table
framework. Also, it doesn't generalize because there are cases which will require more conditional checking.
apply(dt[ , .SD, .SDcols = grep("^q1", colnames(dt))], 1, mean)
# > apply(dt[ , .SD, .SDcols = grep("^q1", colnames(dt))], 1, mean)
# [1] 0.7700000 0.7266667 0.7400000 0.7200000 0.7533333 0.7766667 0.7333333 0.7500000 0.7566667 0.7733333
In this case, again it's possible to put apply into the j
argument of the data.table, and use it on the .SD
columns:
dt[i = TRUE,
q1_AVG := round(apply(.SD, 1, mean), 2),
.SDcols = grep("^q1", colnames(dt))]
dt
# > dt
# id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final q1_AVG
# 1: student_01 0.78 0.70 0.83 0.69 0.79 0.86 0.77
# 2: student_02 0.77 0.70 0.71 0.78 0.60 0.87 0.73
# 3: student_03 0.71 0.68 0.83 0.83 0.60 0.93 0.74
# 4: student_04 0.75 0.70 0.71 0.79 0.76 0.97 0.72
# 5: student_05 0.79 0.69 0.78 0.71 0.58 0.90 0.75
# 6: student_06 0.80 0.68 0.85 0.71 0.68 0.91 0.78
# 7: student_07 0.72 0.66 0.82 0.80 0.70 0.84 0.73
# 8: student_08 0.76 0.68 0.81 0.69 0.65 0.90 0.75
# 9: student_09 0.70 0.70 0.87 0.76 0.61 0.85 0.76
# 10: student_10 0.77 0.69 0.86 0.75 0.75 0.89 0.77
The case with the weighted average can be calculated using matrix multiplication;
dt[i = TRUE,
q2_WAVG := round(as.matrix(.SD) %*% c(.25, .25, .50), 2),
.SDcols = grep("^q2", colnames(dt))]
dt
# > dt
# id q1_exam1 q1_exam2 q1_final q2_exam1 q2_exam2 q2_final q1_AVG q2_WAVG
# 1: student_01 0.78 0.70 0.83 0.69 0.79 0.86 0.77 0.80
# 2: student_02 0.77 0.70 0.71 0.78 0.60 0.87 0.73 0.78
# 3: student_03 0.71 0.68 0.83 0.83 0.60 0.93 0.74 0.82
# 4: student_04 0.75 0.70 0.71 0.79 0.76 0.97 0.72 0.87
# 5: student_05 0.79 0.69 0.78 0.71 0.58 0.90 0.75 0.77
# 6: student_06 0.80 0.68 0.85 0.71 0.68 0.91 0.78 0.80
# 7: student_07 0.72 0.66 0.82 0.80 0.70 0.84 0.73 0.80
# 8: student_08 0.76 0.68 0.81 0.69 0.65 0.90 0.75 0.78
# 9: student_09 0.70 0.70 0.87 0.76 0.61 0.85 0.76 0.77
# 10: student_10 0.77 0.69 0.86 0.75 0.75 0.89 0.77 0.82
来源:https://stackoverflow.com/questions/33353036/how-to-do-row-wise-operations-on-sd-columns-in-data-table