I ended up with a big data.table and I have to do operations per row. (Yes, I know that this is clearly not what data.table is for.)
R) set.seed(1)
R) DT=d
Creating .SD for each row can be a very costly operation, especially if your data.table has many more rows than columns (rows >> columns). I'd advise using pmin and pmax across columns with a loop. I'll illustrate this with a bigger data set (more rows).
set.seed(1)
require(data.table)
DT1 <- data.table(matrix(rnorm(1e6),ncol=10))
DT1[, a := 1:1e5]
DT2 <- copy(DT1)
DT3 <- copy(DT1)
arun <- function(DT) {
  # assign first column (dummy)
  DT[, `:=`(min = DT[, V1], max = DT[, V1])]
  # get all other column names and use pmin and pmax
  # and replace min and max columns
  cols <- names(DT)[2:10]
  for (i in cols) {
    DT[, `:=`(min = pmin(min, DT[[i]]), max = pmax(max, DT[[i]]))]
  }
  DT
}
eddi <- function(DT) {
  DT[, `:=`(min = min(.SD), max = max(.SD)), by = a, .SDcols = paste0("V", 1:10)]
}
frank <- function(DT) {
  cols <- names(DT)[grepl('^V[[:digit:]]+$', names(DT))]
  newcols <- c("min", "max")
  myfun <- range
  DT[, (newcols) := as.list(myfun(.SD)), .SDcols = cols, by = 1:nrow(DT)]
}
require(microbenchmark)
microbenchmark(o1 <- arun(DT1), o2 <- eddi(DT2), o3 <- frank(DT3), times=2)
Unit: milliseconds
             expr        min         lq     median          uq         max neval
  o1 <- arun(DT1)   204.4417   204.4417   250.5205    296.5992    296.5992     2
  o2 <- eddi(DT2) 92343.5321 92343.5321 96706.1622 101068.7923 101068.7923     2
 o3 <- frank(DT3) 49083.7000 49083.7000 49521.9296  49960.1592  49960.1592     2
identical(o1, o2) # TRUE
identical(o1, o3) # TRUE
--
As @Frank points out in the comments, you could replace the for-loop with do.call as:
DT[, c("min", "max") := { z <- DT[, 1:10, with = FALSE];
list(do.call(pmin, z), do.call(pmax, z))}]
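To see that the do.call form gives the same result as folding pmin and pmax over the columns in a loop, here is a minimal sketch on a tiny table. The table `small` and its values are made up for illustration; they are not from the benchmark above.

```r
library(data.table)

# Hypothetical 3-row table, values chosen only for illustration
small <- data.table(V1 = c(3, 8, 5), V2 = c(7, 2, 9), V3 = c(4, 6, 1))

# Loop version: fold pmin/pmax over the columns one at a time
lo <- small[[1]]
hi <- small[[1]]
for (col in names(small)[-1]) {
  lo <- pmin(lo, small[[col]])
  hi <- pmax(hi, small[[col]])
}

# do.call version: every column is passed to pmin/pmax as a separate argument
lo2 <- do.call(pmin, as.list(small))
hi2 <- do.call(pmax, as.list(small))

stopifnot(identical(lo, lo2), identical(hi, hi2))
lo2  # 3 2 1
hi2  # 7 8 9
```

Both versions stay vectorised over rows, which is why neither needs a per-row .SD.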
This spells out the steps in case you want to use a different function:
cols <- names(DT)[grepl('^V[[:digit:]]+$',names(DT))]
newcols <- c("min","max")
myfun <- range
DT[,(newcols):=as.list(myfun(.SD)),.SDcols=cols,by=1:nrow(DT)]
Am I missing something? Doesn't this give the min across each row?
set.seed(1)
DT=data.table(matrix(rnorm(100),nrow=10))
DT[,c('a','b'):=list(1:10,2:11)]
DT
cols<-c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10")
Method 1
DT[,Min_Vi:=do.call(pmin, c(.SD, na.rm=TRUE)), .SDcols=cols]
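Method 2 (get() takes a single name, so pass all the columns to pmin through do.call)
transform(DT, Min_Vi = do.call(pmin, c(as.list(DT)[cols], na.rm = TRUE)))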
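As a small illustration of what the na.rm = TRUE element does when it is appended to the argument list for do.call, here is a base R sketch. The vectors `x` and `y` are hypothetical stand-ins for two columns.

```r
# Hypothetical columns with a missing value in the second position
x <- c(1.5, NA, 3.0)
y <- c(2.0, 0.5, 1.0)

pmin(x, y)                # 1.5  NA 1.0
pmin(x, y, na.rm = TRUE)  # 1.5 0.5 1.0

# Same call built with do.call: the named element becomes the na.rm argument
cols_list <- list(x, y)
res <- do.call(pmin, c(cols_list, na.rm = TRUE))
stopifnot(identical(res, c(1.5, 0.5, 1.0)))
```

Without na.rm = TRUE, any row containing an NA gets NA as its minimum.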
Since you already have the row numbers as a column in your data.table*, you could just do:
DT[, `:=`(a1 = max(.SD), a2 = min(.SD)), by = a, .SDcols = paste0("V", 1:10)]
or
setkey(DT, a)
DT[J(a), `:=`(a1 = max(.SD), a2 = min(.SD)), .SDcols = paste0("V", 1:10)]
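The second option uses the silent by-without-by. Note that this implicit grouping applies to data.table versions before 1.9.4; later versions need an explicit by = .EACHI to group by each row of i.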
*Of course you could also just use row.names or 1:nrow(DT).
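For a table that has no row-number column, grouping by an ad hoc row index might look like the sketch below. The table `tiny` is hypothetical, and seq_len(nrow(tiny)) is used here as a stand-in for 1:nrow(DT).

```r
library(data.table)

# Hypothetical table with no row-number column
tiny <- data.table(V1 = c(3, 8), V2 = c(7, 2))

# Group by an ad hoc row index, so each group is a single row
tiny[, `:=`(a1 = max(.SD), a2 = min(.SD)),
     by = seq_len(nrow(tiny)), .SDcols = c("V1", "V2")]

stopifnot(identical(tiny$a1, c(7, 8)), identical(tiny$a2, c(3, 2)))
```

This still builds .SD once per row, so it carries the same per-row cost as grouping by the a column.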