row-by-row operations and updates in data.table

没有蜡笔的小新 2020-12-09 10:39

I ended up with a big data.table and I have to do operations per row (yes, I know this is clearly not what data.table is for).

R) set.seed(1)
R) DT=d         


        
4 Answers
  • 2020-12-09 11:20

    Creating .SD for each row can be very costly, especially if your data.table has many more rows than columns. I'd advise using pmin and pmax across the columns with a loop instead. I'll illustrate this with a bigger data set (more rows).

    Data:

    set.seed(1)
    require(data.table)
    DT1 <- data.table(matrix(rnorm(1e6),ncol=10))
    DT1[, a := 1:1e5]
    DT2 <- copy(DT1)
    DT3 <- copy(DT1)
    

    Functions:

    arun <- function(DT) {
        # assign first column (dummy)
        DT[, `:=`(min = DT[, V1], max = DT[, V1])]
        # get all other column names and use pmin and pmax 
        # and replace min and max columns
        cols <- names(DT)[2:10]
        for (i in cols) {
            DT[, `:=`(min = pmin(min, DT[[i]]), max = pmax(max, DT[[i]]))]
        }
        DT
    }
    
    eddi <- function(DT) {
        DT[, `:=`(min = min(.SD), max = max(.SD)), by = a, .SDcols = paste0("V", 1:10)]
    }
    
    frank <- function(DT) {
        cols    <- names(DT)[grepl('^V[[:digit:]]+$',names(DT))]
        newcols <- c("min","max")
        myfun   <- range
        DT[,(newcols):=as.list(myfun(.SD)),.SDcols=cols,by=1:nrow(DT)]
    }
    

    Benchmarking:

    require(microbenchmark)
    microbenchmark(o1 <- arun(DT1), o2 <- eddi(DT2), o3 <- frank(DT3), times=2)
    
    Unit: milliseconds
                 expr        min         lq     median          uq         max neval
      o1 <- arun(DT1)   204.4417   204.4417   250.5205    296.5992    296.5992     2
      o2 <- eddi(DT2) 92343.5321 92343.5321 96706.1622 101068.7923 101068.7923     2
     o3 <- frank(DT3) 49083.7000 49083.7000 49521.9296  49960.1592  49960.1592     2
    
    identical(o1, o2) # TRUE
    identical(o1, o3) # TRUE
    

    --

    As @Frank points out in the comments, you could replace the for-loop with do.call:

    DT[, c("min", "max") := { z <- DT[, 1:10, with = FALSE];
                 list(do.call(pmin, z), do.call(pmax, z))}]
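
    To check that the vectorised do.call form gives the same answer as a plain row-by-row computation, here is a minimal sketch on a small table. The 5 x 4 table and the name vcols are assumptions made for illustration, and .SD with .SDcols is used in place of the explicit column subset.

    ```r
    library(data.table)

    set.seed(1)
    # Small hypothetical table: 5 rows, value columns V1..V4
    DT <- data.table(matrix(rnorm(20), ncol = 4))
    vcols <- paste0("V", 1:4)

    # pmin/pmax receive every column as a separate argument and work element-wise,
    # so each result element is the min/max of one row
    DT[, c("min", "max") := list(do.call(pmin, .SD), do.call(pmax, .SD)), .SDcols = vcols]

    # Reference result: apply() walks the rows of a matrix one at a time
    ref <- apply(as.matrix(DT[, vcols, with = FALSE]), 1, range)

    isTRUE(all.equal(DT$min, ref[1, ]))
    isTRUE(all.equal(DT$max, ref[2, ]))
    ```

    Both comparisons should come out TRUE; the difference between the two approaches is speed, not the result.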
    
  • 2020-12-09 11:41

    This spells out the steps in case you want to use a different function:

    cols    <- names(DT)[grepl('^V[[:digit:]]+$',names(DT))]
    newcols <- c("min","max")
    myfun   <- range
    DT[,(newcols):=as.list(myfun(.SD)),.SDcols=cols,by=1:nrow(DT)]
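
    As an illustration of swapping in a different function, the sketch below computes a per-row mean and standard deviation. The small table, the output names row_mean and row_sd, and the custom myfun are all assumptions made for this example. Unlike range, a custom function usually needs .SD flattened to a plain vector first.

    ```r
    library(data.table)

    set.seed(1)
    # Small hypothetical table: 5 rows, value columns V1..V4
    DT <- data.table(matrix(rnorm(20), ncol = 4))

    cols    <- grep("^V[[:digit:]]+$", names(DT), value = TRUE)
    newcols <- c("row_mean", "row_sd")   # hypothetical output column names

    # Within each group .SD is a one-row data.table, so unlist() it to a
    # numeric vector before summarising; return one value per new column
    myfun <- function(x) {
        v <- unlist(x)
        c(mean(v), sd(v))
    }

    DT[, (newcols) := as.list(myfun(.SD)), .SDcols = cols, by = 1:nrow(DT)]
    ```

    The function must return as many values as there are names in newcols, in the same order.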
    
  • 2020-12-09 11:42

    Am I missing something? Doesn't this give the min across each row?

    set.seed(1)
    DT=data.table(matrix(rnorm(100),nrow=10))
    DT[,c('a','b'):=list(1:10,2:11)]
    DT
    cols<-c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10")
    

    Method 1

    DT[,Min_Vi:=do.call(pmin, c(.SD, na.rm=TRUE)), .SDcols=cols]
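
    The effect of na.rm = TRUE in Method 1 can be seen on a tiny table with missing values. This is only a sketch; the three-row table and the column name Min_strict are made up for the example.

    ```r
    library(data.table)

    # Hypothetical 3-row table with missing values
    DT <- data.table(V1 = c(1, NA, 3), V2 = c(4, 5, NA), V3 = c(NA, 0, NA))
    cols <- c("V1", "V2", "V3")

    # Without na.rm, a single NA in a row makes that row's minimum NA
    DT[, Min_strict := do.call(pmin, .SD), .SDcols = cols]

    # With na.rm = TRUE, NAs are skipped unless the whole row is NA
    DT[, Min_Vi := do.call(pmin, c(.SD, na.rm = TRUE)), .SDcols = cols]

    DT$Min_strict  # NA NA NA
    DT$Min_Vi      # 1 0 3
    ```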
    

    Method 2

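    # get() looks up a single name, not all of cols, so hand every column to pmin explicitly
    transform(DT, Min_Vi = do.call(pmin, DT[, cols, with = FALSE]))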
    
  • 2020-12-09 11:46

    Since you already have the row numbers as a column in your data.table*, you could just do:

    DT[, `:=`(a1 = max(.SD), a2 = min(.SD)), by = a, .SDcols = paste0("V", 1:10)]
    

    or

    setkey(DT, a)
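    DT[J(a), `:=`(a1 = max(.SD), a2 = min(.SD)), by = .EACHI, .SDcols = paste0("V", 1:10)]
    

    The second option groups by each row of the join (by-without-by). This grouping was implicit in older versions of data.table; since version 1.9.4 it has to be requested with by = .EACHI.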

    *of course you could also just use row.names or 1:nrow(DT)
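
    If the table has no row-number column, the footnote's alternative might look like the sketch below. The small table and the name vcols are assumptions for illustration; seq_len(nrow(DT)) plays the role of 1:nrow(DT).

    ```r
    library(data.table)

    set.seed(1)
    # Hypothetical table without a row-id column: 5 rows, value columns V1..V4
    DT <- data.table(matrix(rnorm(20), ncol = 4))
    vcols <- paste0("V", 1:4)

    # Group on an ad hoc row index so that every group holds exactly one row
    DT[, `:=`(a1 = max(.SD), a2 = min(.SD)), by = seq_len(nrow(DT)), .SDcols = vcols]
    ```

    This still builds .SD once per row, so it shares the cost of the by = a version on large tables.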
