For each row return the column name of the largest value

礼貌的吻别 2020-11-21 07:06

I have a roster of employees, and I need to know which department each one is in most often. It is trivial to tabulate employee ID against department name, but it is trickier

8 answers
  •  轮回少年
    2020-11-21 07:42

    If you're interested in a data.table solution, here's one. It's a bit tricky because, when a row's maximum is tied, you prefer the id for the first maximum; it would be much easier if the last maximum were acceptable. Still, it's not that complicated, and it's fast!
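To make the first-versus-last distinction concrete, here is a minimal sketch on hypothetical toy data (the frame `toy` and its column names are made up for illustration). It uses base R's `max.col`, not the data.table code in this answer, purely to show how ties change the result.

```r
# Hypothetical toy data; the column names A, B, C are illustrative only.
toy <- data.frame(A = c(3, 1, 2), B = c(3, 5, 2), C = c(1, 5, 2))

# max.col() returns, for each row, the position of the largest value.
# ties.method decides which position wins when the maximum is repeated.
first_max <- names(toy)[max.col(toy, ties.method = "first")]
last_max  <- names(toy)[max.col(toy, ties.method = "last")]

print(first_max)  # "A" "B" "A"
print(last_max)   # "B" "C" "C"
```

Every row of `toy` has a tied maximum, so the two vectors differ in every position; on data without ties they would be identical.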

    Here I've generated data of your dimensions (26746 rows by 18 columns).

    Data

    set.seed(45)
    DF <- data.frame(matrix(sample(10, 26746*18, TRUE), ncol=18))
    

    data.table answer:

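    library(data.table)
    # Long format: one record per cell. Despite the names, colid holds the
    # row index of DF and rowid holds the column name.
    DT <- data.table(value=unlist(DF, use.names=FALSE),
                colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
    # Sort by row index, then by value, so each row's maximum comes last.
    setkey(DT, colid, value)
    # Inner join finds each row's maximum; outer join takes the first column holding it.
    t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]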
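The keyed join relies on two things: a long layout with one record per cell, and a stable sort by row index and value. The following is a rough base R sketch of that same idea on hypothetical toy data. It is an analogue written for illustration, not the data.table code itself, and the names `toy`, `long`, `row` and `colname` are assumptions.

```r
# Hypothetical toy data; every row has a tied maximum on purpose.
toy <- data.frame(A = c(3, 1, 2), B = c(3, 5, 2), C = c(1, 5, 2))

# Long format: unlist() walks the frame column by column, so the row index
# cycles 1..nrow while each column name is repeated nrow times.
long <- data.frame(
  value   = unlist(toy, use.names = FALSE),
  row     = rep(seq_len(nrow(toy)), times = ncol(toy)),
  colname = rep(names(toy), each = nrow(toy)),
  stringsAsFactors = FALSE
)

# order() leaves unresolved ties in their original order, so within a row
# equal values stay in column order after sorting by (row, value).
long <- long[order(long$row, long$value), ]

# Last record per row: the maximum, taken from the last column that holds it.
last_max <- long$colname[!duplicated(long$row, fromLast = TRUE)]

# First record per row among those equal to the row maximum.
row_max   <- tapply(long$value, long$row, max)
is_max    <- unname(long$value == row_max[as.character(long$row)])
first_max <- long$colname[is_max][!duplicated(long$row[is_max])]

print(first_max)  # "A" "B" "A"
print(last_max)   # "B" "C" "C"
```

Taking the last record per row needs only one pass over the sorted data, while the first maximum needs the row maximum to be known first. That matches the answer's remark that the last maximum is the easier case.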
    

    Benchmarking:

    # data.table solution
    system.time({
    DT <- data.table(value=unlist(DF, use.names=FALSE), 
                colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
    setkey(DT, colid, value)
    t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
    })
    #   user  system elapsed 
    #  0.174   0.029   0.227 
    
    # apply solution from @thelatemail
    system.time(t2 <- colnames(DF)[apply(DF,1,which.max)])
    #   user  system elapsed 
    #  2.322   0.036   2.602 
    
    identical(t1, t2)
    # [1] TRUE
    

    It's about 11 times faster on data of these dimensions, and data.table scales pretty well too.
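The speed-up figure can be checked directly from the elapsed times reported in the benchmark. This is only a sketch of the arithmetic; the variable names are made up, and since each timing comes from a single `system.time` run the ratio should be read as approximate.

```r
# Elapsed times in seconds, copied from the benchmark output.
elapsed_dt    <- 0.227  # data.table solution
elapsed_apply <- 2.602  # apply() solution

# Ratio of the two elapsed times.
speedup <- elapsed_apply / elapsed_dt
print(round(speedup, 1))
```

The ratio comes out at roughly 11.5, consistent with "about 11 times".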


    Edit: if the id of any one of the tied maxima is okay (this version returns the last one), then:

    DT <- data.table(value=unlist(DF, use.names=FALSE), 
                colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
    setkey(DT, colid, value)
    t1 <- DT[J(unique(colid)), rowid, mult="last"]
    
