Split data.table into roughly equal parts

抹茶落季 2021-01-20 13:59

To parallelize a task, I need to split a big data.table into roughly equal parts, keeping together groups defined by a column, id. Suppose:

N

4 Answers
  •  温柔的废话
    2021-01-20 14:13

    Preliminary comment

    I recommend reading what the main author of data.table has to say about parallelization with it.

    I don't know how familiar you are with data.table, but you may have overlooked its by argument...? Quoting @eddi's comment from below...

    Instead of literally splitting up the data - create a new "parallel.id" column, and then call

    dt[, parallel_operation(.SD), by = parallel.id] 
    

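As a hedged sketch of that `by`-based approach: `parallel_operation` is a placeholder name from the quote above, and the round-robin assignment of id groups via `.GRP` (data.table's per-group counter) is my own illustrative choice, not something the answer prescribes.

```r
library(data.table)

# hypothetical example data: 10 ids with group sizes 1..10 (55 rows total)
dt <- data.table(id = rep(1:10, times = 1:10), x = 1)
M  <- 3  # desired number of parallel chunks

# assign each id group to a chunk round-robin, using .GRP (the group counter)
dt[, parallel.id := .GRP %% M, by = id]

# placeholder for the real per-chunk work: here, just count rows
parallel_operation <- function(sd) sd[, .(rows = .N)]

dt[, parallel_operation(.SD), by = parallel.id]
```

Note this keeps each id's rows together, but does not balance chunk sizes; the size-balancing interleave below addresses that.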
    Answer, assuming you don't want to use by

    Sort the IDs by size:

    ids   <- names(sort(table(dt$id)))
    n     <- length(ids)
    

    Rearrange so that we alternate between big and small IDs, following Arun's interleaving trick:

    alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
    
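To see what the interleaving index does, here is a small base-R demo on toy ids (assumed already sorted ascending by group size): the result alternates smallest, largest, second smallest, second largest, and so on.

```r
# toy ids, assumed already sorted ascending by group size
ids <- c("a", "b", "c", "d")
n   <- length(ids)

# concatenate the forward and reversed copies, then interleave them
alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
alt_ids
# [1] "a" "d" "b" "c"   (smallest, largest, 2nd smallest, 2nd largest)
```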

    Split the IDs in order, with roughly the same number of IDs in each group (like zero323's answer):

    gs  <- split(alt_ids, ceiling(seq(n) / (n/M)))
    
    res <- vector("list", M)
    setkey(dt, id)
    for (m in 1:M) res[[m]] <- dt[J(gs[[m]])] 
    # if using a data.frame, replace the last two lines with
    # for (m in 1:M) res[[m]] <- dt[id %in% gs[[m]],] 
    
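The `ceiling(seq(n) / (n/M))` expression just assigns contiguous, roughly equal-sized bucket labels; a quick base-R demo with hypothetical values n = 7, M = 3:

```r
# bucket 7 items into M = 3 contiguous, roughly equal groups
n  <- 7
M  <- 3
gs <- split(letters[1:n], ceiling(seq(n) / (n / M)))
lengths(gs)
# 1 2 3
# 2 2 3
```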

    Check that the sizes aren't too bad:

    # using the OP's example data...
    
    sapply(res, nrow)
    # [1] 7 9              for M = 2
    # [1] 5 5 6            for M = 3
    # [1] 1 6 3 6          for M = 4
    # [1] 1 4 2 3 6        for M = 5
    

    Although I emphasized data.table at the top, this should work fine with a data.frame, too.
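Since the question's example data is not shown above, here is a hedged end-to-end sketch of the same steps on hypothetical data.frame input (toy group sizes and M = 3 are my own choices):

```r
# hypothetical data: 6 ids whose group sizes range from 1 to 6 rows
df <- data.frame(id = rep(letters[1:6], times = 1:6))
M  <- 3

ids     <- names(sort(table(df$id)))                    # ids sorted by group size
n       <- length(ids)
alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]    # interleave small/big
gs      <- split(alt_ids, ceiling(seq(n) / (n / M)))    # M contiguous buckets

res <- vector("list", M)
for (m in 1:M) res[[m]] <- df[df$id %in% gs[[m]], , drop = FALSE]

sapply(res, nrow)
# [1] 7 7 7
```

With these toy sizes the interleave pairs the 1-row id with the 6-row id, and so on, so the chunks come out exactly equal; on real data they will only be roughly equal.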
