How to parallelize do() calls with dplyr

清酒与你 2021-01-04 00:12

I'm trying to figure out how to run the dplyr::do function in parallel. After reading some of the docs, it seems that dplyr::init_cluster() should be sufficient.

3 Answers
  •  遥遥无期
    2021-01-04 00:15

    As mentioned by @Maciej, you could try multidplyr:

    ## Install from github
    devtools::install_github("hadley/multidplyr")
    

    Use partition() to split your dataset across multiple cores:

    library(dplyr)
    library(multidplyr)
    test <- data_frame(a=1:3, b=letters[c(1:2, 1)])
    test1 <- partition(test, a)
    

    This initializes a 3-core cluster (one core for each value of a):

    # Initialising 3 core cluster.
    

    Then simply perform your do() call:

    test1 %>%
      do({
        dplyr::data_frame(c = rep(max(.$a)), times = max(.$a))
      })
    

    Which gives:

    #Source: party_df [3 x 3]
    #Groups: a
    #Shards: 3 [1--1 rows]
    #
    #      a     c times
    #  (int) (int) (int)
    #1     1     1     1
    #2     2     2     2
    #3     3     3     3
    
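    If you need the results back in your local R session (for example, to write them to disk), a minimal sketch, assuming the hadley/multidplyr version used above: calling collect() on the partitioned result gathers the shards from the workers into an ordinary local data frame.

    ```r
    library(dplyr)
    library(multidplyr)

    # Same setup as above: partition the data across cores by `a`
    test <- data_frame(a = 1:3, b = letters[c(1:2, 1)])
    test1 <- partition(test, a)

    result <- test1 %>%
      do({
        dplyr::data_frame(c = rep(max(.$a)), times = max(.$a))
      }) %>%
      collect()  # gathers the shards back into a local tbl_df

    result
    ```

    Note that newer releases of multidplyr have changed the API (e.g. clusters are created explicitly with new_cluster()), so check the version you have installed.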
