What's the fastest way to merge/join data.frames in R?

走了就别回头了  2020-11-27 08:58

For example (not sure if most representative example though):

N <- 1e6
d1 <- data.frame(x=sample(N,N), y1=rnorm(N))
d2 <- data.frame(x=sample(N,N), y2=rnorm(N))
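
For context, the join itself can be written several ways; a sketch assuming the intended join key is the shared column x (the object names m_base, m_dt and m_dplyr are just placeholders):

library(data.table)
library(dplyr)

# base R
m_base <- merge(d1, d2, by = "x")

# data.table: key both tables on x, then join
dt1 <- data.table(d1, key = "x")
dt2 <- data.table(d2, key = "x")
m_dt <- dt1[dt2, nomatch = 0]   # inner join on the key; dt1[dt2] alone would keep unmatched rows of dt2

# dplyr
m_dplyr <- inner_join(d1, d2, by = "x")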


        
5 Answers

  •  无人及你    2020-11-27 10:01

    I thought it would be interesting to post a benchmark with dplyr in the mix (I had a lot of other things running at the time):

                test replications elapsed relative user.self sys.self
    5          dplyr            1    0.25     1.00      0.25     0.00
    3 data.tableGood            1    0.28     1.12      0.27     0.00
    6          sqldf            1    0.58     2.32      0.57     0.00
    2  data.tableBad            1    1.10     4.40      1.09     0.01
    1      aggregate            1    4.79    19.16      4.73     0.02
    4           plyr            1  186.70   746.80    152.11    30.27
    
    packageVersion("data.table")
    [1] ‘1.8.10’
    packageVersion("plyr")
    [1] ‘1.8’
    packageVersion("sqldf")
    [1] ‘0.4.7’
    packageVersion("dplyr")
    [1] ‘0.1.2’
    R.version.string
    [1] "R version 3.0.2 (2013-09-25)"
    

    Just added:

    dplyr = summarise(dt_dt, avx = mean(x), avy = mean(y))
    

    and set up the data for dplyr with a data table:

    dt <- tbl_dt(d)
    dt_dt <- group_by(dt, g1, g2)
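
    The individual entries behind the labels above aren't spelled out in this answer, so here is a sketch of what each one plausibly ran, reconstructed from the labels and the setup shown (treat the bodies, the group sizes, and the smaller N as assumptions; d is assumed to have columns x, y, g1, g2 like the larger data set further down):

    require(rbenchmark)
    require(data.table)
    require(plyr)    # load plyr before dplyr so dplyr's verbs are not masked
    require(dplyr)
    require(sqldf)

    # assumed grouped data, same shape as the larger data set below but smaller
    N  <- 1e6
    g1 <- sample(1:1000, N, replace = TRUE)
    g2 <- sample(1:1000, N, replace = TRUE)
    d  <- data.frame(x = sample(N, N), y = rnorm(N), g1, g2)
    dt_dt <- group_by(tbl_dt(d), g1, g2)   # tbl_dt() as above (dplyr 0.1.x; it later moved to the dtplyr package)

    benchmark(replications = 1, order = "elapsed",
      aggregate      = aggregate(d[c("x", "y")], d[c("g1", "g2")], mean),
      data.tableBad  = {                      # convert and group without setting a key
        dt <- data.table(d)
        dt[, lapply(.SD, mean), by = "g1,g2"]
      },
      data.tableGood = {                      # set the key so grouping can use it
        dt <- data.table(d, key = c("g1", "g2"))
        dt[, lapply(.SD, mean), by = "g1,g2"]
      },
      plyr  = ddply(d, .(g1, g2), summarise, avx = mean(x), avy = mean(y)),   # by far the slowest in the timings above
      dplyr = summarise(dt_dt, avx = mean(x), avy = mean(y)),
      sqldf = sqldf("select g1, g2, avg(x) as avx, avg(y) as avy from d group by g1, g2")
    )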
    

    Updated: I removed data.tableBad and plyr, and reran with nothing but RStudio open (i7, 16 GB RAM).

    With data.table 1.9 and dplyr on a data frame:

                test replications elapsed relative user.self sys.self
    2 data.tableGood            1    0.02      1.0      0.02     0.00
    3          dplyr            1    0.04      2.0      0.04     0.00
    4          sqldf            1    0.46     23.0      0.46     0.00
    1      aggregate            1    6.11    305.5      6.10     0.02
    

    With data.table 1.9 and dplyr on a data table:

                test replications elapsed relative user.self sys.self
    2 data.tableGood            1    0.02        1      0.02     0.00
    3          dplyr            1    0.02        1      0.02     0.00
    4          sqldf            1    0.44       22      0.43     0.02
    1      aggregate            1    6.14      307      6.10     0.01
    
    packageVersion("data.table")
    [1] '1.9.0'
    packageVersion("dplyr")
    [1] '0.1.2'
    

    For consistency, here is the original benchmark with all methods, using data.table 1.9 and dplyr on a data table:

                test replications elapsed relative user.self sys.self
    5          dplyr            1    0.01        1      0.02     0.00
    3 data.tableGood            1    0.02        2      0.01     0.00
    6          sqldf            1    0.47       47      0.46     0.00
    1      aggregate            1    6.16      616      6.16     0.00
    2  data.tableBad            1   15.45     1545     15.38     0.01
    4           plyr            1  110.23    11023     90.46    19.52
    

    I think this data is too small for the new data.table and dplyr :)

    Larger data set:

    N <- 1e8
    g1 <- sample(1:50000, N, replace = TRUE)
    g2 <- sample(1:50000, N, replace = TRUE)
    d <- data.frame(x=sample(N,N), y=rnorm(N), g1, g2)
    

    It took around 10-13 GB of RAM just to hold the data before running the benchmark.
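
    (To check a figure like that yourself, object.size reports what the data frame alone occupies; the extra RAM during construction comes from intermediate copies.)

    print(object.size(d), units = "GB")   # in-memory size of d alone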

    Results:

                test replications elapsed relative user.self sys.self
    1          dplyr            1   14.88     1.00      6.24     7.52
    2 data.tableGood            1   28.41     1.91     18.55     9.40
    

    I tried 1 billion rows but it blew up the RAM; 32 GB would handle it with no problem.


    [Edit by Arun] (dotcomken, could you please run this code and paste your benchmarking results? Thanks.)

    require(data.table)
    require(dplyr)
    require(rbenchmark)
    
    N <- 1e8
    g1 <- sample(1:50000, N, replace = TRUE)
    g2 <- sample(1:50000, N, replace = TRUE)
    d <- data.frame(x=sample(N,N), y=rnorm(N), g1, g2)
    
    benchmark(replications = 5, order = "elapsed", 
      data.table = {
         dt <- as.data.table(d) 
         dt[, lapply(.SD, mean), by = "g1,g2"]
      }, 
      dplyr_DF = d %.% group_by(g1, g2) %.% summarise(avx = mean(x), avy=mean(y))
    ) 
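
    (`%.%` was the chaining operator in dplyr 0.1.x; with current dplyr the same entry would be written with the `%>%` pipe, e.g.:)

    library(dplyr)   # %>% is re-exported by dplyr since 0.2
    dplyr_DF <- d %>% group_by(g1, g2) %>% summarise(avx = mean(x), avy = mean(y))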
    

    As per Arun's request, here is the output of what you provided me to run:

            test replications elapsed relative user.self sys.self
    1 data.table            5   15.35     1.00     13.77     1.57
    2   dplyr_DF            5  137.84     8.98    136.31     1.44
    

    Sorry for the confusion, late night got to me.

    Using dplyr with a data frame seems to be the less efficient way to process summaries. Are these methods meant to compare the exact functionality of data.table and dplyr, with their data-structure conversion included? I'd almost prefer to separate that step, since most data will need to be cleaned before we group_by or create the data.table. It may be a matter of taste, but I think the most important part is how efficiently the data can be modeled.
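
    One way to make that separation explicit is to do the conversion, keying, and grouping once outside the timing and benchmark only the grouped summary itself; a sketch, assuming the larger d from above is already in memory:

    require(data.table)
    require(dplyr)
    require(rbenchmark)

    dt    <- as.data.table(d)       # one-off conversion, not timed
    setkey(dt, g1, g2)              # one-off keying, not timed
    d_grp <- group_by(d, g1, g2)    # one-off dplyr grouping, not timed

    benchmark(replications = 5, order = "elapsed",
      data.table_grouped = dt[, lapply(.SD, mean), by = .(g1, g2)],
      dplyr_grouped      = summarise(d_grp, avx = mean(x), avy = mean(y))
    )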
