Why is running “unique” faster on a data frame than a matrix in R?

旧巷少年郎 2020-12-29 23:53

I've begun to believe that data frames hold no advantages over matrices, except for notational convenience. However, I noticed this oddity when running unique

3 Answers
  • 2020-12-30 00:23
    1. I'm not sure, but my guess is that because a matrix is one contiguous vector, R first copies it into column vectors (as in a data.frame), since paste needs a list of vectors. Note that both are slow because both use paste.

    2. Perhaps because unique.data.table is already many times faster. Please upgrade to v1.6.7 by downloading it from the R-Forge repository, because that version has the fix to unique that you raised in this question. data.table doesn't use paste to do unique.

    a = matrix(sample(2,10^6,replace = TRUE), ncol = 10)
    b = as.data.frame(a)
    system.time(u1<-unique(a))
       user  system elapsed 
       2.98    0.00    2.99 
    system.time(u2<-unique(b))
       user  system elapsed 
       0.99    0.00    0.99 
    library(data.table)  # needed for as.data.table() and unique.data.table
    c = as.data.table(b)
    system.time(u3<-unique(c))
       user  system elapsed 
       0.03    0.02    0.05  # 60 times faster than u1, 20 times faster than u2
    identical(as.data.table(u2),u3)
    [1] TRUE
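
The guess in point 1 rests on how the two object types are stored. A minimal sketch of that difference, using tiny stand-ins (`m`, `df`, and `row_keys` are illustrative names, not from the original post): a matrix is a single atomic vector with a `dim` attribute, while a data frame is a list of column vectors that can be passed to `paste` in one call.

```r
# Tiny stand-ins for the matrix `a` and data frame `b` above.
m  <- matrix(1:6, ncol = 3)
df <- as.data.frame(m)

is.list(m)    # a matrix is an atomic vector with a dim attribute
is.list(df)   # a data frame is a list of column vectors
length(m)     # counts cells
length(df)    # counts columns

# paste() is vectorised over its arguments, so a list of columns can be
# handed to it in a single call, giving one string per row.
row_keys <- do.call(paste, c(df, sep = "\r"))
row_keys
```

Whether `unique` takes exactly this route internally depends on the R version; the sketch only shows why a list of columns is a convenient input for `paste`.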
    
  • 2020-12-30 00:27
    1. In this implementation, unique.matrix is the same as unique.array

      > identical(unique.array, unique.matrix)

      [1] TRUE

    2. unique.array has to handle multi-dimensional arrays which requires additional processing to ‘collapse’ the extra dimensions (those extra calls to paste()) which are not needed in the 2-dimensional case. The key section of code is:

      collapse <- (ndim > 1L) && (prod(dx[-MARGIN]) > 1L)

      temp <- if (collapse) apply(x, MARGIN, function(x) paste(x, collapse = "\r"))

    3. unique.data.frame is optimised for the 2D case, unique.matrix is not. It could be, as you suggest, it just isn't in the current implementation.
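
To make the `collapse` condition concrete, here is a small sketch that wraps the quoted expression in a helper and applies it to a few shapes. The names `collapse_needed` and `row_keys` are hypothetical and exist only for this illustration; they are not part of base R.

```r
# Hypothetical helper: the collapse test quoted above, as a function.
collapse_needed <- function(x, MARGIN = 1L) {
  dx <- dim(x)
  ndim <- length(dx)
  (ndim > 1L) && (prod(dx[-MARGIN]) > 1L)
}

# Hypothetical helper: one pasted key per slice along MARGIN.
row_keys <- function(x, MARGIN = 1L) {
  apply(x, MARGIN, function(v) paste(v, collapse = "\r"))
}

m2 <- matrix(1:6, nrow = 2)           # 2 x 3 matrix
m1 <- matrix(1:2, nrow = 2)           # 2 x 1 matrix
a3 <- array(1:8, dim = c(2, 2, 2))    # 3-dimensional array

collapse_needed(m2)   # more than one column, so rows must be collapsed
collapse_needed(m1)   # single column, nothing to collapse
collapse_needed(a3)   # each slice a3[i, , ] holds four values

row_keys(m2)          # one string per row
row_keys(a3)          # one string per 2 x 2 slice
```

The same `apply` plus `paste` route serves both the 2-D and the 3-D case, which is why the matrix pays for generality it does not need.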

    Note that in all cases (unique.{array,matrix,data.frame}) where there is more than one dimension, it is the string representation that is compared for uniqueness. For floating point numbers this means 15 significant digits, so

    NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 2), nrow = 2)))

    is 1 while

    NROW(unique(a <- matrix(rep(c(1, 1+5e-15), 2), nrow = 2)))

    and

    NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 1), nrow = 2)))

    are both 2. Are you sure unique is what you want?
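
A rough way to see the 15-digit effect in isolation is to format the two values directly. This sketch uses `sprintf` with `%.15g` as a stand-in for the conversion to character; that substitution is an assumption made for illustration, and the exact conversion used by `paste` may differ between R versions.

```r
x_small <- 1 + 4e-15
x_large <- 1 + 5e-15

# As doubles, neither value equals 1.
x_small == 1
x_large == 1

# At 15 significant digits the smaller offset is rounded away,
# while the larger one survives.
sprintf("%.15g", x_small)
sprintf("%.15g", x_large)

# At 17 significant digits both offsets are visible.
sprintf("%.17g", x_small)
sprintf("%.17g", x_large)
```

Two rows that differ only by the smaller offset therefore produce the same string key, even though the underlying numbers are different.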

  • 2020-12-30 00:27

    To try to answer my own question, especially part 1, I looked at where the time is spent using the results of Rprof. I ran this again, with 5M elements.
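
For reference, a minimal sketch of how such a profile might be collected. The wrapper `profile_unique` is a hypothetical name introduced here, and the matrix shape in the commented usage (10 columns) is an assumption, since the original call is not shown.

```r
# Hypothetical wrapper: profile a single unique() call into out_file.
profile_unique <- function(x, out_file) {
  Rprof(out_file)                    # start the sampling profiler
  on.exit(Rprof(NULL), add = TRUE)   # always stop it again
  unique(x)
  invisible(out_file)
}

# Usage (commented out because it takes a while on 5M elements):
# a <- matrix(sample(2, 5e6, replace = TRUE), ncol = 10)  # ncol is assumed
# summaryRprof(profile_unique(a, "u1.txt"))
# summaryRprof(profile_unique(as.data.frame(a), "u2.txt"))
```

Very short runs may finish before the profiler takes a sample, so the input needs to be large enough for `summaryRprof` to have something to report.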

    Here are the results for the first unique operation (for the matrix):

    > summaryRprof("u1.txt")
    $by.self
                         self.time self.pct total.time total.pct
    "paste"                   5.70    52.58       5.96     54.98
    "apply"                   2.70    24.91      10.68     98.52
    "FUN"                     0.86     7.93       6.82     62.92
    "lapply"                  0.82     7.56       1.00      9.23
    "list"                    0.30     2.77       0.30      2.77
    "!"                       0.14     1.29       0.14      1.29
    "c"                       0.10     0.92       0.10      0.92
    "unlist"                  0.08     0.74       1.08      9.96
    "aperm.default"           0.06     0.55       0.06      0.55
    "is.null"                 0.06     0.55       0.06      0.55
    "duplicated.default"      0.02     0.18       0.02      0.18
    
    $by.total
                         total.time total.pct self.time self.pct
    "unique"                  10.84    100.00      0.00     0.00
    "unique.matrix"           10.84    100.00      0.00     0.00
    "apply"                   10.68     98.52      2.70    24.91
    "FUN"                      6.82     62.92      0.86     7.93
    "paste"                    5.96     54.98      5.70    52.58
    "unlist"                   1.08      9.96      0.08     0.74
    "lapply"                   1.00      9.23      0.82     7.56
    "list"                     0.30      2.77      0.30     2.77
    "!"                        0.14      1.29      0.14     1.29
    "do.call"                  0.14      1.29      0.00     0.00
    "c"                        0.10      0.92      0.10     0.92
    "aperm.default"            0.06      0.55      0.06     0.55
    "is.null"                  0.06      0.55      0.06     0.55
    "aperm"                    0.06      0.55      0.00     0.00
    "duplicated.default"       0.02      0.18      0.02     0.18
    
    $sample.interval
    [1] 0.02
    
    $sampling.time
    [1] 10.84
    

    And for the data frame:

    > summaryRprof("u2.txt")
    $by.self
                         self.time self.pct total.time total.pct
    "paste"                   1.72    94.51       1.72     94.51
    "[.data.frame"            0.06     3.30       1.82    100.00
    "duplicated.default"      0.04     2.20       0.04      2.20
    
    $by.total
                            total.time total.pct self.time self.pct
    "[.data.frame"                1.82    100.00      0.06     3.30
    "["                           1.82    100.00      0.00     0.00
    "unique"                      1.82    100.00      0.00     0.00
    "unique.data.frame"           1.82    100.00      0.00     0.00
    "duplicated"                  1.76     96.70      0.00     0.00
    "duplicated.data.frame"       1.76     96.70      0.00     0.00
    "paste"                       1.72     94.51      1.72    94.51
    "do.call"                     1.72     94.51      0.00     0.00
    "duplicated.default"          0.04      2.20      0.04     2.20
    
    $sample.interval
    [1] 0.02
    
    $sampling.time
    [1] 1.82
    

    What we notice is that the matrix version spends a lot of time in apply, paste, and lapply. In contrast, the data frame version simply runs duplicated.data.frame, and most of its time is spent in paste, presumably collapsing each row into a single string before duplicated.default compares them.
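
The two profiles suggest two ways of building one string key per row. The sketch below reproduces both on a small integer matrix so they can be compared side by side; it is an illustration of the pattern seen in the profiles, not the actual source of `unique.matrix` or `unique.data.frame`, and all variable names are made up for the example.

```r
set.seed(1)
m <- matrix(sample(2L, 1e4, replace = TRUE), ncol = 10)

# Matrix-style: call paste() once per row through apply().
system.time(
  keys_by_row <- apply(m, 1, function(r) paste(r, collapse = "\r"))
)

# Data-frame-style: split into columns, then call paste() once,
# letting it work across all rows in a single vectorised call.
cols <- lapply(seq_len(ncol(m)), function(j) m[, j])
system.time(
  keys_vectorised <- do.call(paste, c(cols, sep = "\r"))
)

# Both routes give the same keys, so they select the same rows.
all(keys_by_row == keys_vectorised)
u <- m[!duplicated(keys_vectorised), , drop = FALSE]
nrow(u) == nrow(unique(m))
```

On integer data the keys match exactly; with doubles, the string conversion caveat raised in the previous answer applies to both routes.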

    Although this explains where the time is going, it doesn't explain why these have different implementations, nor the effects of simply changing from one object type to another.
