Why is running “unique” faster on a data frame than a matrix in R?

后端未结

关注

 3  1426

I\'ve begun to believe that data frames hold no advantages over matrices, except for notational convenience. However, I noticed this oddity when running unique

相关标签:

3条回答

误落风尘

2020-12-30 00:23
1. Not sure but I guess that because matrix is one contiguous vector, R copies it into column vectors first (like a data.frame) because paste needs a list of vectors. Note that both are slow because both use paste.
2. Perhaps because unique.data.table is already many times faster. Please upgrade to v1.6.7 by downloading it from the R-Forge repository because that has the fix to unique you raised in this question. data.table doesn't use paste to do unique.
```
a = matrix(sample(2,10^6,replace = TRUE), ncol = 10)
b = as.data.frame(a)
system.time(u1<-unique(a))
   user  system elapsed 
   2.98    0.00    2.99 
system.time(u2<-unique(b))
   user  system elapsed 
   0.99    0.00    0.99 
c = as.data.table(b)
system.time(u3<-unique(c))
   user  system elapsed 
   0.03    0.02    0.05  # 60 times faster than u1, 20 times faster than u2
identical(as.data.table(u2),u3)
[1] TRUE
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
执念已碎

2020-12-30 00:27
1. In this implementation, unique.matrix is the same as unique.array
  
  > identical(unique.array, unique.matrix)
  
  [1] TRUE
2. unique.array has to handle multi-dimensional arrays which requires additional processing to ‘collapse’ the extra dimensions (those extra calls to paste()) which are not needed in the 2-dimensional case. The key section of code is:
  
  collapse <- (ndim > 1L) && (prod(dx[-MARGIN]) > 1L)
  
  temp <- if (collapse) apply(x, MARGIN, function(x) paste(x, collapse = "\r"))
3. unique.data.frame is optimised for the 2D case, unique.matrix is not. It could be, as you suggest, it just isn't in the current implementation.
Note that in all cases (unique.{array,matrix,data.table}) where there is more than one dimension it is the string representation that is compared for uniqueness. For floating point numbers this means 15 decimal digits so

NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 2), nrow = 2)))

is 1 while

NROW(unique(a <- matrix(rep(c(1, 1+5e-15), 2), nrow = 2)))

and

NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 1), nrow = 2)))

are both 2. Are you sure unique is what you want?
0 讨论(0)
发布评论:

提交评论
- 加载中...

野的像风

2020-12-30 00:27

In attempting to answer my own question, especially part 1, we can see where the time is spent by looking at the results of Rprof. I ran this again, with 5M elements.

Here are the results for the first unique operation (for the matrix):

> summaryRprof("u1.txt")
$by.self
                     self.time self.pct total.time total.pct
"paste"                   5.70    52.58       5.96     54.98
"apply"                   2.70    24.91      10.68     98.52
"FUN"                     0.86     7.93       6.82     62.92
"lapply"                  0.82     7.56       1.00      9.23
"list"                    0.30     2.77       0.30      2.77
"!"                       0.14     1.29       0.14      1.29
"c"                       0.10     0.92       0.10      0.92
"unlist"                  0.08     0.74       1.08      9.96
"aperm.default"           0.06     0.55       0.06      0.55
"is.null"                 0.06     0.55       0.06      0.55
"duplicated.default"      0.02     0.18       0.02      0.18

$by.total
                     total.time total.pct self.time self.pct
"unique"                  10.84    100.00      0.00     0.00
"unique.matrix"           10.84    100.00      0.00     0.00
"apply"                   10.68     98.52      2.70    24.91
"FUN"                      6.82     62.92      0.86     7.93
"paste"                    5.96     54.98      5.70    52.58
"unlist"                   1.08      9.96      0.08     0.74
"lapply"                   1.00      9.23      0.82     7.56
"list"                     0.30      2.77      0.30     2.77
"!"                        0.14      1.29      0.14     1.29
"do.call"                  0.14      1.29      0.00     0.00
"c"                        0.10      0.92      0.10     0.92
"aperm.default"            0.06      0.55      0.06     0.55
"is.null"                  0.06      0.55      0.06     0.55
"aperm"                    0.06      0.55      0.00     0.00
"duplicated.default"       0.02      0.18      0.02     0.18

$sample.interval
[1] 0.02

$sampling.time
[1] 10.84

And for the data frame:

> summaryRprof("u2.txt")
$by.self
                     self.time self.pct total.time total.pct
"paste"                   1.72    94.51       1.72     94.51
"[.data.frame"            0.06     3.30       1.82    100.00
"duplicated.default"      0.04     2.20       0.04      2.20

$by.total
                        total.time total.pct self.time self.pct
"[.data.frame"                1.82    100.00      0.06     3.30
"["                           1.82    100.00      0.00     0.00
"unique"                      1.82    100.00      0.00     0.00
"unique.data.frame"           1.82    100.00      0.00     0.00
"duplicated"                  1.76     96.70      0.00     0.00
"duplicated.data.frame"       1.76     96.70      0.00     0.00
"paste"                       1.72     94.51      1.72    94.51
"do.call"                     1.72     94.51      0.00     0.00
"duplicated.default"          0.04      2.20      0.04     2.20

$sample.interval
[1] 0.02

$sampling.time
[1] 1.82

What we notice is that the matrix version spends a lot of time on apply, paste, and lapply. In contrast, the data frame version simple runs duplicated.data.frame and most of the time is spent in paste, presumably aggregating results.

Although this explains where the time is going, it doesn't explain why these have different implementations, nor the effects of simply changing from one object type to another.

0 讨论(0)