Why is running “unique” faster on a data frame than a matrix in R?


旧巷少年郎 2020-12-29 23:53

I've begun to believe that data frames hold no advantages over matrices, except for notational convenience. However, I noticed this oddity when running unique

3 answers

执念已碎 (original poster)

2020-12-30 00:27
1. In this implementation, unique.matrix is the same as unique.array
  
  > identical(unique.array, unique.matrix)
  
  [1] TRUE
2. unique.array has to handle multi-dimensional arrays, which requires additional processing to ‘collapse’ the extra dimensions (the extra calls to paste()). That processing is not needed in the 2-dimensional case. The key section of code is:
  
  collapse <- (ndim > 1L) && (prod(dx[-MARGIN]) > 1L)
  
  temp <- if (collapse) apply(x, MARGIN, function(x) paste(x, collapse = "\r"))
3. unique.data.frame is optimised for the 2D case, unique.matrix is not. It could be, as you suggest, it just isn't in the current implementation.
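
To make the difference concrete, here is a minimal sketch comparing the row-by-row paste() from the snippet quoted above with a single vectorised paste() over columns. The test data, the variable names, and the column-wise key construction are assumptions for illustration; they are not taken from the R sources, and the actual implementation of either method may differ between R versions.

```r
# Hypothetical test data: 10,000 rows, 10 columns of 1s and 2s,
# so that many rows are duplicates.
set.seed(1)
m  <- matrix(sample(1:2, 1e5, replace = TRUE), ncol = 10)
df <- as.data.frame(m)

# Row-by-row collapse, as in the unique.array snippet:
# one paste() call per row.
t_row <- system.time(
  keys_by_row <- apply(m, 1, function(r) paste(r, collapse = "\r"))
)

# Column-wise collapse: one vectorised paste() call across all columns.
# (Assumption: this is how a 2D-specific method could build its keys.)
t_col <- system.time(
  keys_by_col <- do.call(paste, c(unname(as.list(df)), sep = "\r"))
)

# Both approaches produce the same key for every row.
stopifnot(identical(keys_by_row, keys_by_col))

# Compare timings of the key construction and of unique() itself.
print(rbind(row_wise = t_row, col_wise = t_col)[, 1:3])
print(rbind(matrix     = system.time(u_mat <- unique(m)),
            data.frame = system.time(u_df  <- unique(df)))[, 1:3])

# Both methods keep the first occurrence of each distinct row.
same_rows <- nrow(u_mat) == nrow(u_df) && all(as.matrix(u_df) == u_mat)
```

Timings depend on the machine and the R version, so treat the printed numbers as indicative only.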
Note that in all cases (unique.{array,matrix,data.frame}) where there is more than one dimension, it is the string representation that is compared for uniqueness. For floating-point numbers this means 15 significant digits, so

NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 2), nrow = 2)))

is 1 while

NROW(unique(a <- matrix(rep(c(1, 1+5e-15), 2), nrow = 2)))

and

NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 1), nrow = 2)))

are both 2. Are you sure unique is what you want?
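
The effect of the 15-digit string representation can be seen directly. The sketch below uses sprintf("%.15g", ...) as a stand-in for the conversion; that choice is an assumption made for illustration, and whether unique() itself compares strings depends on the R version in use.

```r
# Three doubles that are all distinct as numbers.
x <- c(1, 1 + 4e-15, 1 + 5e-15)

# Render each value with 15 significant digits.
keys <- sprintf("%.15g", x)
print(keys)
# "1" "1" "1.00000000000001"

# Numerically, the first two values differ ...
numerically_equal <- x[1] == x[2]       # FALSE

# ... but their 15-digit strings are the same.
equal_as_strings <- keys[1] == keys[2]  # TRUE
```

So 1 and 1+4e-15 collapse to the same key, while 1+5e-15 rounds up to a different one.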