Why is running “unique” faster on a data frame than a matrix in R?

旧巷少年郎 2020-12-29 23:53

I've begun to believe that data frames hold no advantages over matrices, except for notational convenience. However, I noticed this oddity when running unique

3 Answers
  •  执念已碎
    2020-12-30 00:27

    1. In this implementation, unique.matrix is the same as unique.array

      > identical(unique.array, unique.matrix)

      [1] TRUE

    2. unique.array has to handle multi-dimensional arrays, which requires additional processing to ‘collapse’ the extra dimensions (those extra calls to paste()). That processing is not needed in the 2-dimensional case. The key section of code is:

      collapse <- (ndim > 1L) && (prod(dx[-MARGIN]) > 1L)

      temp <- if (collapse) apply(x, MARGIN, function(x) paste(x, collapse = "\r"))

    3. unique.data.frame is optimised for the 2D case; unique.matrix is not. It could be, as you suggest; it just isn't in the current implementation.
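To make the collapse step in point 2 concrete, here is a minimal sketch of the row-wise (MARGIN = 1) case. The helper name unique_rows_sketch is hypothetical and not part of base R; it only mirrors the two quoted lines and is not the actual source of unique.matrix.

```r
# Hypothetical helper (not base R): mimics the quoted collapse step
# for the row-wise case, MARGIN = 1.
unique_rows_sketch <- function(x) {
  stopifnot(is.matrix(x))
  # Paste every row into one string key, elements separated by "\r"
  keys <- apply(x, 1L, function(row) paste(row, collapse = "\r"))
  # Keep only the rows whose key has not appeared earlier
  x[!duplicated(keys), , drop = FALSE]
}

# Rows are (1, 3), (2, 4), (1, 3); the third row repeats the first
m <- matrix(c(1, 2, 1, 3, 4, 3), nrow = 3)
res <- unique_rows_sketch(m)
print(res)
```

The apply() call runs one paste() per row, which is the extra work the timings pick up.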

    Note that in all cases (unique.{array,matrix,data.frame}) where there is more than one dimension, it is the string representation that is compared for uniqueness. For floating point numbers this means 15 significant digits, so

    NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 2), nrow = 2)))

    is 1 while

    NROW(unique(a <- matrix(rep(c(1, 1+5e-15), 2), nrow = 2)))

    and

    NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 1), nrow = 2)))

    are both 2. Are you sure unique is what you want?
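The first result can be checked directly by looking at the string keys. This is only a sketch, and it assumes that paste() converts doubles the way as.character() does, keeping at most 15 significant digits; the variable names are illustrative.

```r
a <- 1 + 4e-15
b <- 1 + 5e-15

# As doubles, a is not equal to 1 ...
a_differs <- a != 1

# ... but its string key matches the key for 1, while the key for b does not
key_one <- as.character(1)
key_a <- as.character(a)
key_b <- as.character(b)

print(c(a_differs = a_differs,
        a_same_key = key_a == key_one,
        b_same_key = key_b == key_one))
```

In the one-column example no collapse happens, so the doubles themselves are compared and both rows survive.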
