I've begun to believe that data frames hold no advantages over matrices, except for notational convenience. However, I noticed this oddity when running unique.
Not sure, but I would guess that because a matrix is one contiguous vector, R copies it into column vectors first (like a data.frame) because paste needs a list of vectors. Note that both are slow because both use paste.
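A quick way to see the strategy difference is to build the row keys by hand. The sketch below reflects what the two paths effectively do according to the sources and profiles quoted later in this thread (paste() once per row inside apply() for the matrix, one vectorised paste() over a list of columns for the data frame); it is an illustration, not the actual internals:

a <- matrix(sample(2, 10^6, replace = TRUE), ncol = 10)
# matrix-style keys: paste() runs once per row inside apply()
system.time(k1 <- apply(a, 1, paste, collapse = "\r"))
# data.frame-style keys: one vectorised paste() over the columns
system.time(k2 <- do.call(paste, c(as.data.frame(a), sep = "\r")))
identical(k1, k2)  # TRUE: same keys, very different cost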
Perhaps because unique.data.table is already many times faster. Please upgrade to v1.6.7 by downloading it from the R-Forge repository, because that has the fix to unique you raised in this question. data.table doesn't use paste to do unique.
a = matrix(sample(2,10^6,replace = TRUE), ncol = 10)
b = as.data.frame(a)
system.time(u1<-unique(a))
   user  system elapsed
   2.98    0.00    2.99
system.time(u2<-unique(b))
   user  system elapsed
   0.99    0.00    0.99
c = as.data.table(b)
system.time(u3<-unique(c))
   user  system elapsed
   0.03    0.02    0.05
# 60 times faster than u1, 20 times faster than u2
identical(as.data.table(u2),u3)
[1] TRUE
In this implementation, unique.matrix is the same as unique.array:
> identical(unique.array, unique.matrix)
[1] TRUE
unique.array has to handle multi-dimensional arrays, which requires additional processing to ‘collapse’ the extra dimensions (those extra calls to paste()) which are not needed in the 2-dimensional case. The key section of code is:
collapse <- (ndim > 1L) && (prod(dx[-MARGIN]) > 1L)
temp <- if (collapse)
apply(x, MARGIN, function(x) paste(x, collapse = "\r"))
unique.data.frame is optimised for the 2D case; unique.matrix is not. It could be, as you suggest; it just isn't in the current implementation.
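To illustrate, a 2D-optimised version could mirror unique.data.frame: split the matrix into columns once, build one string key per row with a single paste() call, and subset. This is a sketch only; unique_matrix_2d is a hypothetical name, not a base R function:

# Hypothetical 2D-optimised unique over matrix rows (sketch, not base R)
unique_matrix_2d <- function(x, fromLast = FALSE) {
    stopifnot(is.matrix(x))
    cols <- lapply(seq_len(ncol(x)), function(j) x[, j])  # list of columns, built once
    keys <- do.call(paste, c(cols, sep = "\r"))           # one vectorised paste() call
    x[!duplicated(keys, fromLast = fromLast), , drop = FALSE]
}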
Note that in all cases (unique.{array,matrix,data.table}) where there is more than one dimension it is the string representation that is compared for uniqueness. For floating-point numbers this means 15 significant digits, so
NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 2), nrow = 2)))
is 1, while
NROW(unique(a <- matrix(rep(c(1, 1+5e-15), 2), nrow = 2)))
and
NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 1), nrow = 2)))
are both 2.
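The effect is visible at the string level; paste() goes through as.character(), which here formats doubles to 15 significant digits (newer versions of R have changed this formatting, so treat the exact strings as version-dependent):

as.character(1 + 4e-15)   # "1": indistinguishable at 15 significant digits
as.character(1 + 5e-15)   # "1.00000000000001": still distinguishable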
Are you sure unique is what you want?
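If uniqueness up to a numeric tolerance is what is really wanted, a common workaround is to round to a deliberate precision before comparing, for example:

# compare rows only after rounding to a chosen precision (8 decimal places here)
unique(round(a, digits = 8))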
In attempting to answer my own question, especially part 1, we can see where the time is spent by looking at the results of Rprof. I ran this again, with 5M elements.
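For reference, roughly how these profiles can be generated; the exact setup is an assumption, mirroring the earlier benchmark scaled to 5M elements:

a <- matrix(sample(2, 5*10^6, replace = TRUE), ncol = 10)
b <- as.data.frame(a)
Rprof("u1.txt"); u1 <- unique(a); Rprof(NULL)   # profile the matrix method
Rprof("u2.txt"); u2 <- unique(b); Rprof(NULL)   # profile the data frame method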
Here are the results for the first unique operation (for the matrix):
> summaryRprof("u1.txt")
$by.self
self.time self.pct total.time total.pct
"paste" 5.70 52.58 5.96 54.98
"apply" 2.70 24.91 10.68 98.52
"FUN" 0.86 7.93 6.82 62.92
"lapply" 0.82 7.56 1.00 9.23
"list" 0.30 2.77 0.30 2.77
"!" 0.14 1.29 0.14 1.29
"c" 0.10 0.92 0.10 0.92
"unlist" 0.08 0.74 1.08 9.96
"aperm.default" 0.06 0.55 0.06 0.55
"is.null" 0.06 0.55 0.06 0.55
"duplicated.default" 0.02 0.18 0.02 0.18
$by.total
total.time total.pct self.time self.pct
"unique" 10.84 100.00 0.00 0.00
"unique.matrix" 10.84 100.00 0.00 0.00
"apply" 10.68 98.52 2.70 24.91
"FUN" 6.82 62.92 0.86 7.93
"paste" 5.96 54.98 5.70 52.58
"unlist" 1.08 9.96 0.08 0.74
"lapply" 1.00 9.23 0.82 7.56
"list" 0.30 2.77 0.30 2.77
"!" 0.14 1.29 0.14 1.29
"do.call" 0.14 1.29 0.00 0.00
"c" 0.10 0.92 0.10 0.92
"aperm.default" 0.06 0.55 0.06 0.55
"is.null" 0.06 0.55 0.06 0.55
"aperm" 0.06 0.55 0.00 0.00
"duplicated.default" 0.02 0.18 0.02 0.18
$sample.interval
[1] 0.02
$sampling.time
[1] 10.84
And for the data frame:
> summaryRprof("u2.txt")
$by.self
self.time self.pct total.time total.pct
"paste" 1.72 94.51 1.72 94.51
"[.data.frame" 0.06 3.30 1.82 100.00
"duplicated.default" 0.04 2.20 0.04 2.20
$by.total
total.time total.pct self.time self.pct
"[.data.frame" 1.82 100.00 0.06 3.30
"[" 1.82 100.00 0.00 0.00
"unique" 1.82 100.00 0.00 0.00
"unique.data.frame" 1.82 100.00 0.00 0.00
"duplicated" 1.76 96.70 0.00 0.00
"duplicated.data.frame" 1.76 96.70 0.00 0.00
"paste" 1.72 94.51 1.72 94.51
"do.call" 1.72 94.51 0.00 0.00
"duplicated.default" 0.04 2.20 0.04 2.20
$sample.interval
[1] 0.02
$sampling.time
[1] 1.82
What we notice is that the matrix version spends a lot of time on apply, paste, and lapply. In contrast, the data frame version simply runs duplicated.data.frame and most of the time is spent in paste, presumably aggregating results.
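That reading is consistent with how duplicated.data.frame was implemented in base R at the time; roughly (a sketch of the 2.x-era source, not an exact quote):

duplicated.data.frame <- function(x, incomparables = FALSE, fromLast = FALSE, ...)
{
    if (!identical(incomparables, FALSE))
        .NotYetUsed("incomparables != FALSE")
    # one vectorised paste() builds a string key per row,
    # then duplicated() runs on the keys
    duplicated(do.call("paste", c(x, sep = "\r")), fromLast = fromLast)
}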
Although this explains where the time is going, it doesn't explain why these have different implementations, nor the effects of simply changing from one object type to another.