I'd like to remove all items that appear more than once in a vector. Specifically, this includes character, numeric and integer vectors. Currently, I'm using duplica
You can do this with the rle function:
tmp <- rle(sort(d))
res <- tmp$values[tmp$lengths == 1]
The idea is to count how many times each value occurs in the sorted vector and keep only the values that occur exactly once.
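As a small illustration of what rle returns here, the following sketch uses a made-up character vector (the data and the variable names are assumptions for the example only). Note that the result comes back in sorted order rather than in the original order, and that sort() drops NA values by default.

```r
# Toy input (hypothetical data, not from the question)
d <- c("b", "a", "c", "a", "b", "d")

# After sorting, equal values sit next to each other, so each run
# found by rle() corresponds to one distinct value
runs <- rle(sort(d))
runs$values   # distinct values: "a" "b" "c" "d"
runs$lengths  # how often each occurs: 2 2 1 1

# Keep only the values whose run length is exactly 1
singles <- runs$values[runs$lengths == 1]
singles       # "c" "d"
```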
There are plenty of alternatives here: Counting the number of elements with the values of x in a vector
Edit
After looking at the benchmarks from @NBATrends, I got suspicious.
In theory, counting items in a single pass through the vector should be about 2x faster than the original duplicated logic, which scans the vector twice (once forward and once with fromLast = TRUE).
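To make the comparison concrete, here is a minimal sketch on a made-up numeric vector. It shows the two duplicated passes next to a plain counting approach. table() is used only because it is easy to read; it is an assumption for illustration, not one of the fast methods benchmarked in this answer.

```r
# Toy input (hypothetical data)
d <- c(3, 1, 2, 3, 1, 5)

# duplicated-based logic: two passes over the vector
dup_forward  <- duplicated(d)                   # flags later repeats
dup_backward <- duplicated(d, fromLast = TRUE)  # flags earlier repeats
singles_dup  <- d[!(dup_forward | dup_backward)]

# Counting logic: count every value once, keep those with count 1
counts      <- table(d)
singles_cnt <- as.numeric(names(counts)[as.vector(counts) == 1])

singles_dup  # 2 5
singles_cnt  # 2 5
```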
I tried doing this with data.table:
library(data.table)
dt <- data.table(d)
res <- dt[, count:= .N, by = d][count == 1]$d
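For a quick check on a small input, the same grouping idea can be sketched without adding a column by reference. This is a variant written for illustration, not the exact expression benchmarked in this answer; the toy data and the name singles are assumptions, and it requires the data.table package to be installed.

```r
library(data.table)

# Toy input (hypothetical data)
dt <- data.table(d = c(3, 1, 2, 3, 1, 5))

# .N is the number of rows in each group; grouping by d gives one
# row per distinct value together with its count in column N
counts <- dt[, .N, by = d]

# Keep the values that occur exactly once
singles <- counts[N == 1L, d]
singles  # 2 5
```

Unlike count := .N, this form leaves dt unchanged, which can be convenient when the table is reused afterwards.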
And here are the benchmarks on different sample sizes for three solutions (I have reduced it to fast unique approaches):
You can see that as the sample size grows, data.table begins to outperform the other methods (by about 2x).
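For reference, the tabulate variant among the benchmarked solutions can be unpacked step by step on a small vector. The toy data and intermediate names are assumptions for the example; passing nbins explicitly is an addition here to make the bin count obvious.

```r
# Toy input (hypothetical data)
d <- c(3, 1, 2, 3, 1, 5)

ud  <- unique(d)      # distinct values in order of first appearance: 3 1 2 5
idx <- match(d, ud)   # position of each element within ud: 1 2 3 1 2 4

# tabulate() counts how often each integer index occurs
counts <- tabulate(idx, nbins = length(ud))  # 2 2 1 1

# Keep the distinct values that were seen exactly once
singles <- ud[counts == 1L]
singles  # 2 5
```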
Here is the code to reproduce:
library(data.table)
set.seed(1001)
N <- c(3, 4, 5, 6, 7)
n <- 10^N
res <- lapply(n, function(x) {
  # note: 1:x/10 parses as (1:x)/10, so the sampled values are numeric
  d <- sample(1:x/10, 5 * x, replace = TRUE)
  d <- c(d, sample(d, x, replace = TRUE)) # ensure many duplicates
  dt <- data.table(d)
  mb <- microbenchmark::microbenchmark(
    "duplicated(original)" = d[!(duplicated(d) | duplicated(d, fromLast = TRUE))],
    "tabulate" = { ud <- unique(d); ud[tabulate(match(d, ud)) == 1L] },
    "data.table" = dt[, count := .N, by = d][count == 1]$d,
    times = 1, unit = "ms")
  sm <- summary(mb)[, c(1, 4, 8)] # columns 1, 4, 8: expr, mean, neval
  sm$size <- x
  sm
})
res <- do.call("rbind", res)
library(ggplot2)
# Aesthetics set in ggplot() are inherited by the geoms
ggplot(res, aes(x = size, y = mean, colour = expr)) +
  geom_line() + scale_x_log10() + scale_y_log10() +
  geom_point()