Fastest way to remove all duplicates in R

不思量自难忘° 2020-12-10 14:04

I'd like to remove all items that appear more than once in a vector. Specifically, this includes character, numeric and integer vectors. Currently, I'm using duplica

3 answers
  • 2020-12-10 14:45

    You could use a set operation:

    d <- c(1, 2, 3, 4, 1, 5, 6, 4, 2, 1)
    duplicates <- d[duplicated(d)]  # values that occur more than once
    setdiff(d, duplicates)          # keep only values that never repeat
    # [1] 3 5 6
    

    (I'm not certain this is more efficient than the code in the question, but it does seem conceptually cleaner.)
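Since the question asks about character, numeric and integer vectors, here is a minimal sketch applying the set-operation idea, alongside the two-sided `duplicated()` mask used in the benchmarks, to each type. The helper names `drop_dups_setdiff` and `drop_dups_mask` and the sample vectors are made up for illustration.

```r
# Hypothetical helpers; the names are not from the original answers.
drop_dups_setdiff <- function(x) setdiff(x, x[duplicated(x)])
drop_dups_mask <- function(x) x[!(duplicated(x) | duplicated(x, fromLast = TRUE))]

chr <- c("a", "b", "a", "c")
int <- c(1L, 2L, 2L, 3L)
num <- c(0.5, 1.5, 0.5, 2.5)

drop_dups_setdiff(chr)  # "b" "c"
drop_dups_setdiff(int)  # 1 3
drop_dups_setdiff(num)  # 1.5 2.5

# The mask-based version gives the same values for these inputs.
drop_dups_mask(chr)     # "b" "c"
```

Both helpers keep the surviving values in their original order of appearance.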

  • 2020-12-10 14:50

    Some timings:

    set.seed(1001)
    d <- sample(1:100000, 100000, replace=TRUE)
    d <- c(d, sample(d, 20000, replace=TRUE))  # ensure many duplicates
    mb <- microbenchmark::microbenchmark(
      d[!(duplicated(d) | duplicated(d, fromLast=TRUE))],
      setdiff(d, d[duplicated(d)]),
      {tmp <- rle(sort(d)); tmp$values[tmp$lengths == 1]},
      as.integer(names(table(d)[table(d)==1])),
      d[!(duplicated.default(d) | duplicated.default(d, fromLast=TRUE))],
      d[!(d %in% d[duplicated(d)])],
      { ud = unique(d); ud[tabulate(match(d, ud)) == 1L] },
      d[!(.Internal(duplicated(d, F, F, NA)) | .Internal(duplicated(d, F, T, NA)))]
    )
    summary(mb)[, c(1, 4)]  # in milliseconds
    #                                                                                expr      mean
    #1                               d[!(duplicated(d) | duplicated(d, fromLast = TRUE))]  18.34692
    #2                                                       setdiff(d, d[duplicated(d)])  24.84984
    #3                       {     tmp <- rle(sort(d))     tmp$values[tmp$lengths == 1] }   9.53831
    #4                                         as.integer(names(table(d)[table(d) == 1])) 255.76300
    #5               d[!(duplicated.default(d) | duplicated.default(d, fromLast = TRUE))]  18.35360
    #6                                                      d[!(d %in% d[duplicated(d)])]  24.01009
    #7                        {     ud = unique(d)     ud[tabulate(match(d, ud)) == 1L] }  32.10166
    #8 d[!(.Internal(duplicated(d, F, F, NA)) | .Internal(duplicated(d,      F, T, NA)))]  18.33475
    

    Given the comments, let's check that they all return the correct result, using the small example vector from the first answer:

     d <- c(1, 2, 3, 4, 1, 5, 6, 4, 2, 1)
     results <- list(d[!(duplicated(d) | duplicated(d, fromLast=TRUE))],
             setdiff(d, d[duplicated(d)]),
             {tmp <- rle(sort(d)); tmp$values[tmp$lengths == 1]},
             as.integer(names(table(d)[table(d)==1])),
             d[!(duplicated.default(d) | duplicated.default(d, fromLast=TRUE))],
             d[!(d %in% d[duplicated(d)])],
             { ud = unique(d); ud[tabulate(match(d, ud)) == 1L] },
             d[!(.Internal(duplicated(d, F, F, NA)) | .Internal(duplicated(d, F, T, NA)))])
     # all.equal() returns a character vector on mismatch, so wrap it in isTRUE()
     all(sapply(results, function(x) isTRUE(all.equal(x, c(3, 5, 6)))))
     # TRUE
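One caveat when comparing these approaches: the `rle(sort(d))` variant returns the surviving values in sorted order, while the `duplicated()`-based subsetting keeps them in their original order. The two only agree element-for-element when the survivors already appear in ascending order. A small sketch with a made-up unsorted vector `x`:

```r
x <- c(6, 3, 5, 3, 1)  # illustrative input, not from the original post

# Two-sided duplicated() mask: preserves original order.
by_mask <- x[!(duplicated(x) | duplicated(x, fromLast = TRUE))]

# rle() on the sorted vector: result comes back sorted.
runs <- rle(sort(x))
by_rle <- runs$values[runs$lengths == 1]

by_mask  # 6 5 1
by_rle   # 1 5 6
identical(sort(by_mask), by_rle)  # TRUE
```

If order matters, compare the results after sorting, or pick the approach whose ordering you need.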
    
  • 2020-12-10 14:51

    You can do this with the rle function:

    tmp <- rle(sort(d))
    res <- tmp$values[tmp$lengths == 1]
    

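    The idea is to sort the vector so that equal values become adjacent, count the length of each run of equal values, and keep only the values whose run length is 1.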
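To make the intermediate steps concrete, here is a short walk-through on the small vector from the first answer. The variable name `runs` is chosen here for illustration.

```r
d <- c(1, 2, 3, 4, 1, 5, 6, 4, 2, 1)

runs <- rle(sort(d))  # sort(d) is 1 1 1 2 2 3 4 4 5 6
runs$values           # 1 2 3 4 5 6  (the distinct values)
runs$lengths          # 3 2 1 2 1 1  (how often each one occurs)

# Keep the values that occur exactly once.
runs$values[runs$lengths == 1]  # 3 5 6
```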

    There are plenty of alternatives here: Counting the number of elements with the values of x in a vector

    Edit

    After looking at @NBATrends's benchmarks, I got suspicious. In theory, counting items in a single pass should be roughly 2x faster than the original duplicated logic, which scans the vector twice.

    I tried doing this with data.table:

    library(data.table)
    dt <- data.table(d)
    res <-  dt[, count:= .N, by = d][count == 1]$d
    

    Here are the benchmarks on different sample sizes for three solutions (I narrowed it down to the fastest approaches):

    As the sample size grows, data.table begins to outperform the other methods (by about 2x).

    Here is the code to reproduce:

    set.seed(1001)
    N <- c(3, 4, 5, 6, 7)
    n <- 10^N
    res <- lapply(n, function(x) {
      # note: 1:x/10 parses as (1:x)/10, i.e. the values 0.1, 0.2, ..., x/10
      d <- sample(1:x/10, 5 * x, replace = TRUE)
      d <- c(d, sample(d, x, replace = TRUE))  # ensure many duplicates
      dt <- data.table(d)
      mb <- microbenchmark::microbenchmark(
        "duplicated(original)" = d[!(duplicated(d) | duplicated(d, fromLast = TRUE))],
        "tabulate" = { ud = unique(d); ud[tabulate(match(d, ud)) == 1L] },
        "data.table" = dt[, count := .N, by = d][count == 1]$d,
        times = 1, unit = "ms")
      sm <- summary(mb)[, c(1, 4, 8)]  # columns: expr, mean, neval
      sm$size <- x
      sm
    })

    res <- do.call("rbind", res)

    library(ggplot2)
    # mean time against sample size on log-log axes, one line per method
    ggplot(res, aes(x = size, y = mean, colour = expr)) +
      geom_line() + geom_point() +
      scale_x_log10() + scale_y_log10()
    