Remove duplicates keeping entry with largest absolute value

醉酒成梦 2020-11-28 10:12

Let's say I have four samples: id=1, 2, 3, and 4, with one or more measurements on each of those samples:

> a <- data.frame(id=c(1,1,2,2,3,4), value=c         


        
7 Answers
  • 2020-11-28 10:15

    Another approach (though the code might look a little cumbersome) is to use ave():

    a[which(abs(a$value) == ave(a$value, a$id, 
                                FUN=function(x) max(abs(x)))), ]
    #   id value
    # 2  1     2
    # 4  2    -4
    # 5  3    -5
    # 6  4     6
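
    To see why the comparison above works, it can help to look at what ave() returns on its own. The sketch below is illustrative only: the question's data is truncated, so the values in `a` are assumptions chosen to match the output shown, and `grp_max` is a hypothetical helper name.

    ```r
    # Assumed sample data (the original values are cut off in the question)
    a <- data.frame(id = c(1, 1, 2, 2, 3, 4), value = c(1, 2, 3, -4, -5, 6))

    # ave() applies FUN within each id group and returns a vector
    # with one entry per row of a, aligned with the original rows
    grp_max <- ave(a$value, a$id, FUN = function(x) max(abs(x)))

    # Keep the rows whose absolute value equals their group's maximum
    res <- a[abs(a$value) == grp_max, ]
    print(res)
    ```

    Because the per-group maximum is repeated for every row of the group, a plain elementwise comparison is enough and no merge step is needed.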
    
  • 2020-11-28 10:18
    library(plyr)
    ddply(a, .(id), function(x) return(x[which(abs(x$value)==max(abs(x$value))),]))
    
  • 2020-11-28 10:23

    Check out ?aggregate:

    aggregate(value~id,a,function(x) x[which.max(abs(x))])
    

    I like the answer by @DWin, but I would like to show how this could also work when the data frame carries metadata (additional columns):

    aa<-merge(aggregate(value~id,a,function(x) x[which.max(abs(x))]),a)
    # Without the next line, duplicate rows remain if the max value is repeated within a single id.
    aa[!duplicated(aa),]
    

    I couldn't help myself and created one last answer:

    do.call(rbind,lapply(split(a,a$id),function(x) x[which.max(abs(x$value)),]))
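
    As a rough illustration of the merge-based variant with metadata, the sketch below adds an extra column to the data. The values in `a` and the `label` column are assumptions made for this example (the question's data is truncated), not part of the original post.

    ```r
    # Assumed sample data with a hypothetical metadata column `label`
    a <- data.frame(id = c(1, 1, 2, 2, 3, 4),
                    value = c(1, 2, 3, -4, -5, 6),
                    label = letters[1:6])

    # One row per id, holding the value with the largest absolute value
    best <- aggregate(value ~ id, a, function(x) x[which.max(abs(x))])

    # merge() joins on the shared columns (id and value),
    # which brings the metadata column back in
    aa <- merge(best, a)
    aa <- aa[!duplicated(aa), ]  # drop exact duplicate rows, if any
    print(aa)
    ```

    Note that the join matches on the numeric `value` column, so this relies on the values comparing exactly equal.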
    
  • 2020-11-28 10:30

    First, sort so that the less desired items come last within each id group:

     aa <- a[order(a$id, -abs(a$value) ), ] #sort by id and reverse of abs(value)
    

    Then remove every row after the first within each id group:

     aa[ !duplicated(aa$id), ]              # take the first row within each id
      id value
    2  1     2
    4  2    -4
    5  3    -5
    6  4     6
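
    One property of this sort-then-deduplicate pattern is that it keeps exactly one row per id, even when two values tie on absolute value. The sketch below uses made-up data (`b` is a hypothetical data frame, not from the question) and compares it with an equality-based filter using ave(), which keeps every tied row.

    ```r
    # Hypothetical data with a tie in absolute value for id 1
    b <- data.frame(id = c(1, 1, 2), value = c(-3, 3, 5))

    # Sort by id, then by decreasing absolute value; ties keep their original order
    bb <- b[order(b$id, -abs(b$value)), ]
    first_only <- bb[!duplicated(bb$id), ]   # one row per id

    # Equality-based filter: keeps all rows that reach the group maximum
    all_ties <- b[abs(b$value) == ave(b$value, b$id,
                                      FUN = function(x) max(abs(x))), ]

    print(first_only)
    print(all_ties)
    ```

    Which behaviour is preferable depends on whether tied measurements should be collapsed to a single row or all reported.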
    
  • 2020-11-28 10:32

    A data.table approach might be in order if your data set is very large:

    library(data.table)
    
    aDT <- as.data.table(a)
    setkey(aDT,"id")
    
    # by = .EACHI evaluates j once per row of i (required since data.table 1.9.4)
    aDT[J(unique(id)), list(value = value[which.max(abs(value))]), by = .EACHI]
    


    Or a slightly slower, but still fast, alternative:

    library(data.table)
    as.data.table(a)[, .SD[which.max(abs(value))], by=id]
    

    This version returns all the columns of a, in case there are more in the real dataset.

  • You can do this with dplyr as follows:

    library(dplyr)
    a %>%
      group_by(id) %>%
      filter(abs(value) == max(abs(value))) %>%
      ungroup()
    