Subset with unique cases, based on multiple columns

前端 未结 7 2128
执笔经年
执笔经年 2020-12-04 12:06

I\'d like to subset a dataframe to include only rows that have unique combinations of three columns. My situation is similar to the one presented in this question, but I\'d

相关标签:
7条回答
  • 2020-12-04 12:40

    You can use the duplicated() function to find the unique combinations:

    > df[!duplicated(df[1:3]),]
      v1 v2 v3  v4 v5
    1  7  1  A 100 98
    2  7  2  A  98 97
    3  8  1  C  NA 80
    6  9  3  C  75 75
    

    To get only the duplicates, you can check it in both directions:

    > df[duplicated(df[1:3]) | duplicated(df[1:3], fromLast=TRUE),]
      v1 v2 v3 v4 v5
    3  8  1  C NA 80
    4  8  1  C 78 75
    5  8  1  C 50 62
    
    0 讨论(0)
  • 2020-12-04 12:42

    yeah but using plyr and ddply is very very slow if you have too much data.

    you shd try something of this sort:

    df[ cbind( which(duplicated(df[1:3])), which(duplicated(df[1:3], fromLast=TRUE))),]
    

    or::

    from = which(duplicated(df[1:3])
    to = which(duplicated(df[1:3], fromLast=TRUE))
    df[cbind(from,to),]
    

    shd be faster for the most part.

    test it out and let us know

    there are some errors but im guessing you could fix those as long as you get the idea.

    also try unique and all that

    0 讨论(0)
  • 2020-12-04 12:43

    A non-elegant but functional way is to paste the entries of a given row together and find which are unique (or non-unique) rows, something like:

    df.vector=apply(df,1,FUN=function(x) {paste(x,collapse="")})
    df.table=table(df.vector)
    

    then get the indexes of the duplicates with something like:

    which(df.vector%in%names(which(df.table>1)))
    
    0 讨论(0)
  • 2020-12-04 12:49

    You can use the plyr package:

    library(plyr)
    
    ddply(df, c("v1","v2","v3"), head, 1)
    #   v1 v2 v3  v4 v5
    # 1  7  1  A 100 98
    # 2  7  2  A  98 97
    # 3  8  1  C  NA 80
    # 4  9  3  C  75 75
    
    ddply(df, c("v1","v2","v3"), function(x) if(nrow(x)>1) x else NULL)
    #   v1 v2 v3 v4 v5
    # 1  8  1  C NA 80
    # 2  8  1  C 78 75
    # 3  8  1  C 50 62
    
    0 讨论(0)
  • 2020-12-04 12:52

    Using dplyr you could do:

    library(dplyr)
    
    # distinct
    df %>% 
      distinct(v1, v2, v3, .keep_all = T)
    
    # non-distinct only
    df %>% 
      group_by(v1, v2, v3) %>% 
      filter(n() > 1)
    
    # exclude any non-distinct
    df %>% 
      group_by(v1, v2, v3) %>% 
      filter(n() == 1)
    
    0 讨论(0)
  • 2020-12-04 12:59

    Provided you're working with data.table, it's easy to have unique cases with respect to one or more variables. The syntax would be the following

    unique(DT, by = "var1, var2")
    
    0 讨论(0)
提交回复
热议问题