Subset with unique cases, based on multiple columns

瘦欲@ 提交于 2019-11-28 03:34:19

You can use the duplicated() function to find the unique combinations:

> df[!duplicated(df[1:3]),]
  v1 v2 v3  v4 v5
1  7  1  A 100 98
2  7  2  A  98 97
3  8  1  C  NA 80
6  9  3  C  75 75

To get only the duplicates, you can check it in both directions:

> df[duplicated(df[1:3]) | duplicated(df[1:3], fromLast=TRUE),]
  v1 v2 v3 v4 v5
3  8  1  C NA 80
4  8  1  C 78 75
5  8  1  C 50 62

You can use the plyr package:

library(plyr)

ddply(df, c("v1","v2","v3"), head, 1)
#   v1 v2 v3  v4 v5
# 1  7  1  A 100 98
# 2  7  2  A  98 97
# 3  8  1  C  NA 80
# 4  9  3  C  75 75

ddply(df, c("v1","v2","v3"), function(x) if(nrow(x)>1) x else NULL)
#   v1 v2 v3 v4 v5
# 1  8  1  C NA 80
# 2  8  1  C 78 75
# 3  8  1  C 50 62

Using dplyr you could do:

library(dplyr)

# distinct
df %>% 
  distinct(v1, v2, v3, .keep_all = T)

# non-distinct only
df %>% 
  group_by(v1, v2, v3) %>% 
  filter(n() > 1)

# exclude any non-distinct
df %>% 
  group_by(v1, v2, v3) %>% 
  filter(n() == 1)

yeah but using plyr and ddply is very very slow if you have too much data.

you shd try something of this sort:

df[ cbind( which(duplicated(df[1:3])), which(duplicated(df[1:3], fromLast=TRUE))),]

or::

from = which(duplicated(df[1:3])
to = which(duplicated(df[1:3], fromLast=TRUE))
df[cbind(from,to),]

shd be faster for the most part.

test it out and let us know

there are some errors but im guessing you could fix those as long as you get the idea.

also try unique and all that

A non-elegant but functional way is to paste the entries of a given row together and find which are unique (or non-unique) rows, something like:

df.vector=apply(df,1,FUN=function(x) {paste(x,collapse="")})
df.table=table(df.vector)

then get the indexes of the duplicates with something like:

which(df.vector%in%names(which(df.table>1)))
标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!