Identifying duplicated rows

回眸只為那壹抹淺笑 submitted on 2019-12-02 19:07:00

Question


I have a large data frame (~50K rows and 50 to 75 columns) with a small number of rows that are duplicated in, say, 7 of the 75 columns. It's simple enough to locate rows that duplicate an earlier row using duplicated(...), but I want to pull out both the duplicate rows and the rows they duplicate. For example (borrowed from an earlier post):

a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
d <- c('x','y','x','z','y','y','z','x')
df <- data.frame(a,b,d)
df
  a b d
1 A 1 x
2 A 1 y
3 A 2 x
4 B 4 z
5 B 1 y
6 B 1 y
7 C 2 z
8 C 2 x

duplicated(df[,c(1,2)]) flags rows 2, 6, and 8. On the basis of columns 1 and 2, row 2 duplicates row 1, row 6 duplicates row 5, and row 8 duplicates row 7. So I want to review rows 1 and 2 together to see what the differences, if any, might be in column d. That's easy enough with 8 rows and 3 columns, but my problem is much bigger.

To sum up, I'm looking for a simple way to find the row indices of every row involved in a duplication (here rows 1 and 2, 5 and 6, and 7 and 8) based on a subset of the 50-75 columns, so I can visually compare the rows that match on that subset.

Thoughts?


Answer 1:


# Flag duplicates scanning from the top and from the bottom, then combine
which(duplicated(df[, 1:2]) | duplicated(df[, 1:2], fromLast = TRUE))
#[1] 1 2 5 6 7 8
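
To go from the indices to the rows themselves, the same logical vector can be used to subset the data frame, and sorting the result on the key columns puts each group of matching rows side by side for visual comparison. The following is only a sketch built on the example data from the question; the names key_cols, in_dup_group and dup_rows are made up for illustration and are not part of the original answer.

```r
# Rebuild the example data frame from the question.
df <- data.frame(
  a = c(rep("A", 3), rep("B", 3), rep("C", 2)),
  b = c(1, 1, 2, 4, 1, 1, 2, 2),
  d = c("x", "y", "x", "z", "y", "y", "z", "x")
)

# Hypothetical name: the columns that define what counts as a duplicate.
key_cols <- c("a", "b")

# TRUE for every row whose key occurs more than once,
# including the first occurrence of each key.
in_dup_group <- duplicated(df[, key_cols]) |
  duplicated(df[, key_cols], fromLast = TRUE)

# Keep all columns so the non-key columns (here d) can be compared.
dup_rows <- df[in_dup_group, ]

# Sort on the key columns so rows sharing a key are adjacent.
dup_rows <- dup_rows[do.call(order, unname(as.list(dup_rows[key_cols]))), ]
dup_rows
```

For a real data set, key_cols would be replaced with the 7 or so column names of interest. The original row numbers are preserved as the row names of dup_rows, so they remain available after sorting.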


Source: https://stackoverflow.com/questions/25041933/identifying-duplicated-rows
