Filtering a dataframe showing only duplicates

只愿长相守 submitted on 2019-12-29 09:26:14

Question


I need some help to filter a dataframe.

The df has several columns and I want to split it into two dataframes:

1- One including only the rows in which the first column is a duplicate (including all of the replicas).

2- The rest of the rows, which are not duplicates.

Here is an example: This would be the original.

          V1  V2 
    [1,] "A" "1"
    [2,] "B" "1"
    [3,] "A" "1"
    [4,] "C" "2"
    [5,] "D" "3"
    [6,] "D" "4"

I want to turn it into this:

         V1  V2 
   [1,] "A" "1"
   [2,] "A" "1"
   [3,] "D" "3"
   [4,] "D" "4"

And this:

        V1  V2 
  [1,] "B" "1"
  [2,] "C" "2"

Is there a way to do that? I have tried exporting to Excel, but the dataset was too large to make that viable.

Thank you


Answer 1:


Try

d[!duplicated(d),]

and

d[duplicated(d),]

where d is your data frame.
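To see what the whole-row check does on the question's example, here is a small sketch. The data frame d is rebuilt by hand from the question; storing both columns as character vectors is an assumption.

```r
# Example data rebuilt from the question (column types are an assumption).
d <- data.frame(
  V1 = c("A", "B", "A", "C", "D", "D"),
  V2 = c("1", "1", "1", "2", "3", "4"),
  stringsAsFactors = FALSE
)

# duplicated() on a data frame compares entire rows, and it flags only
# the second and later copies of a row, never the first one.
whole_row_dup <- duplicated(d)

d[whole_row_dup, ]   # only row 3, the repeat of ("A", "1")
d[!whole_row_dup, ]  # rows 1, 2, 4, 5 and 6
```

Rows 5 and 6 share V1 = "D" but differ in V2, so the whole-row check does not flag either of them, and the first "A" row stays in the second result. That is why the question's requirement calls for checking d$V1 alone and in both directions.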

=== UPDATE === If only the first column should be checked, and every copy of a duplicated value needs to go into a separate data frame, you could do:

library(gdata)
d[duplicated2(d$V1, bothWays = TRUE), ]
d[!duplicated2(d$V1, bothWays = TRUE), ]

If only base R is desired, then:

bm <- duplicated(d$V1) | duplicated(d$V1, fromLast = TRUE)
d[bm, ]
d[!bm, ]




Answer 2:


You can use duplicated, but bear in mind that it returns TRUE only from the second occurrence of a value onward. The first occurrence is never flagged, i.e.

> duplicated(c("A", "A", "A"))
[1] FALSE  TRUE  TRUE

does not return TRUE for the first "A". If you want to catch every occurrence of "A", you can combine a forward and a backward pass, e.g.

duplicated(c("A", "A", "A")) | duplicated(c("A", "A", "A"), fromLast = TRUE)
# [1] TRUE TRUE TRUE

You can then separate your data using

## Index of the duplicated values:
indDuplicatedVec <- duplicated(d[,1]) | duplicated(d[,1], fromLast = TRUE)

myDuplicates <- d[indDuplicatedVec, ]
myUniques <- d[!indDuplicatedVec, ]

> myDuplicates
#  V1 V2
#1  A  1
#3  A  1
#5  D  3
#6  D  4

> myUniques
#  V1 V2
#2  B  1
#4  C  2
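If the split is needed in more than one place, the same two-pass check can be wrapped in a small function. The sketch below is only an illustration: split_duplicates is a hypothetical helper name, not a base R function, and the example data frame is rebuilt by hand from the question.

```r
# Hypothetical helper (not part of base R): splits a data frame into the
# rows whose key value occurs more than once and the rows where it occurs once.
split_duplicates <- function(data, column = 1) {
  key <- data[[column]]
  # TRUE for every copy of a repeated value, including the first one.
  is_repeated <- duplicated(key) | duplicated(key, fromLast = TRUE)
  list(
    duplicates = data[is_repeated, , drop = FALSE],
    uniques    = data[!is_repeated, , drop = FALSE]
  )
}

# Example data rebuilt from the question (column types are an assumption).
d <- data.frame(
  V1 = c("A", "B", "A", "C", "D", "D"),
  V2 = c("1", "1", "1", "2", "3", "4"),
  stringsAsFactors = FALSE
)

parts <- split_duplicates(d, "V1")
parts$duplicates  # rows 1, 3, 5 and 6
parts$uniques     # rows 2 and 4
```

The helper assumes a data frame. The printout in the question, with row labels such as [1,] and quoted values, looks like a character matrix, in which case converting it first with as.data.frame() would probably be needed.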



Answer 3:


Taking df as your input, you can use dplyr:

library(dplyr)
df %>% group_by(V1) %>% filter(n() > 1)

for the duplicates, and

df %>% group_by(V1) %>% filter(n() == 1)

for the unique entries.
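A self-contained version of the dplyr approach might look like the sketch below. It assumes the dplyr package is installed, rebuilds the question's data by hand, and adds ungroup() so the results are no longer grouped by V1, which is a choice made here rather than part of the answer.

```r
library(dplyr)

# Example data rebuilt from the question (column types are an assumption).
df <- data.frame(
  V1 = c("A", "B", "A", "C", "D", "D"),
  V2 = c("1", "1", "1", "2", "3", "4"),
  stringsAsFactors = FALSE
)

# n() is the number of rows in the current V1 group.
duplicates <- df %>% group_by(V1) %>% filter(n() > 1) %>% ungroup()
uniques    <- df %>% group_by(V1) %>% filter(n() == 1) %>% ungroup()

duplicates  # the two "A" rows and the two "D" rows
uniques     # the "B" row and the "C" row
```

Without ungroup(), later verbs such as summarise() or mutate() would keep operating per V1 group, which can be surprising when the split results are passed on to other code.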




Answer 4:


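We can use data.table:

library(data.table)
setDT(df)[, .SD[.N > 1], by = V1]

This returns the rows whose V1 value occurs more than once; replacing .N > 1 with .N == 1 returns the remaining rows.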
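Both halves of the split can be produced with data.table along the following lines. This is a sketch that assumes the data.table package is installed and rebuilds the question's data by hand; it uses as.data.table(), which copies df, instead of the setDT(df) call, which converts df in place.

```r
library(data.table)

# Example data rebuilt from the question (column types are an assumption).
df <- data.frame(
  V1 = c("A", "B", "A", "C", "D", "D"),
  V2 = c("1", "1", "1", "2", "3", "4"),
  stringsAsFactors = FALSE
)

# as.data.table() makes a copy, so df itself stays a plain data frame.
dt <- as.data.table(df)

# .N is the number of rows in the current V1 group; .SD holds that group's rows.
duplicates <- dt[, .SD[.N > 1], by = V1]
uniques    <- dt[, .SD[.N == 1], by = V1]

duplicates  # the two "A" rows and the two "D" rows
uniques     # the "B" row and the "C" row
```

Whether to copy or convert in place is a trade-off: setDT() avoids duplicating a large dataset in memory, while the copy keeps the original object unchanged.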


Source: https://stackoverflow.com/questions/43510160/filtering-a-dataframe-showing-only-duplicates
