Identify duplicates of one value with different values in another column

Submitted by 倖福魔咒の on 2019-12-13 17:32:14

Question


I have a data frame of IDs and addresses. Normally, I would expect each recurring ID to have the same address in all observations, but some of my IDs have different addresses. I want to locate the observations that share an ID but have at least 2 different addresses. Then I want to assign a new ID to one of them (an ID that didn't exist in the data frame before).

For example:

ID     Address
1      X
1      X  
1      Y
2      Z
2      Z
3      A
3      B
4      C
4      D
4      E
5      F
5      F
5      F

Will return:

ID    Address
1      X
1      X  
6      Y
2      Z
2      Z
3      A
7      B
4      C
8      D
9      E
5      F
5      F
5      F

So the 3rd, 7th, 9th and 10th observations got new IDs. Note that an ID can have more than 2 different addresses, so a new ID should be assigned for each unique address.

Edit:

I added code for an example data frame that also has a rand column, which should be ignored but kept in the final output.

df <- data.frame(ID = c(1,1,1,2,2,3,3,4,4,4,5,5,5),
             Address = c("x","x","y","z","z","a","b","c","d","e",
                         "f","f","f"),
             rand = sample(1:100, 13))

Answer 1:


Here's a solution with tidyr, using its nest / unnest functions:

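library(tidyr)
library(dplyr)
# Nest so that each (ID, Address) pair becomes one row, give every repeated ID
# a new value above the current maximum, then unnest back to the original rows.
df %>% group_by(ID, Address) %>% nest() %>% ungroup() %>%
  `[<-`(duplicated(.$ID), "ID",
        value = max(.$ID, na.rm = TRUE) + seq_len(sum(duplicated(.$ID)))) %>%
  unnest(data)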

# # A tibble: 13 x 3
# ID Address  rand
#    <dbl>  <fctr> <int>
#  1     1       x    58
#  2     1       x     4
#  3     6       y    75
#  4     2       z     5
#  5     2       z    19
#  6     3       a    55
#  7     7       b    34
#  8     4       c    53
#  9     8       d    98
# 10     9       e    97
# 11     5       f    13
# 12     5       f    64
# 13     5       f    80

If you use magrittr, replace [<- with inset if you want prettier code (same output).
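
As a rough sketch of that variant, assuming the example data frame from the question and a recent tidyr/tibble (hence the explicit `value =` and `unnest(data)`), the same pipeline with `magrittr::inset` could look like this. The name `res` is only illustrative.

```r
library(dplyr)
library(tidyr)
library(magrittr)

# Example data from the question (rand is random and only carried along)
df <- data.frame(ID = c(1,1,1,2,2,3,3,4,4,4,5,5,5),
                 Address = c("x","x","y","z","z","a","b","c","d","e",
                             "f","f","f"),
                 rand = sample(1:100, 13))

res <- df %>%
  group_by(ID, Address) %>%
  nest() %>%        # one row per (ID, Address) pair, rand goes into `data`
  ungroup() %>%
  # inset() is magrittr's alias for `[<-`: overwrite ID on the rows whose ID
  # has already appeared, using values above the current maximum ID
  inset(duplicated(.$ID), "ID",
        value = max(.$ID, na.rm = TRUE) + seq_len(sum(duplicated(.$ID)))) %>%
  unnest(data)      # back to one row per original observation
```

Because the new IDs are assigned while each (ID, Address) pair is a single row, all observations that share a pair end up with the same new ID.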




Answer 2:


One option is data.table. After grouping by 'ID', if the number of unique 'Address' values is greater than 1, take the row index (.I) of the rows whose 'Address' differs from the first unique 'Address', then assign those rows 'ID's that are not already in the original dataset:

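library(data.table)
# Row indices, within IDs that have several addresses, of rows not at the first address
i1 <- setDT(df)[,  .I[if(uniqueN(Address)>1) Address != unique(Address)[1]], ID]$V1
# 1:10 is the pool of candidate IDs for this example; keep only the unused ones
df[i1, ID := head(setdiff(as.numeric(1:10), unique(df$ID)), length(i1))]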
df
#     ID Address rand
#  1:  1       x   58
#  2:  1       x    4
#  3:  6       y   75
#  4:  2       z    5
#  5:  2       z   19
#  6:  3       a   55
#  7:  7       b   34
#  8:  4       c   53
#  9:  8       d   98
# 10:  9       e   97
# 11:  5       f   13
# 12:  5       f   64
# 13:  5       f   80
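
For just the identification step, or as a sanity check after reassigning, a minimal sketch is to count distinct addresses per ID. It assumes the question's example data (without rand); `dt`, `n_addr`, `n_address` and `conflicting` are illustrative names, not part of the answer above.

```r
library(data.table)

# ID/Address columns from the question's example
dt <- data.table(ID = c(1,1,1,2,2,3,3,4,4,4,5,5,5),
                 Address = c("x","x","y","z","z","a","b","c","d","e",
                             "f","f","f"))

# Number of distinct addresses seen for each ID
n_addr <- dt[, .(n_address = uniqueN(Address)), by = ID]

# IDs that occur with more than one address
conflicting <- n_addr[n_address > 1, ID]
conflicting
```

Running the same count on the data after the IDs have been reassigned should leave `conflicting` empty.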

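Or we can use base R:

# IDs with more than one distinct address (rand is left out of the comparison)
ids <- names(which(rowSums(table(unique(df[c("ID", "Address")])))>1))
# Rows of those IDs whose address differs from the ID's first address
i2 <- with(df, ID %in% ids & Address != ave(as.character(Address), 
                     ID, FUN = function(x) x[1]))
# 1:10 is the pool of candidate IDs for this example
df$ID[i2] <- head(setdiff(1:10, unique(df$ID)), sum(i2))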
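
If a hard-coded pool such as 1:10 is not practical, or if several rows can share the same extra address, one possible variation in base R is to work on the distinct (ID, Address) pairs and take new IDs above the current maximum. This is only a sketch under the assumption that IDs are numeric; `id_addr`, `needs_new` and `new_ID` are hypothetical names.

```r
# Example data from the question
df <- data.frame(ID = c(1,1,1,2,2,3,3,4,4,4,5,5,5),
                 Address = c("x","x","y","z","z","a","b","c","d","e",
                             "f","f","f"),
                 rand = sample(1:100, 13))

# Distinct (ID, Address) pairs, in order of first appearance
id_addr <- unique(df[c("ID", "Address")])

# A pair needs a new ID when its ID was already seen with another address
needs_new <- duplicated(id_addr$ID)
id_addr$new_ID <- id_addr$ID
id_addr$new_ID[needs_new] <- max(df$ID) + seq_len(sum(needs_new))

# Look up each row's pair and take that pair's (possibly new) ID.
# The right-hand side is evaluated with the old IDs before df$ID is overwritten.
df$ID <- id_addr$new_ID[match(paste(df$ID, df$Address),
                              paste(id_addr$ID, id_addr$Address))]
df
```

On the example data this yields the same IDs as the outputs above (6, 7, 8 and 9 for rows 3, 7, 9 and 10), and rows that share an ID and address always receive the same new ID.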


Source: https://stackoverflow.com/questions/47012120/identify-duplicates-of-one-value-with-different-values-in-another-column
