Identify duplicates of one value with different values in another column

问题

I have a dataframe of IDs and addresses. Normally, I would expect each recurring ID to have the same address in all observations, but some of my IDs have different addresses. I want to locate those observations that are duplicated on ID, but have at least 2 different addresses. Then, I want to randomize a new ID for one of them (an ID that didn't exist in the DF before).

For example:

ID     Address
1      X
1      X  
1      Y
2      Z
2      Z
3      A
3      B
4      C
4      D
4      E
5      F
5      F
5      F

Will return:

ID    Address
1      X
1      X  
6      Y
2      Z
2      Z
3      A
7      B
4      C
8      D
9      E
5      F
5      F
5      F

So what happened is the 3rd,7th, 9th and 10th observations got new IDs. I will mention that it is possible for an ID to have even more than 2 different addresses, so the granting of new IDs should happen for each unique address.

Edit:

I added a code for a longer example of a data frame, with rand column that should be ignored but kept in final output.

df <- data.frame(ID = c(1,1,1,2,2,3,3,4,4,4,5,5,5),
             Address = c("x","x","y","z","z","a","b","c","d","e",
                         "f","f","f"),
             rand = sample(1:100, 13))

回答1:

Here's a solution with tidyr and functions nest / unnest

library(tidyr)
library(dplyr)
df %>% group_by(ID,Address) %>% nest %>%
  `[<-`(duplicated(.$ID),"ID",max(.$ID, na.rm = TRUE) + 1:sum(duplicated(.$ID))) %>%
  unnest

# # A tibble: 13 x 3
# ID Address  rand
#    <dbl>  <fctr> <int>
#  1     1       x    58
#  2     1       x     4
#  3     6       y    75
#  4     2       z     5
#  5     2       z    19
#  6     3       a    55
#  7     7       b    34
#  8     4       c    53
#  9     8       d    98
# 10     9       e    97
# 11     5       f    13
# 12     5       f    64
# 13     5       f    80

If you use magrittr, replace [<- with inset if you want prettier code (same output).

回答2:

An option would be data.table. After grouping by 'ID', if the number of unique 'Address' is greater than 1 and the 'Address' is not equal to the first unique 'Address', then get the row index (.I) and assign those 'ID' with the 'ID's that are not already in the original dataset

library(data.table)
i1 <- setDT(df)[,  .I[if(uniqueN(Address)>1) Address != unique(Address)[1]], ID]$V1
df[i1, ID := head(setdiff(as.numeric(1:10), unique(df$ID)), length(i1))] 
df
#     ID Address rand
#  1:  1       x   58
#  2:  1       x    4
#  3:  6       y   75
#  4:  2       z    5
#  5:  2       z   19
#  6:  3       a   55
#  7:  7       b   34
#  8:  4       c   53
#  9:  8       d   98
# 10:  9       e   97
# 11:  5       f   13
# 12:  5       f   64
# 13:  5       f   80

Or we can use base R

ids <- names(which(rowSums(table(unique(df)))>1))
i2 <- with(df, ID %in% ids & Address != ave(as.character(Address), 
                     ID, FUN = function(x) x[1]))
df$ID[i2] <- head(setdiff(1:10, unique(df$ID)), sum(i2))

来源：https://stackoverflow.com/questions/47012120/identify-duplicates-of-one-value-with-different-values-in-another-column

标签

duplicates

unique