Question
I have a data frame of IDs and addresses. Normally, I would expect each recurring ID to have the same address in all observations, but some of my IDs have different addresses. I want to locate the observations that share an ID but have at least 2 different addresses. Then I want to generate a new ID for one of them (an ID that didn't exist in the data frame before).
For example:
ID Address
1 X
1 X
1 Y
2 Z
2 Z
3 A
3 B
4 C
4 D
4 E
5 F
5 F
5 F
The expected output is:
ID Address
1 X
1 X
6 Y
2 Z
2 Z
3 A
7 B
4 C
8 D
9 E
5 F
5 F
5 F
So the 3rd, 7th, 9th and 10th observations got new IDs. Note that an ID can have more than 2 different addresses, so a new ID should be assigned for each additional unique address.
Edit:
I added code for a longer example data frame, with a rand column that should be ignored but kept in the final output.
df <- data.frame(ID = c(1,1,1,2,2,3,3,4,4,4,5,5,5),
Address = c("x","x","y","z","z","a","b","c","d","e",
"f","f","f"),
rand = sample(1:100, 13))
Answer 1:
Here's a solution with tidyr, using the functions nest / unnest:
library(tidyr)
library(dplyr)
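df %>%
  group_by(ID, Address) %>%
  nest %>%   # one row per distinct (ID, Address) pair
  # a repeated ID now marks an extra address: give it a new ID above the current maximum
  `[<-`(duplicated(.$ID), "ID", max(.$ID, na.rm = TRUE) + seq_len(sum(duplicated(.$ID)))) %>%
  unnest     # expand back to the original rows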
# # A tibble: 13 x 3
# ID Address rand
# <dbl> <fctr> <int>
# 1 1 x 58
# 2 1 x 4
# 3 6 y 75
# 4 2 z 5
# 5 2 z 19
# 6 3 a 55
# 7 7 b 34
# 8 4 c 53
# 9 8 d 98
# 10 9 e 97
# 11 5 f 13
# 12 5 f 64
# 13 5 f 80
If you use magrittr, replace `[<-` with inset if you want prettier code (same output).
Answer 2:
An option would be data.table. After grouping by 'ID', if the number of unique 'Address' values is greater than 1 and the 'Address' is not equal to the first unique 'Address', get the row index (.I) and assign to those rows 'ID's that are not already in the original dataset:
library(data.table)
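# row indices where an ID has several addresses and the address differs from the first one
i1 <- setDT(df)[, .I[if(uniqueN(Address)>1) Address != unique(Address)[1]], ID]$V1
# draw unused IDs from the candidate pool 1:10 (large enough for this example)
df[i1, ID := head(setdiff(as.numeric(1:10), unique(df$ID)), length(i1))]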
df
# ID Address rand
# 1: 1 x 58
# 2: 1 x 4
# 3: 6 y 75
# 4: 2 z 5
# 5: 2 z 19
# 6: 3 a 55
# 7: 7 b 34
# 8: 4 c 53
# 9: 8 d 98
# 10: 9 e 97
# 11: 5 f 13
# 12: 5 f 64
# 13: 5 f 80
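The candidate pool 1:10 above is hard-coded, so it only works while enough unused IDs fall inside that range. A minimal sketch of one way to avoid this, assuming the IDs are numeric and that counting upward from the current maximum is acceptable (the name dt is only illustrative, not part of the original answer):

```r
library(data.table)

# Example data from the question (rand column omitted for brevity)
dt <- data.table(ID = c(1,1,1,2,2,3,3,4,4,4,5,5,5),
                 Address = c("x","x","y","z","z","a","b","c","d","e",
                             "f","f","f"))

# Same row selection as above: rows whose address differs from the
# first address of an ID that has more than one distinct address
i1 <- dt[, .I[if (uniqueN(Address) > 1) Address != unique(Address)[1]], by = ID]$V1

# New IDs start right after the largest existing ID, so no fixed pool is needed
new_ids <- max(dt$ID, na.rm = TRUE) + seq_along(i1)
dt[i1, ID := new_ids]
dt
```

Like the original, this gives each selected row its own new ID, so it assumes that every extra address occurs only once per ID.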
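Or we can use base R:
# IDs with more than one distinct address (only ID and Address are compared, so rand is ignored)
ids <- names(which(rowSums(table(unique(df[, c("ID", "Address")])))>1))
# rows of those IDs whose address differs from the first address of the ID
i2 <- with(df, ID %in% ids & Address != ave(as.character(Address),
ID, FUN = function(x) x[1]))
# assign unused IDs from the candidate pool 1:10
df$ID[i2] <- head(setdiff(1:10, unique(df$ID)), sum(i2))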
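The row-wise assignments above give every selected row its own new ID, which matches the example only because each extra address appears once. If an extra address could repeat within an ID, one possible base R sketch assigns one new ID per distinct (ID, Address) pair instead. The helper name reassign_ids is hypothetical, and the sketch assumes a plain data.frame with numeric IDs:

```r
# Hypothetical helper (not from the original answers): one new ID per
# distinct (ID, Address) pair beyond the first address seen for each ID.
reassign_ids <- function(df) {
  # distinct pairs, in order of first appearance
  pairs <- unique(df[c("ID", "Address")])
  # TRUE for the 2nd, 3rd, ... distinct address of an ID
  extra <- duplicated(pairs$ID)
  pairs$new_ID <- pairs$ID
  pairs$new_ID[extra] <- max(df$ID, na.rm = TRUE) + seq_len(sum(extra))
  # match every row to its pair and copy the pair's ID back
  key <- function(d) paste(d$ID, d$Address, sep = "\r")
  df$ID <- pairs$new_ID[match(key(df), key(pairs))]
  df
}

df <- data.frame(ID = c(1,1,1,2,2,3,3,4,4,4,5,5,5),
                 Address = c("x","x","y","z","z","a","b","c","d","e",
                             "f","f","f"),
                 rand = sample(1:100, 13))
res <- reassign_ids(df)
res
```

Other columns such as rand are left untouched, and rows that share both ID and Address always end up with the same ID.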
Source: https://stackoverflow.com/questions/47012120/identify-duplicates-of-one-value-with-different-values-in-another-column