Question
I am building on the earlier question "find duplicate, compare a condition, erase one row r" to solve a more complicated case.
Using the following reproducible example:
ID1 <- c("a1", "a4", "a6", "a6", "a5", "a1", NA, "a3", "a2", "a2", "a8", "a9", "a9")
ID2 <- c("b8", "b99", "b5", "b5", "b2", "b8", "b7", "b7", "b6", "b6", NA, "b9", NA)
Value1 <- c(2, 5, 6, 6, 2, 7, NA, 5, NA, 4, 4, 6, 6)
Value2 <- c(23, 51, 63, 64, 23, 23, 5, 6, 4, NA, NA, 4, NA)
Year <- c(2004, 2004, 2004, 2004, 2005, 2004, 2008, 2009, 2008, 2009, 2014, 2016, 2016)
df <- data.frame(ID1, ID2, Value1, Value2, Year)
I want to select the rows that share the same values of ID1, ID2 and Year. For these rows I want to compare Value1 and Value2 across the duplicated rows and, if the values are not the same, erase the row with the higher value in that column (because of the data structure this will be unambiguous).
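As a starting point, the rows that share ID1, ID2 and Year can be flagged in base R. This is only a sketch of the selection step, not of the full clean-up; the names key_cols and is_dup are illustrative and not part of the original code.

```r
ID1 <- c("a1", "a4", "a6", "a6", "a5", "a1", NA, "a3", "a2", "a2", "a8", "a9", "a9")
ID2 <- c("b8", "b99", "b5", "b5", "b2", "b8", "b7", "b7", "b6", "b6", NA, "b9", NA)
Value1 <- c(2, 5, 6, 6, 2, 7, NA, 5, NA, 4, 4, 6, 6)
Value2 <- c(23, 51, 63, 64, 23, 23, 5, 6, 4, NA, NA, 4, NA)
Year <- c(2004, 2004, 2004, 2004, 2005, 2004, 2008, 2009, 2008, 2009, 2014, 2016, 2016)
df <- data.frame(ID1, ID2, Value1, Value2, Year, stringsAsFactors = FALSE)

# Columns that define a duplicate (taken from the description above)
key_cols <- c("ID1", "ID2", "Year")

# duplicated() only marks the second and later copies, so combine a forward
# and a backward pass to mark every member of a duplicated group
is_dup <- duplicated(df[key_cols]) | duplicated(df[key_cols], fromLast = TRUE)

# Rows 1, 3, 4 and 6 are flagged
df[is_dup, ]
```

Rows 12 and 13 are not flagged here because their ID2 differs (b9 versus NA), which is the harder case this question is about.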
Expected Results:
#     ID1  ID2 Value1 Value2 Year
# 1    a1   b8      2     23 2004
# 2    a4  b99      5     51 2004
# 3    a6   b5      6     63 2004
# 5    a5   b2      2     23 2005
# 7  <NA>   b7     NA      5 2008
# 8    a3   b7      5      6 2009
# 9    a2   b6     NA      4 2008
# 10   a2   b6      4     NA 2009
# 11   a8 <NA>      4     NA 2014
# 12   a9   b9      6      4 2016
First solution:
df_new <- aggregate(.~ID1 + ID2 + Year, df, min, na.action = na.pass)
PROBLEM: it deletes rows when one of the IDs is NA.
I then changed the NAs to a character value:
df$ID1[is.na(df$ID1)] <- "Missing_data"
df$ID2[is.na(df$ID2)] <- "Missing_data"
df_new <- aggregate(.~ID1 + ID2 + Year, df, min, na.action = na.pass)
This solves the previous problem but creates a second one.
PROBLEM: the result contains duplicated IDs when, within a single year, one of the IDs appears both as NA and with its actual value (the last 2 rows in df).
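Both problems can be reproduced with a quick check. This is a sketch on the example data; agg_na, df_fill and agg_fill are illustrative names, and stringsAsFactors = FALSE is set explicitly so the ID columns are character vectors regardless of R version.

```r
ID1 <- c("a1", "a4", "a6", "a6", "a5", "a1", NA, "a3", "a2", "a2", "a8", "a9", "a9")
ID2 <- c("b8", "b99", "b5", "b5", "b2", "b8", "b7", "b7", "b6", "b6", NA, "b9", NA)
Value1 <- c(2, 5, 6, 6, 2, 7, NA, 5, NA, 4, 4, 6, 6)
Value2 <- c(23, 51, 63, 64, 23, 23, 5, 6, 4, NA, NA, 4, NA)
Year <- c(2004, 2004, 2004, 2004, 2005, 2004, 2008, 2009, 2008, 2009, 2014, 2016, 2016)
df <- data.frame(ID1, ID2, Value1, Value2, Year, stringsAsFactors = FALSE)

# Problem 1: aggregate() omits rows whose grouping variables contain NA,
# so the three rows with a missing ID disappear (8 groups remain)
agg_na <- aggregate(. ~ ID1 + ID2 + Year, df, min, na.action = na.pass)
nrow(agg_na)

# Problem 2: with a placeholder, a9/b9 and a9/Missing_data are treated as
# two separate groups for 2016, so a9 shows up twice (11 groups in total)
df_fill <- df
df_fill$ID1[is.na(df_fill$ID1)] <- "Missing_data"
df_fill$ID2[is.na(df_fill$ID2)] <- "Missing_data"
agg_fill <- aggregate(. ~ ID1 + ID2 + Year, df_fill, min, na.action = na.pass)
agg_fill[agg_fill$ID1 == "a9", ]
```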
Answer 1:
Here's a dplyr solution:
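library(dplyr)

df %>%
  # smallest Value2 first, then keep one row per ID1/ID2/Year
  arrange(Value2) %>%
  distinct(ID1, ID2, Year, .keep_all = TRUE) %>%
  # NA in ID2 sorts last, so a row with a known ID2 is kept per ID1/Year
  arrange(ID2) %>%
  distinct(ID1, Year, .keep_all = TRUE) %>%
  # same idea for NA in ID1, per ID2/Year
  arrange(ID1) %>%
  distinct(ID2, Year, .keep_all = TRUE)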
#     ID1  ID2 Value1 Value2 Year
# 1    a1   b8      2     23 2004
# 2    a2   b6     NA      4 2008
# 3    a2   b6      4     NA 2009
# 4    a3   b7      5      6 2009
# 5    a4  b99      5     51 2004
# 6    a5   b2      2     23 2005
# 7    a6   b5      6     63 2004
# 8    a8 <NA>      4     NA 2014
# 9    a9   b9      6      4 2016
# 10 <NA>   b7     NA      5 2008
When we arrange by Value2, the smaller values of Value2 will be on top, and distinct will remove any duplicates and keep the first row it finds (i.e. the one with the smaller Value2).
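The same sort-then-keep-first idea can be sketched in base R with order() and duplicated(), which may make the mechanics easier to see. This is a base R analogue rather than the dplyr code itself, and the toy data frame below is made up for illustration.

```r
# Toy data (hypothetical): the two a6 rows share a key but differ in Value2
toy <- data.frame(
  ID1 = c("a6", "a6", "a4"),
  ID2 = c("b5", "b5", "b99"),
  Year = c(2004, 2004, 2004),
  Value2 = c(64, 63, 51),
  stringsAsFactors = FALSE
)

# Step 1: sort so that the smallest Value2 comes first
sorted <- toy[order(toy$Value2), ]

# Step 2: keep only the first row seen for each ID1/ID2/Year combination;
# duplicated() marks every later occurrence of a key
kept <- sorted[!duplicated(sorted[c("ID1", "ID2", "Year")]), ]

# The a6 row with Value2 = 63 survives, the one with 64 is dropped
kept
```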
When we arrange by ID2 and then by ID1, the NA values will be at the bottom, and distinct will exclude those rows if they are duplicates.
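The reason this works is that sorting places NA last: dplyr's arrange() always does, and base R's order() does by default. A row with a known ID is therefore seen first, and the NA row is the one treated as the duplicate. A small base R sketch on the two a9 rows from the example data (variable names are illustrative):

```r
# The two rows for a9 in 2016, listed with the NA row first on purpose
a9 <- data.frame(
  ID1 = c("a9", "a9"),
  ID2 = c(NA, "b9"),
  Value1 = c(6, 6),
  Value2 = c(NA, 4),
  Year = c(2016, 2016),
  stringsAsFactors = FALSE
)

# Sorting by ID2 moves the NA row to the bottom (na.last = TRUE is the default)
sorted <- a9[order(a9$ID2), ]

# De-duplicating on ID1 and Year then keeps the row with the known ID2
kept <- sorted[!duplicated(sorted[c("ID1", "Year")]), ]
kept
```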
Note that I'm using only Value2 to keep small values, as it's still not clear to me what you mean by "value".
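If "value" is meant to cover both columns, one possible variation is to sort by Value1 first and then Value2 before de-duplicating. The sketch below does this in base R under that assumption; keep_first is a hypothetical helper written for this example, not a function from dplyr or base R.

```r
ID1 <- c("a1", "a4", "a6", "a6", "a5", "a1", NA, "a3", "a2", "a2", "a8", "a9", "a9")
ID2 <- c("b8", "b99", "b5", "b5", "b2", "b8", "b7", "b7", "b6", "b6", NA, "b9", NA)
Value1 <- c(2, 5, 6, 6, 2, 7, NA, 5, NA, 4, 4, 6, 6)
Value2 <- c(23, 51, 63, 64, 23, 23, 5, 6, 4, NA, NA, 4, NA)
Year <- c(2004, 2004, 2004, 2004, 2005, 2004, 2008, 2009, 2008, 2009, 2014, 2016, 2016)
df <- data.frame(ID1, ID2, Value1, Value2, Year, stringsAsFactors = FALSE)

# Hypothetical helper: sort by sort_cols (NA last, ties keep their order),
# then keep the first row for each combination of key_cols
keep_first <- function(d, sort_cols, key_cols) {
  ord <- do.call(order, unname(as.list(d[sort_cols])))
  d <- d[ord, , drop = FALSE]
  d[!duplicated(d[key_cols]), , drop = FALSE]
}

# Assumption: lower Value1 wins, with Value2 used to break ties
res <- keep_first(df, c("Value1", "Value2"), c("ID1", "ID2", "Year"))
# Prefer rows with a known ID2, then rows with a known ID1
res <- keep_first(res, "ID2", c("ID1", "Year"))
res <- keep_first(res, "ID1", c("ID2", "Year"))

# Restore the original row order for easier comparison
res <- res[order(as.integer(rownames(res))), ]
res
```

On the example data this keeps original rows 1, 2, 3, 5, 7, 8, 9, 10, 11 and 12. Sorting on Value1 as well means the choice between rows 1 and 6 (both with Value2 = 23) no longer depends on their order in the input.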
Source: https://stackoverflow.com/questions/52814300/find-duplicate-compare-a-condition-erase-one-row-with-nas-r