find duplicate, compare a condition, erase one row - with NAs R

Question

I am building on the question "find duplicate, compare a condition, erase one row r" to solve a more complicated case.

Using the following reproducible example:

ID1<-c("a1","a4","a6","a6","a5", "a1",NA,"a3", "a2","a2", "a8", "a9", "a9")
ID2<-c("b8","b99","b5","b5","b2","b8" , "b7","b7", "b6","b6",NA,"b9",NA)
Value1<-c(2,5,6,6,2,7, NA,5,NA,4,4,6,6)
Value2<- c(23,51,63,64,23,23,5,6,4,NA,NA,4,NA)
Year<- c(2004,2004,2004,2004,2005,2004,2008,2009, 2008,2009,2014,2016,2016)
df<-data.frame(ID1,ID2,Value1,Value2,Year)

I want to select the rows that share the same values of ID1, ID2 and Year. For these rows I want to compare Value1 and Value2 across the duplicate rows and, if the values are not the same, erase the row with the higher value in that column (because of the data structure this will be unambiguous).

Expected Results:

#     ID1  ID2 Value1 Value2 Year
# 1    a1   b8      2     23 2004
# 2    a4  b99      5     51 2004
# 3    a6   b5      6     63 2004

# 5    a5   b2      2     23 2005

# 7  <NA>   b7     NA      5 2008
# 8    a3   b7      5      6 2009
# 9    a2   b6     NA      4 2008
# 10   a2   b6      4     NA 2009
# 11   a8 <NA>      4     NA 2014
# 12   a9   b9      6      4 2016

First solution:

df_new <- aggregate(.~ID1 + ID2 + Year, df, min, na.action = na.pass)

PROBLEM: it deletes rows when one of the IDs is NA.
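The row loss can be reproduced on a smaller case. The following is a minimal sketch in base R, using a hypothetical data frame `toy` (not the `df` above) with one missing key. It suggests that `na.action = na.pass` only lets NA values through in the aggregated columns, while rows whose grouping variable is NA are still dropped by `aggregate`.

```r
# Toy data (hypothetical, for illustration only): the third row has a missing key.
toy <- data.frame(
  key   = c("x", "x", NA),
  value = c(3, 1, 5)
)

# na.pass keeps rows with NA in the data, but aggregate() still omits
# any row whose grouping variable (here: key) is NA.
res <- aggregate(value ~ key, toy, min, na.action = na.pass)

print(res)
nrow(res)  # only the "x" group is left
```

This is why the row with `ID1 = NA` and the row with `ID2 = NA` disappear from `df_new`.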

I then changed NAs to a character value

df$ID1[is.na(df$ID1)] <- "Missing_data"
df$ID2[is.na(df$ID2)] <- "Missing_data"

df_new <- aggregate(.~ID1 + ID2 + Year, df, min, na.action = na.pass)

This solves the previous problem but creates a second one.

PROBLEM: it leaves duplicate IDs when, within a single year, one row has NA for one of the IDs and another row has the actual ID (the last 2 rows of df).
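To make the second problem concrete, here is a small sketch that rebuilds only the last two rows of `df` in a hypothetical data frame `toy` and applies the same placeholder replacement. Because "Missing_data" is treated as just another ID value, the two rows fall into different groups and both survive.

```r
# The last two rows of df, rebuilt in a toy data frame (Value2 left out for brevity).
toy <- data.frame(
  ID1    = c("a9", "a9"),
  ID2    = c("b9", NA),
  Value1 = c(6, 6),
  Year   = c(2016, 2016)
)

# Same workaround as above: replace the missing ID with a placeholder string.
toy$ID2[is.na(toy$ID2)] <- "Missing_data"

# "b9" and "Missing_data" are now two distinct groups for a9 in 2016.
res <- aggregate(Value1 ~ ID1 + ID2 + Year, toy, min, na.action = na.pass)

print(res)
nrow(res)  # both rows are kept
```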

Answer 1:

Here's a dplyr solution:

library(dplyr)

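df %>%
  arrange(Value2) %>%                               # smallest Value2 first, NA last
  distinct(ID1, ID2, Year, .keep_all = TRUE) %>%    # keep the first row per ID1/ID2/Year
  arrange(ID2) %>%                                  # rows with NA in ID2 go last
  distinct(ID1, Year, .keep_all = TRUE) %>%         # keep the first row per ID1/Year (known ID2 wins over NA)
  arrange(ID1) %>%                                  # rows with NA in ID1 go last
  distinct(ID2, Year, .keep_all = TRUE)             # keep the first row per ID2/Year (known ID1 wins over NA)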

#      ID1  ID2 Value1 Value2 Year
# 1    a1   b8      2     23 2004
# 2    a2   b6     NA      4 2008
# 3    a2   b6      4     NA 2009
# 4    a3   b7      5      6 2009
# 5    a4  b99      5     51 2004
# 6    a5   b2      2     23 2005
# 7    a6   b5      6     63 2004
# 8    a8 <NA>      4     NA 2014
# 9    a9   b9      6      4 2016
# 10 <NA>   b7     NA      5 2008

When we arrange by Value2, the smaller values of Value2 will be on top, and distinct will remove any duplicates and keep the first row it finds (i.e. the one with the smaller Value2).

When we arrange by ID2 and then by ID1, the NA values will be at the bottom, and distinct will exclude those rows if they duplicate a row that has the actual ID.

Note that I'm using only Value2 to keep small values, as it's still not clear to me what you mean by "value".
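The sort-then-keep-first idea can also be sketched in base R. This is only an analogue of the pipeline above, not the answer's own code, and it uses a hypothetical data frame `toy` with a single key column `ID`: `order()` puts NA last by default, and `duplicated()` flags every occurrence after the first, so negating it keeps the first row per key.

```r
# Toy data (hypothetical): two rows share the key "a6" and differ in Value2.
toy <- data.frame(
  ID     = c("a6", "a6", "a5"),
  Value2 = c(64, 63, NA)
)

# Sort by Value2; order() places NA values last by default.
sorted <- toy[order(toy$Value2), ]

# Keep only the first row for each ID, i.e. the one with the smallest Value2.
kept <- sorted[!duplicated(sorted$ID), ]

print(kept)
```

For "a6" the row with Value2 = 63 is kept; "a5" has no duplicate, so its row stays even though Value2 is NA.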

Source: https://stackoverflow.com/questions/52814300/find-duplicate-compare-a-condition-erase-one-row-with-nas-r

Tags

if-statement

duplicates

aggregate