Question
Using the following reproducible example:
ID1<-c("a1","a4","a6","a6","a5", "a1" )
ID2<-c("b8","b99","b5","b5","b2","b8" )
Value1<-c(2,5,6,6,2,7)
Value2<- c(23,51,63,64,23,23)
Year<- c(2004,2004,2004,2004,2005,2004)
df<-data.frame(ID1,ID2,Value1,Value2,Year)
I want to select the rows in which ID1, ID2 and Year all match those of another row. For these duplicate rows I want to compare Value1 and Value2 and, if the values are not the same, erase the row with the smaller value.
Expected result:
  ID1 ID2 Value1 Value2 Year         new
2  a4 b99      5     51 2004 a4_b99_2004
4  a6  b5      6     64 2004  a6_b5_2004
5  a5  b2      2     23 2005  a5_b2_2005
6  a1  b8      7     23 2004  a1_b8_2004
I tried the following. First, create a unique identifier from the columns I am interested in:
df$new<-paste(df$ID1,df$ID2, df$Year, sep="_")
I can then use the unique identifier to find the rows of the data frame that contain the duplicates:
IND<-which(duplicated(df$new) | duplicated(df$new, fromLast = TRUE))
Then, in a for loop, if a unique identifier has duplicates, I compare the values and erase the rows. However, the loop is too complicated and I cannot get it to work:
for (i in df$new) {
  if (sum(df$new == i) > 1) {
    ind <- which(df$new == i)
    m = min(df$Value1[ind])
    df <- df[-which.min(df$Value1[ind]), ]
    m = min(df$Value2[ind])
    df <- df[-which.min(df$Value2[ind]), ]
  }
}
Answer 1:
Consider aggregate to retrieve the max values by your grouping of ID1, ID2, and Year:
df_new <- aggregate(.~ID1 + ID2 + Year, df, max)
df_new
#   ID1 ID2 Year Value1 Value2
# 1  a6  b5 2004      6     64
# 2  a1  b8 2004      7     23
# 3  a4 b99 2004      5     51
# 4  a5  b2 2005      2     23
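One thing to keep in mind with this approach: aggregate applies max to Value1 and Value2 separately within each group, so the result is not guaranteed to be one of the original rows. The sketch below uses a small made-up data frame (toy is hypothetical, not the data from the question) in which the two rows of a group disagree on which value is larger:

```r
# Hypothetical data: one group whose two rows disagree on which value is larger.
toy <- data.frame(
  ID1    = c("a1", "a1"),
  ID2    = c("b8", "b8"),
  Year   = c(2004, 2004),
  Value1 = c(2, 7),
  Value2 = c(30, 23)
)

# max() is computed per column, so the result mixes values from both rows.
agg <- aggregate(. ~ ID1 + ID2 + Year, toy, max)
agg
```

Here the single output row has Value1 = 7 and Value2 = 30, a combination that does not appear in toy. For the data in the question this does not matter, because in each duplicated group one row is at least as large as the other in both columns.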
Answer 2:
Some different possibilities. Using dplyr:
library(dplyr)
df %>%
  group_by(ID1, ID2, Year) %>%
  filter(Value1 == max(Value1) & Value2 == max(Value2))
Or:
df %>%
rowwise() %>%
mutate(max_val = sum(Value1, Value2)) %>%
ungroup() %>%
group_by(ID1, ID2, Year) %>%
filter(max_val == max(max_val)) %>%
select(-max_val)
Using data.table:
library(data.table)
setDT(df)[df[, .I[Value1 == max(Value1) & Value2 == max(Value2)], by = list(ID1, ID2, Year)]$V1]
Or:
setDT(df)[, max_val := sum(Value1, Value2), by = 1:nrow(df)
][, filter := max_val == max(max_val), by = list(ID1, ID2, Year)
][filter != FALSE
][, -c("max_val", "filter")]
Or:
subset(setDT(df)[, max_val := sum(Value1, Value2), by = 1:nrow(df)
][, filter := max_val == max(max_val), by = list(ID1, ID2, Year)], filter != FALSE)[, -c("max_val", "filter")]
Answer 3:
A solution without loading any libraries. Result:
             ID1 ID2 Value1 Value2 Year
a6.b5.2004    a6  b5      6     64 2004
a1.b8.2004    a1  b8      7     23 2004
a4.b99.2004   a4 b99      5     51 2004
a5.b2.2005    a5  b2      2     23 2005
Code:
do.call(rbind, lapply(split(df, list(df$ID1, df$ID2, df$Year)), # make identifiers
function(x) {return(x[which.max(x$Value1 + x$Value2),])})) # take max of sum
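Another base R possibility, offered as a sketch rather than a definitive answer: sort the rows so that the preferred row of each group comes first, then drop the later duplicates with duplicated(). This version assumes "larger" means the larger Value1, with ties broken by the larger Value2 (a different rule from the sum used above), and it reuses the new identifier column from the question. The names ord, sorted, kept and result are only illustrative.

```r
df <- data.frame(
  ID1    = c("a1", "a4", "a6", "a6", "a5", "a1"),
  ID2    = c("b8", "b99", "b5", "b5", "b2", "b8"),
  Value1 = c(2, 5, 6, 6, 2, 7),
  Value2 = c(23, 51, 63, 64, 23, 23),
  Year   = c(2004, 2004, 2004, 2004, 2005, 2004)
)
df$new <- paste(df$ID1, df$ID2, df$Year, sep = "_")

# Sort so that, within each identifier, the row with the largest Value1
# (ties broken by the largest Value2) comes first.
ord <- order(df$new, -df$Value1, -df$Value2)
sorted <- df[ord, ]

# duplicated() flags every occurrence of an identifier after the first,
# so negating it keeps exactly one row per identifier.
kept <- sorted[!duplicated(sorted$new), ]

# Put the surviving rows back in their original order.
result <- kept[order(as.integer(rownames(kept))), ]
result
```

On the example data this keeps rows 2, 4, 5 and 6, the same rows as in the expected result of the question, including the new column.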
Source: https://stackoverflow.com/questions/52803902/find-duplicate-compare-a-condition-erase-one-row-r