Question
I have a dataframe of the following form:
  ID value   modified
1 AA    30 2016-11-03
2 AB    40 2016-11-04
3 AD    50 2016-11-05
4 AA    60 2016-11-06
5 AB    20 2016-11-07
I want to identify all rows that share an ID and, for each ID, remove the rows with the older modification dates, keeping only the most recent one. So the output will be:
  ID value   modified
1 AD    50 2016-11-05
2 AA    60 2016-11-06
3 AB    20 2016-11-07
The code I am trying is as follows:
ID <- c('AA', 'AB', 'AD', 'AA', 'AB')
value <- c(30, 40, 50, 60, 20)
modified <- c('2016-11-03', '2016-11-04', '2016-11-05', '2016-11-06', '2016-11-07')
df <- data.frame(ID = ID, value = value, modified = modified)
df
  ID value   modified
1 AA    30 2016-11-03
2 AB    40 2016-11-04
3 AD    50 2016-11-05
4 AA    60 2016-11-06
5 AB    20 2016-11-07
df[!duplicated(df$ID),]
  ID value   modified
1 AA    30 2016-11-03
2 AB    40 2016-11-04
3 AD    50 2016-11-05
But this is not my desired output, because it keeps the first (oldest) row for each ID. How can I remove the old entries instead? Thank you in advance for any clues or hints.
Answer 1:
You can use the dplyr package as follows:
library(dplyr)
library(magrittr)
df %<>% group_by(ID) %>% filter(modified == max(modified))
And in case you want the result in a new variable:
library(dplyr)
df2 <- df %>% group_by(ID) %>% filter(modified == max(modified))
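One caveat with the snippets above: they compare `modified` as it is stored. ISO-formatted strings such as `2016-11-05` happen to sort correctly as text, but if the column is a factor (the default for strings in `data.frame()` before R 4.0), `max()` will fail. A minimal sketch that sidesteps this by converting to `Date` first is shown below. The `as.Date()` step, the `ungroup()`/`arrange()` calls, and the name `latest` are my additions for illustration, not part of the original answer.

```r
library(dplyr)

# Rebuild the example data from the question.
df <- data.frame(
  ID = c("AA", "AB", "AD", "AA", "AB"),
  value = c(30, 40, 50, 60, 20),
  modified = c("2016-11-03", "2016-11-04", "2016-11-05",
               "2016-11-06", "2016-11-07"),
  stringsAsFactors = FALSE
)

# Assumption: parse the strings so max() compares real dates,
# not text or factor levels.
df$modified <- as.Date(df$modified)

latest <- df %>%
  group_by(ID) %>%
  filter(modified == max(modified)) %>%  # keep the newest row(s) per ID
  ungroup() %>%
  arrange(modified)                      # chronological order, as in the question

print(latest)
```

Note that `filter(modified == max(modified))` keeps every row tied for the latest date, so an ID with two rows on the same newest date would still appear twice.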
Answer 2:
You can solve the problem with base R by first sorting the data frame by date:
df <- df[order(df[["modified"]], decreasing = TRUE), ]
Then you can get the final result with your !duplicated solution, which now keeps the first (that is, the most recent) row for each ID:
df[!duplicated(df$ID), ]
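Put together, the base R approach might look like the following self-contained sketch. It assumes the example data from the question. The `as.Date()` conversion, the final re-sort, and the names `sorted_df` and `newest` are illustrative additions rather than part of the original answer.

```r
# Rebuild the example data from the question.
df <- data.frame(
  ID = c("AA", "AB", "AD", "AA", "AB"),
  value = c(30, 40, 50, 60, 20),
  modified = c("2016-11-03", "2016-11-04", "2016-11-05",
               "2016-11-06", "2016-11-07"),
  stringsAsFactors = FALSE
)

# Assumption: convert to Date so ordering does not depend on string format.
df$modified <- as.Date(df$modified)

# Newest rows first, so the first occurrence of each ID is its latest entry.
sorted_df <- df[order(df$modified, decreasing = TRUE), ]

# duplicated() flags every occurrence after the first; drop those.
newest <- sorted_df[!duplicated(sorted_df$ID), ]

# Optional: put the survivors back in chronological order.
newest <- newest[order(newest$modified), ]

print(newest)
```

If the rows are already sorted from oldest to newest, as they are in the question, `df[!duplicated(df$ID, fromLast = TRUE), ]` should give the same rows without the explicit sort, since `fromLast = TRUE` makes `duplicated()` treat the last occurrence of each ID as the one to keep.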
Source: https://stackoverflow.com/questions/40877964/r-identify-duplicate-rows-and-remove-the-old-entryby-date