R: identify duplicate rows and remove the old entry (by date)

Question

I have a dataframe of the following form:

   ID    value    modified
1  AA    30       2016-11-03
2  AB    40       2016-11-04
3  AC    50       2016-11-05
4  AA    60       2016-11-06
5  AB    20       2016-11-07

I want to find all rows that share the same value in the ID column and, for each ID, remove the rows with the older modification dates, keeping only the most recent one. So the output will be:

   ID    value    modified
1  AC    50       2016-11-05
2  AA    60       2016-11-06
3  AB    20       2016-11-07

The code I am trying is as follows:

ID<-c('AA','AB','AC','AA','AB')
value<-c(30,40,50,60,20)
modified<-c('2016-11-03','2016-11-04','2016-11-05','2016-11-06','2016-11-07')
df<-data.frame(ID=ID,value=value,modified=modified)
df
  ID value   modified
1 AA    30 2016-11-03
2 AB    40 2016-11-04
3 AC    50 2016-11-05
4 AA    60 2016-11-06
5 AB    20 2016-11-07

df[!duplicated(df$ID),]
  ID value   modified
1 AA    30 2016-11-03
2 AB    40 2016-11-04
3 AC    50 2016-11-05

But this is not my desired output: duplicated() keeps the first occurrence of each ID, which here is the oldest entry. How can I remove the old entries instead? Thank you in advance for any clues or hints.

Answer 1:

You can use the dplyr package as follows:

library(dplyr)
library(magrittr)

df %<>% group_by(ID) %>% filter(modified==max(modified))

And in case you want the result in a new variable:

library(dplyr)

df2 <- df %>% group_by(ID) %>% filter(modified==max(modified))
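
For reference, here is a self-contained sketch of the same group-wise filter. It assumes the modified column is converted with as.Date so that max() compares real dates rather than strings or factor levels; the ungroup() and arrange() steps and the name latest are additions for illustration, not part of the answer above.

```r
library(dplyr)

# Sample data from the question, with the dates stored as Date objects
df <- data.frame(
  ID = c("AA", "AB", "AC", "AA", "AB"),
  value = c(30, 40, 50, 60, 20),
  modified = as.Date(c("2016-11-03", "2016-11-04", "2016-11-05",
                       "2016-11-06", "2016-11-07"))
)

latest <- df %>%
  group_by(ID) %>%
  filter(modified == max(modified)) %>%  # keep the newest row(s) within each ID
  ungroup() %>%                          # drop the grouping before further use
  arrange(modified)                      # order the result by date

print(latest)
```

Note that if two rows with the same ID share the same latest date, this filter keeps both of them.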

Answer 2:

You can solve the problem with base R by first sorting the data frame by date:

df <- df[order(df[["modified"]], decreasing = TRUE), ]

Then you can get the final result with your !duplicated solution:

df[!duplicated(df$ID), ]
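
Put together, the two steps might look like the sketch below. It assumes the dates are converted with as.Date before sorting; the names sorted, latest, and latest_alt are illustrative. The last line shows a possible shortcut using the fromLast argument of duplicated(), which only gives the right answer when the rows are already in ascending date order, as they are in the question's data.

```r
# Sample data from the question
df <- data.frame(
  ID = c("AA", "AB", "AC", "AA", "AB"),
  value = c(30, 40, 50, 60, 20),
  modified = c("2016-11-03", "2016-11-04", "2016-11-05",
               "2016-11-06", "2016-11-07"),
  stringsAsFactors = FALSE
)
df$modified <- as.Date(df$modified)

# Sort newest first, so the first occurrence of each ID is its latest entry
sorted <- df[order(df$modified, decreasing = TRUE), ]
latest <- sorted[!duplicated(sorted$ID), ]

# Shortcut, valid only if rows are already sorted by ascending date:
# scan from the end so the last occurrence of each ID is the one kept
latest_alt <- df[!duplicated(df$ID, fromLast = TRUE), ]

print(latest)
print(latest_alt)
```

Both results contain the same three rows; latest lists them newest first, while latest_alt keeps the original row order.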

Source: https://stackoverflow.com/questions/40877964/r-identify-duplicate-rows-and-remove-the-old-entryby-date

Tags

dataframe

duplicates