Select minimum data of grouped data - keeping all columns [duplicate]

Question

I am running into a wall here.

I have a data frame with many rows. Here is a schematic example.

#myDf
ID    c1    c2    myDate
A     1     1     01.01.2015
A     2     2     02.02.2014
A     3     3     03.01.2014
B     4     4     09.09.2009
B     5     5     10.10.2010
C     6     6     06.06.2011
....

I need to group my data frame by ID, select the row with the oldest date in each group, and write the output into a new data frame, keeping all columns.

ID    c1    c2    myDate
A     3     3     03.01.2014
B     4     4     09.09.2009
C     6     6     06.06.2011
....

That is how I approach it:

test <- myDf %>%
    group_by(ID) %>%
    mutate(date == as.Date(myDate, format = "%d.%m.%Y")) %>%
    filter(date == min(b2))

To verify: the number of rows in the resulting data frame should equal the number of unique IDs.

unique(myDf$ID) %>% length == nrow(test)

FALSE

Does not work. I tried this:

newDf <- ddply(.data = myDf,
              .variables = "ID",
              .fun = function(piece){
                  take.this.row <- piece$myDate %>% as.Date(format="%d.%m.%Y") %>% which.min
                  piece[take.this.row,]
                  })

That ran seemingly forever, so I terminated it.

Why is the first approach not working and what would be a good way to approach the problem?

Answer 1:

Since your dataset is fairly large, I think data.table is the better choice. Here is a data.table version that solves your problem; it should be quicker than the dplyr approach:

library(data.table)
df <- data.table(ID=c("A","A","A","B","B","C"),c1=1:6,c2=1:6,
                 myDate=c("01.01.2015","02.02.2014",
                          "03.01.2014","09.09.2009","10.10.2010","06.06.2011"))
df[,myDate:=as.Date(myDate, '%d.%m.%Y')]

df_new <- df[ df[, .I[myDate == min(myDate)], by=ID]$V1 ]
df_new
#    ID c1 c2     myDate
# 1:  A  3  3 2014-01-03
# 2:  B  4  4 2009-09-09
# 3:  C  6  6 2011-06-06

PS: you can use setDT(myDf) to convert a data.frame to a data.table by reference.
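
To see what the indexing expression in this answer does, it can help to run the inner step on its own. The following is a minimal sketch that reuses the sample data from this answer; the names `idx` and `df_alt` are introduced here purely for illustration, and the `.SD[which.min(...)]` form is an alternative idiom, not the one used above.

```r
library(data.table)

df <- data.table(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6, c2 = 1:6,
                 myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
                            "09.09.2009", "10.10.2010", "06.06.2011"))
df[, myDate := as.Date(myDate, format = "%d.%m.%Y")]

# .I holds the row numbers of the full table; subsetting it per group
# yields the positions of the rows with the earliest date in each ID.
# The result has one column per grouping variable plus V1 (the row numbers).
idx <- df[, .I[myDate == min(myDate)], by = ID]
idx$V1

# Alternative (assumption: you want exactly one row per ID even if dates tie):
# which.min() returns only the first position of the minimum.
df_alt <- df[, .SD[which.min(myDate)], by = ID]
df_alt
```

The `.I` form returns every row that ties for the minimum date within a group, whereas the `which.min` form returns only the first such row, so the two can differ in row count when dates repeat within an ID.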

Answer 2:

After grouping by 'ID', we can use which.min to get the index of the earliest 'myDate' (after converting it to Date class) and extract that row with slice.

library(dplyr)
df1 %>% 
   group_by(ID) %>% 
   slice(which.min(as.Date(myDate, '%d.%m.%Y')))
#     ID    c1    c2     myDate
#  (chr) (int) (int)      (chr)
#1     A     3     3 03.01.2014
#2     B     4     4 09.09.2009
#3     C     6     6 06.06.2011

data

df1 <- structure(list(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6, 
c2 = 1:6, myDate = c("01.01.2015", "02.02.2014", "03.01.2014", 
"09.09.2009", "10.10.2010", "06.06.2011")), .Names = c("ID", 
"c1", "c2", "myDate"), class = "data.frame", row.names = c(NA, 
 -6L))
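
As for why the first attempt in the question fails: judging only from the posted code, `mutate(date == ...)` uses the comparison operator `==` where an assignment with `=` is needed, and `min(b2)` refers to a column that does not appear in the sample data. A corrected sketch under those assumptions, with the sample data rebuilt so the block runs on its own (`test` keeps the name from the question):

```r
library(dplyr)

df1 <- data.frame(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6, c2 = 1:6,
                  myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
                             "09.09.2009", "10.10.2010", "06.06.2011"),
                  stringsAsFactors = FALSE)

test <- df1 %>%
  group_by(ID) %>%
  # '=' creates the new column; '==' would only compare
  mutate(date = as.Date(myDate, format = "%d.%m.%Y")) %>%
  # compare against the minimum of the column that actually exists
  filter(date == min(date)) %>%
  ungroup()

# One row per ID, as long as no group has tied minimum dates
length(unique(df1$ID)) == nrow(test)
```

Unlike `slice(which.min(...))`, the `filter` version keeps every row that ties for the earliest date, so the row-count check can still come out FALSE on data with repeated dates within an ID.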

Answer 3:

If you want to use only base R functions, you can also go with aggregate and merge.

# data (from response above)

df1 <- structure(list(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6, 
                  c2 = 1:6, myDate = c("01.01.2015", "02.02.2014", "03.01.2014", 
                                       "09.09.2009", "10.10.2010", "06.06.2011")),
             .Names = c("ID","c1", "c2", "myDate"),
             class = "data.frame", row.names = c(NA,-6L))

# convert your date column to POSIXct object

df1$myDate = as.POSIXct(df1$myDate,format="%d.%m.%Y")

# Use the aggregate function to find the minimum date in each group.
# In this case the variable of interest is the myDate column and the
# grouping variable is the "ID" column.
# The function picks out the minimum date per group and creates a new
# data frame with the columns "ID" and "myDate"

df2 = aggregate(list(myDate = df1$myDate),list(ID = df1$ID),
            function(x){x[which(x == min(x))]})

df2

# Use the merge function to merge your original data frame with the
# data from the aggregate function

merge(df1,df2)
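
If a group can contain tied minimum dates, `which(x == min(x))` returns more than one position for that group. A base-R sketch that sidesteps this by sorting and keeping the first row per ID; the names `sorted` and `oldest` are made up for this example, and it converts with `as.Date` rather than `as.POSIXct`:

```r
df1 <- data.frame(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6, c2 = 1:6,
                  myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
                             "09.09.2009", "10.10.2010", "06.06.2011"),
                  stringsAsFactors = FALSE)
df1$myDate <- as.Date(df1$myDate, format = "%d.%m.%Y")

# Sort by ID, then by date, so the oldest row comes first within each ID
sorted <- df1[order(df1$ID, df1$myDate), ]

# duplicated() flags every repeat of an ID after its first occurrence,
# so negating it keeps exactly the first (oldest) row per ID
oldest <- sorted[!duplicated(sorted$ID), ]
oldest
```

This keeps all columns without a merge step and returns exactly one row per ID.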

Source: https://stackoverflow.com/questions/33415297/select-minimum-data-of-grouped-data-keeping-all-columns

Tags

dplyr

plyr