Question
I am running into a wall here.
I have a data frame with many rows.
Here is a schematic example.
#myDf
ID c1 c2 myDate
A 1 1 01.01.2015
A 2 2 02.02.2014
A 3 3 03.01.2014
B 4 4 09.09.2009
B 5 5 10.10.2010
C 6 6 06.06.2011
....
I need to group my data frame by ID, select the row with the oldest date in each group, and write the output to a new data frame, keeping all columns.
ID c1 c2 myDate
A 3 3 03.01.2014
B 4 4 09.09.2009
C 6 6 06.06.2011
....
That is how I approach it:
test <- myDf %>%
group_by(ID) %>%
mutate(date == as.Date(myDate, format = "%d.%m.%Y")) %>%
filter(date == min(b2))
To verify: the nrow of my resulting data frame should equal the number of IDs that unique returns.
unique(myDf$ID) %>% length == nrow(test)
FALSE
Does not work. I tried this:
newDf <- ddply(.data = myDf,
.variables = "ID",
.fun = function(piece){
take.this.row <- piece$myDate %>% as.Date(format="%d.%m.%Y") %>% which.min
piece[take.this.row,]
})
That runs forever; I terminated it.
Why is the first approach not working and what would be a good way to approach the problem?
Answer 1:
Since you have a fairly large dataset, data.table may be the better choice. Here is a data.table version that solves your problem; on large data it will likely be quicker than the dplyr package:
library(data.table)
df <- data.table(ID=c("A","A","A","B","B","C"),c1=1:6,c2=1:6,
myDate=c("01.01.2015","02.02.2014",
"03.01.2014","09.09.2009","10.10.2010","06.06.2011"))
df[,myDate:=as.Date(myDate, '%d.%m.%Y')]
df_new <- df[ df[, .I[myDate == min(myDate)], by=ID]$V1 ]
df_new
ID c1 c2 myDate
1: A 3 3 2014-01-03
2: B 4 4 2009-09-09
3: C 6 6 2011-06-06
PS: you can use setDT(myDf) to convert a data.frame to a data.table by reference.
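The setDT route mentioned in the PS could look roughly like the sketch below. It assumes the data.table package is installed; myDf here is a small stand-in for the asker's data frame, and oldest is a hypothetical name. It uses .SD with which.min rather than the .I idiom above, so this is a variation, not the answer's exact method.

```r
library(data.table)

# Hypothetical stand-in for the asker's data frame
myDf <- data.frame(
  ID = c("A", "A", "A", "B", "B", "C"),
  c1 = 1:6,
  c2 = 1:6,
  myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
             "09.09.2009", "10.10.2010", "06.06.2011"),
  stringsAsFactors = FALSE
)

setDT(myDf)  # converts the data.frame to a data.table in place
myDf[, myDate := as.Date(myDate, format = "%d.%m.%Y")]

# which.min() returns a single index per group, so each ID yields one row
oldest <- myDf[, .SD[which.min(myDate)], by = ID]
oldest
```

One difference worth knowing: myDate == min(myDate) keeps every row that ties for the minimum date, while which.min keeps only the first such row per group.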
Answer 2:
After grouping by 'ID', we can use which.min to get the index of the earliest 'myDate' (after converting to Date class) and extract that row with slice.
library(dplyr)
df1 %>%
group_by(ID) %>%
slice(which.min(as.Date(myDate, '%d.%m.%Y')))
# ID c1 c2 myDate
# (chr) (int) (int) (chr)
#1 A 3 3 03.01.2014
#2 B 4 4 09.09.2009
#3 C 6 6 06.06.2011
data
df1 <- structure(list(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6,
c2 = 1:6, myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
"09.09.2009", "10.10.2010", "06.06.2011")), .Names = c("ID",
"c1", "c2", "myDate"), class = "data.frame", row.names = c(NA,
-6L))
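As for why the first attempt in the question fails: as posted, mutate(date == ...) is a comparison rather than an assignment, so no date column is created, and filter() refers to b2, which is not a column in the example data. The following is a minimal sketch of that pipeline with both points changed, assuming dplyr is installed and using the same example data; it may not match the asker's real column names.

```r
library(dplyr)

df1 <- data.frame(
  ID = c("A", "A", "A", "B", "B", "C"),
  c1 = 1:6,
  c2 = 1:6,
  myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
             "09.09.2009", "10.10.2010", "06.06.2011"),
  stringsAsFactors = FALSE
)

test <- df1 %>%
  mutate(date = as.Date(myDate, format = "%d.%m.%Y")) %>%  # `=` creates the column
  group_by(ID) %>%
  filter(date == min(date)) %>%  # compare against the column created above
  ungroup()

# The check from the question
length(unique(df1$ID)) == nrow(test)
```

Note that filter(date == min(date)) keeps every row tied for the earliest date, so on real data the row count can still exceed the number of IDs; slice(which.min(...)) always returns one row per group.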
Answer 3:
If you want to use only base R functions, you can also use aggregate and merge.
# data (from response above)
df1 <- structure(list(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6,
c2 = 1:6, myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
"09.09.2009", "10.10.2010", "06.06.2011")),
.Names = c("ID","c1", "c2", "myDate"),
class = "data.frame", row.names = c(NA,-6L))
# convert your date column to POSIXct object
df1$myDate = as.POSIXct(df1$myDate,format="%d.%m.%Y")
# Use the aggregate function to find the minimum date in each group.
# Here the variable of interest is the "myDate" column and the
# grouping variable is the "ID" column.
# The function returns the minimum date per group in a new data frame
# with columns "ID" and "myDate"
df2 = aggregate(list(myDate = df1$myDate),list(ID = df1$ID),
function(x){x[which(x == min(x))]})
df2
# Use the merge function to merge your original data frame with the
# data from the aggregate function
merge(df1,df2)
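By default merge joins on the columns the two data frames share, here ID and myDate. A slightly more explicit sketch of the same idea is shown below; it assumes Date instead of POSIXct and plain min as the aggregation function, both of which are substitutions rather than the answer's exact code, and result is a hypothetical name.

```r
df1 <- data.frame(
  ID = c("A", "A", "A", "B", "B", "C"),
  c1 = 1:6,
  c2 = 1:6,
  myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
             "09.09.2009", "10.10.2010", "06.06.2011"),
  stringsAsFactors = FALSE
)
df1$myDate <- as.Date(df1$myDate, format = "%d.%m.%Y")

# One minimum date per ID
df2 <- aggregate(list(myDate = df1$myDate), list(ID = df1$ID), min)

# Join back on both keys to recover the remaining columns
result <- merge(df1, df2, by = c("ID", "myDate"))
result
```

If several rows in a group share the minimum date, the merge returns all of them.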
Source: https://stackoverflow.com/questions/33415297/select-minimum-data-of-grouped-data-keeping-all-columns