Applying an aggregate function over multiple different slices


Question


I have a data frame that contains some information about people and projects:

person_id | project_id | action | time
--------------------------------------
        1 |          1 |      w |    1
        1 |          2 |      w |    2
        1 |          3 |      w |    2
        1 |          3 |      r |    3
        1 |          3 |      w |    4
        1 |          4 |      w |    4
        2 |          2 |      r |    2
        2 |          2 |      w |    3
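
For anyone who wants to run the answers below, here is a minimal sketch that builds this sample as an R data frame (the name data follows the question's own code; action is kept as plain character strings):

data <- data.frame(
  person_id  = c(1, 1, 1, 1, 1, 1, 2, 2),
  project_id = c(1, 2, 3, 3, 3, 4, 2, 2),
  action     = c("w", "w", "w", "r", "w", "w", "r", "w"),
  time       = c(1, 2, 2, 3, 4, 4, 2, 3),
  stringsAsFactors = FALSE)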

I'd like to augment this data with a couple more fields, "first_time" and "first_time_project", which identify the first time any action by that person was seen and the first time that person performed any action on that project. In the end, the data should look like this:

person_id | project_id | action | time | first_time | first_time_project
------------------------------------------------------------------------
        1 |          1 |      w |    1 |          1 |                  1
        1 |          2 |      w |    2 |          1 |                  2
        1 |          3 |      w |    2 |          1 |                  2
        1 |          3 |      r |    3 |          1 |                  2
        1 |          3 |      w |    4 |          1 |                  2
        1 |          4 |      w |    4 |          1 |                  4
        2 |          2 |      r |    2 |          2 |                  2
        2 |          2 |      w |    3 |          2 |                  2

My naive way of doing this is to write a couple of loops:

for (pid in unique(data$person_id)) {
    # first time any action by this person was seen
    data[data$person_id == pid, "first_time"] <- min(data[data$person_id == pid, "time"])
    for (projid in unique(data[data$person_id == pid, "project_id"])) {
        # first time this person acted on this project
        data[data$person_id == pid & data$project_id == projid, "first_time_project"] <-
            min(data[data$person_id == pid & data$project_id == projid, "time"])
    }
}

Now, it doesn't take a genius to see that this is going to be glacially slow with the doubly nested loops. However, I can't figure out a better way to handle this in R. I'm essentially emulating SQL's GROUP BY. I know that by might be able to help, but I can't figure out how to group over multiple slices at once.

Any hints on how to take my code from glacially slow to something a bit faster? I'd be happy with a snail right now.


Answer 1:


Try ave:

transform(data,
   # earliest time for each person
   first_time = ave(time, person_id, FUN = min),
   # earliest time for each person/project pair
   first_time_project = ave(time, person_id, project_id, drop = TRUE, FUN = min)
)
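
Note that ave returns a vector the same length as its input, with each element replaced by FUN applied to that element's group, which is why it drops straight into transform. The extra grouping arguments are passed through to interaction(), so drop = TRUE skips person/project combinations that never occur in the data.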



Answer 2:


The combination of Hadley's plyr and transform() is powerful. If I understand your question correctly, then:

library(plyr)

# earliest time per person, then per person/project pair
foo <- ddply(foo, .(person_id), transform, first_time = min(time))
foo <- ddply(foo, .(person_id, project_id), transform,
  first_time_project = min(time))
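
plyr has since been superseded by dplyr; a roughly equivalent sketch, assuming foo is the question's data frame (this translation is not part of the original answer), would be:

library(dplyr)

foo <- foo %>%
  group_by(person_id) %>%
  mutate(first_time = min(time)) %>%
  group_by(person_id, project_id) %>%
  mutate(first_time_project = min(time)) %>%
  ungroup()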



Answer 3:


If speed is what you are looking for, then data.table is the way to go.

library(data.table)
DT <- data.table(foo)
DT[, first_time := min(time), by = person_id]
DT[, first_time_project := min(time), by = list(person_id, project_id)]
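
A usage note: because := updates DT by reference and [ returns the data.table, the two steps can also be chained:

DT[, first_time := min(time), by = person_id][
   , first_time_project := min(time), by = list(person_id, project_id)]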



Answer 4:


A quick and dirty solution with no explicit loops:

library(plyr)

# for one person's rows: add first_time, then first_time_project per project
fp <- function(dat) {
  dat$first_time <- min(dat$time)
  ftp <- function(d) { d$first_time_project <- min(d$time); d }
  ddply(dat, .(project_id), ftp)
}

# this single call should give you the result you want
result <- ddply(data, .(person_id), fp)
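
Note that this nests one ddply inside another, so it does the same grouping work as the two flat calls in the plyr answer above; it reads as a single call but should not be expected to run any faster, and on large data the data.table approach will likely remain the quickest of the bunch.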



Answer 5:


A quick way I can think of:

foo <- data.frame(
       person_id  = rep(1:5, each = 6),
       project_id = sample(1:5, 30, replace = TRUE),
       time       = sample(1:30))

# earliest time for each person, broadcast back onto the rows with match()
first_time <- aggregate(foo$time, list(foo$person_id), min)
foo$first_time <- first_time[match(foo$person_id, first_time[, 1]), 2]

# earliest time for each person/project pair, matched on a composite key
first_proj <- aggregate(foo$time, list(foo$person_id, foo$project_id), min)
foo$first_time_project <- first_proj[
  match(paste(foo$person_id, foo$project_id),
        paste(first_proj[, 1], first_proj[, 2])), 3]
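
This aggregate/match combination is the plain base-R spelling of the same group-and-broadcast pattern that ave performs in one line in the first answer. The paste() trick builds a composite key for matching on two columns at once; it works here because the ids are simple integers, but it can misfire if the separator character ever appears inside the values themselves.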


Source: https://stackoverflow.com/questions/4998846/applying-an-aggregate-function-over-multiple-different-slices
