Efficient functional programming (using mapply) in R for a “naturally” procedural problem

Submitted by 穿精又带淫゛_ on 2020-01-21 06:01:05

Question


A common use case in R (at least for me) is identifying observations in a data frame that have some characteristic that depends on the values in some subset of other observations.

To make this more concrete, suppose I have a number of workers (indexed by WorkerId) that have an associated "Iteration":

    raw <- data.frame(WorkerId=c(1,1,1,1,2,2,2,2,3,3,3,3),
              Iteration = c(1,2,3,4,1,2,3,4,1,2,3,4))

and I want to eventually subset the data frame to exclude the "last" iteration (by creating a "remove" boolean) for each worker. I can write a function to do this:

raw$remove <- mapply(function(wid,iter){
                              iter==max(raw$Iteration[raw$WorkerId==wid])},
                 raw$WorkerId, raw$Iteration)

> raw$remove
  [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE

but this gets very slow as the data frame gets larger (presumably because I'm needlessly computing the max for every observation).

My question is: what is a more efficient (and idiomatic) way of doing this in the functional programming style? Is it first creating a WorkerId-to-max-value dictionary and then using that as a parameter in another function that operates on each observation?


Answer 1:


The "most natural way" IMO is the split-lapply-rbind method. You start by split()-ting into a list of groups, then lapply() the processing rule (in this case removing the last row) and then rbind() them back together. It's all doable as a nested set of function calls. The inner two steps are illustrated here and the final one-liner is presented at the bottom:

> lapply( split(raw, raw$WorkerId), function(x) x[-NROW(x),] )
$`1`
  WorkerId Iteration
1        1         1
2        1         2
3        1         3

$`2`
  WorkerId Iteration
5        2         1
6        2         2
7        2         3

$`3`
   WorkerId Iteration
9         3         1
10        3         2
11        3         3

do.call(rbind,  lapply( split(raw, raw$WorkerId), function(x) x[-NROW(x),] ) ) 

Hadley Wickham has developed a wide set of tools, the plyr package, that extend this strategy to a wider variety of tasks.
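
One caveat worth illustrating: the split-lapply-rbind call drops the last row by position, so it only removes the highest iteration if each worker's rows are already ordered. The following is a minimal sketch, assuming the rows may arrive shuffled; the names `shuffled`, `ordered` and `kept` are made up for this illustration.

```r
# Same data as in the question.
raw <- data.frame(WorkerId  = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                  Iteration = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4))

# Assumption for this sketch: the rows are not in order.
set.seed(1)
shuffled <- raw[sample(nrow(raw)), ]

# Sort by worker and iteration so "last row" means "highest iteration".
ordered <- shuffled[order(shuffled$WorkerId, shuffled$Iteration), ]

# Split by worker, drop the last row of each piece, and bind the pieces back.
kept <- do.call(rbind,
                lapply(split(ordered, ordered$WorkerId),
                       function(x) x[-NROW(x), ]))
```

If the data are already sorted, the `order()` step is unnecessary and the one-liner above is enough.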




Answer 2:


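For the specific problem posed, use !rev(duplicated(rev(raw$WorkerId))) or, better, following Charles' advice, !duplicated(raw$WorkerId, fromLast=TRUE). Both flag the last row seen for each WorkerId, so they assume each worker's rows appear in increasing Iteration order.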
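
As a minimal sketch of how that flag could be used on the question's data (the name `kept` is made up for this illustration):

```r
# Same data as in the question.
raw <- data.frame(WorkerId  = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                  Iteration = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4))

# duplicated(..., fromLast = TRUE) is FALSE only for the last occurrence
# of each WorkerId, so negating it marks exactly that row.
raw$remove <- !duplicated(raw$WorkerId, fromLast = TRUE)

# Keep every row that is not flagged.
kept <- raw[!raw$remove, ]
```

This never computes a maximum at all; it relies purely on row order.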




Answer 3:


This situation is tailor-made for using the plyr package.

library(plyr)
ddply(raw, .(WorkerId), function(df) df[-NROW(df),])

It produces the output

  WorkerId Iteration
1        1         1
2        1         2
3        1         3
4        2         1
5        2         2
6        2         3
7        3         1
8        3         2
9        3         3



Answer 4:


subset(raw, Iteration != ave(Iteration, WorkerId, FUN=max))
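
The ave() call computes each worker's maximum once and repeats it across that worker's rows. The "dictionary" idea from the question can also be written out explicitly. The following is a sketch only, using tapply() to build the lookup; the name `max_iter` is made up for this illustration.

```r
# Same data as in the question.
raw <- data.frame(WorkerId  = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                  Iteration = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4))

# One maximum per worker, named by WorkerId ("1", "2", "3").
max_iter <- tapply(raw$Iteration, raw$WorkerId, max)

# Look up each row's maximum by name and compare.
# as.vector() drops the names and dim that tapply() carries along.
raw$remove <- as.vector(raw$Iteration == max_iter[as.character(raw$WorkerId)])

kept <- raw[!raw$remove, ]
```

Unlike the row-order approaches, this compares values, so it does not depend on how the rows are sorted.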



Answer 5:


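# TRUE only for the last row of each worker (assumes rows are ordered by Iteration within each WorkerId)
remove <- with(raw, as.logical(ave(Iteration, WorkerId, FUN=function(x) c(rep(FALSE, length(x)-1), TRUE))))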
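
To check the slowdown described in the question against a grouped approach, a rough timing sketch such as the following could be used. The synthetic data frame `big` and its sizes are assumptions made for this illustration, and the timings will vary by machine.

```r
# Synthetic data: 200 workers with 10 iterations each (sizes are arbitrary).
n_workers <- 200
n_iter    <- 10
big <- data.frame(WorkerId  = rep(seq_len(n_workers), each = n_iter),
                  Iteration = rep(seq_len(n_iter), times = n_workers))

# The question's approach: rescans the whole data frame for every row.
slow_time <- system.time(
  slow <- mapply(function(wid, iter) iter == max(big$Iteration[big$WorkerId == wid]),
                 big$WorkerId, big$Iteration)
)

# Grouped approach: one max per worker, broadcast by ave().
fast_time <- system.time(
  fast <- with(big, Iteration == ave(Iteration, WorkerId, FUN = max))
)

# Both should produce the same flags.
stopifnot(identical(slow, fast))

# Elapsed seconds for each approach.
print(c(mapply = slow_time[["elapsed"]], ave = fast_time[["elapsed"]]))
```

Increasing `n_workers` should make the gap more visible, since the mapply version does work proportional to the square of the row count.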


Source: https://stackoverflow.com/questions/6167791/efficient-functional-programming-using-mapply-in-r-for-a-naturally-procedura
