Working with unique values at scale (for loops, apply, or plyr)

Question

I'm not sure if this is possible, but if it is, it would make life oh so much more efficient.

The general problem that would be interesting to the wider SO community: for loops (and base functions like apply) are applicable for general/consistent operations, like adding X to every column or row of a data frame. I have a general/consistent operation I want to carry out, but with unique values for each element of the data frame.

Is there a way to do this more efficiently than subsetting my data frame for every grouping, applying the function with specific numbers relative to that grouping, then recombining? I don't care if it's a for loop or apply, but bonus points if it makes use of plyr functionality.

Here's the more specific problem I'm working on: I've got the data below. Ultimately what I want is a time-series data frame with a date column plus one column per region, where each region's column shows its relation to some benchmark.

The problem: the measure of interest for each region is different, and so is the benchmark. Here's the data:

library(dplyr)
library(reshape2)

data <- data.frame(
    region = sample(c("northeast","midwest","west"), 100, replace = TRUE),
    date = rep(seq(as.Date("2010-02-01"), length.out = 10, by = "1 day"), 10),
    population = sample(50000:100000, 10, replace = TRUE),  # 10 values, recycled to 100 rows
    skiers = sample(1:100),
    bearsfans = sample(1:100),
    dudes = sample(1:100)
)

and the summary frame that I'm working off:

# %>% is used here in place of dplyr's older %.% chaining operator
data2 <- data %>%
    group_by(date, region) %>%
    summarise(skiers = sum(skiers), 
            bearsfans= sum(bearsfans), 
            dudes = sum(dudes), 
            population = sum(population)) %>%
    mutate(ppl_per_skier = population/skiers,
            ppl_per_bearsfan = population/bearsfans,
            ppl_per_dude = population/dudes) %>%
    select(date, region, ppl_per_skier, ppl_per_bearsfan , ppl_per_dude)

Here's the tricky part:

For the Northeast, I only care about "ppl_per_skier", and the benchmark is 3500
For the Midwest, I only care about "ppl_per_bearsfan", and the benchmark is 1200
For the West, I only care about "ppl_per_dude", and the benchmark is 5000

All of the ways I've come up with to tackle this problem involve creating a subset for each measure, but doing this at scale with hundreds of measures and different benchmarks is... not ideal. For example:

midwest <- data2 %>% 
            filter(region == "midwest") %>%
            select(date, region, ppl_per_bearsfan) %>%
            mutate(bmark = 1200, against_bmk = bmark/ppl_per_bearsfan-1) %>%
            select(date, against_bmk)

and likewise for each region, its respective measure, and its respective benchmark, then recombining them all together by date. Ultimately, I want something like this, where each region's performance against its specific benchmark and measure is laid out by date (this is fake data, of course):

        date midwest_againstbmk northeast_againstbmk west_againstbmk
1 2010-02-10          0.9617402            0.6008032       0.3403260
2 2010-02-11          0.5808621            0.5119942       0.7787559
3 2010-02-12          0.4828346            0.6560053       0.3747920
4 2010-02-13          0.6499841            0.7567194       0.8387461
5 2010-02-14          0.6367520            0.4564254       0.7269161

Is there a way to get to this sort of data and structure without having to do X number of subsets for each grouping, when I have unique measures and benchmark values for each group?

Answer 1:

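This seems like an obvious use case for mapply, which walks over the per-region data frames, the measure names, and the benchmarks in parallel. Note that split() orders the groups alphabetically by region (midwest, northeast, west), so the measure and benchmark vectors have to be given in that same order: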

> mapply(function(d,y,b) {(b/d[[y]])-1},   # d[[y]] pulls the column out as a plain vector
         split(data2,data2$region), 
         c('ppl_per_bearsfan','ppl_per_skier','ppl_per_dude'), 
         c(1200,3500,5000))
          midwest   northeast      west
 [1,] -0.26625428 -0.02752186 3.5881957
 [2,]  0.48715638  1.89169295 2.6928546
 [3,] -0.94222992  1.26065537 4.0388343
 [4,] -0.38116663  0.79572184 1.4118364
 [5,] -0.05937874  2.05459482 1.8822015
 [6,] -0.41463925  1.60668461 1.5914408
 [7,] -0.31211391  1.21093777 2.7517886
 [8,] -0.88923466  0.44917981 1.2251965
 [9,] -0.02781965 -0.24637182 2.7143103
[10,] -0.46643682  1.28944776 0.6246315
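
The matrix above drops the dates. If the date column is needed, one possible alternative is to keep the region-to-measure-to-benchmark mapping in a small lookup table and join it onto a long version of the data. The following is only a minimal sketch under some assumptions: `data2_demo` is a made-up stand-in with the same columns as data2, and the names `benchmarks`, `long` and `result` are hypothetical, chosen for illustration.

```r
library(dplyr)
library(reshape2)

# Made-up stand-in for data2 (same columns as the summary frame in the question)
data2_demo <- data.frame(
    date = rep(as.Date(c("2010-02-01", "2010-02-02")), each = 3),
    region = rep(c("midwest", "northeast", "west"), times = 2),
    ppl_per_skier = c(1000, 1750, 2000, 1400, 3500, 7000),
    ppl_per_bearsfan = c(600, 800, 900, 1200, 1000, 1500),
    ppl_per_dude = c(4000, 3000, 2500, 8000, 6000, 5000),
    stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per region with its measure and benchmark
benchmarks <- data.frame(
    region = c("northeast", "midwest", "west"),
    measure = c("ppl_per_skier", "ppl_per_bearsfan", "ppl_per_dude"),
    bmark = c(3500, 1200, 5000),
    stringsAsFactors = FALSE
)

# Long format: one row per date, region and measure
long <- melt(data2_demo, id.vars = c("date", "region"),
             variable.name = "measure", value.name = "value")
long$measure <- as.character(long$measure)  # melt returns a factor here

# The join keeps only each region's own measure; then compute the ratio
# and spread the regions back out into columns, one row per date
result <- long %>%
    inner_join(benchmarks, by = c("region", "measure")) %>%
    mutate(against_bmk = bmark / value - 1) %>%
    dcast(date ~ region, value.var = "against_bmk")

# Rename the region columns to match the layout asked for in the question
names(result)[-1] <- paste0(names(result)[-1], "_againstbmk")
```

With this layout, adding another region or measure should only mean adding a row to the lookup table rather than writing another subset, and a date with no rows for some region would show up as NA in that region's column.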

Source: https://stackoverflow.com/questions/22116489/working-with-unique-values-at-scale-for-loops-apply-or-plyr

Tags

for-loop

plyr

apply

dplyr