Beginner tips on using plyr to calculate year-over-year change across groups

猫巷女王i 2020-12-15 01:27

I am new to plyr (and R) and looking for a little help to get started. Using the baseball dataset as an example, how could I calculate the year-over-year (yoy) change in \"a

2 answers
  •  夕颜 (original poster)
    2020-12-15 02:10

    I know you asked for a "plyr"-specific solution, but for the sake of sharing, here is an alternative approach in base R. I find the base R approach just as readable and, at least in this particular case, it is considerably faster (roughly five times faster in the benchmark below).

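    library(plyr)  # provides the baseball dataset
    # df1: total at-bats (ab) per year, league, and team
    df1 <- aggregate(ab ~ year + lg + team, FUN = sum, data = baseball)
    output <- within(df1, {
      # per team/league group: NA for the first row, then successive differences
      yoy <- ave(ab, team, lg, FUN = function(x) c(NA, diff(x)))
    })
    head(output)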
    #   year lg team   ab  yoy
    # 1 1884 UA  ALT  108   NA
    # 2 1997 AL  ANA 1703   NA
    # 3 1998 AL  ANA 1502 -201
    # 4 1999 AL  ANA  660 -842
    # 5 2000 AL  ANA   85 -575
    # 6 2001 AL  ANA  219  134
    
    library(rbenchmark)
    
    # Time 100 runs of the plyr approach against the base R approach
    benchmark(DDPLY = {
      ddply(df1, .(team, lg), mutate,
            yoy = c(NA, diff(ab)))
    }, WITHIN = {
      within(df1, {
        yoy <- ave(ab, team, lg, FUN = function(x) c(NA, diff(x)))
      })
    }, columns = c("test", "replications", "elapsed", 
                   "relative", "user.self"))
    #     test replications elapsed relative user.self
    # 1  DDPLY          100  10.675    4.974    10.609
    # 2 WITHIN          100   2.146    1.000     2.128
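
    To make the ave() plus diff() idiom easier to follow, here is a minimal sketch on a tiny made-up data frame. The name toy and its values are hypothetical and not taken from the baseball data. It also shows why the rows should be ordered by year within each group first, since diff() only compares adjacent rows.

    ```r
    # Toy data (hypothetical values, deliberately out of order for team A).
    toy <- data.frame(
      year = c(2001, 2000, 2002, 2000, 2001),
      team = c("A", "A", "A", "B", "B"),
      ab   = c(120, 100, 90, 50, 80)
    )

    # diff() compares adjacent rows, so sort by year within each team first.
    toy <- toy[order(toy$team, toy$year), ]

    # ave() applies FUN to each team's values and returns a vector aligned
    # with the rows; c(NA, diff(x)) pads the first year of every team with NA.
    toy$yoy <- ave(toy$ab, toy$team, FUN = function(x) c(NA, diff(x)))
    print(toy)
    ```

    Because ave() returns one value per input row, the result can be assigned straight back as a new column without any merge step.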
    

    Update: data.table

    If your data are very large, check out data.table. Even on this small example you'll find a good speedup in relative terms, and the syntax is compact and, in my opinion, easy to read.

    library(plyr)        # provides the baseball dataset
    # Total at-bats (ab) per year, league, and team
    df1 <- aggregate(ab ~ year + lg + team, FUN = sum, data = baseball)
    library(data.table)
    DT <- data.table(df1)
    DT
    #       year lg team   ab
    #    1: 1884 UA  ALT  108
    #    2: 1997 AL  ANA 1703
    #    3: 1998 AL  ANA 1502
    #    4: 1999 AL  ANA  660
    #    5: 2000 AL  ANA   85
    #   ---                  
    # 2523: 1895 NL  WSN  839
    # 2524: 1896 NL  WSN  982
    # 2525: 1897 NL  WSN 1426
    # 2526: 1898 NL  WSN 1736
    # 2527: 1899 NL  WSN  787
    

    Now, look at this concise solution:

    DT[, yoy := c(NA, diff(ab)), by = "team,lg"]
    DT
    #       year lg team   ab  yoy
    #    1: 1884 UA  ALT  108   NA
    #    2: 1997 AL  ANA 1703   NA
    #    3: 1998 AL  ANA 1502 -201
    #    4: 1999 AL  ANA  660 -842
    #    5: 2000 AL  ANA   85 -575
    #   ---                       
    # 2523: 1895 NL  WSN  839  290
    # 2524: 1896 NL  WSN  982  143
    # 2525: 1897 NL  WSN 1426  444
    # 2526: 1898 NL  WSN 1736  310
    # 2527: 1899 NL  WSN  787 -949
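
    One caveat applies to every variant above: c(NA, diff(ab)) compares each row with the previous row in its group, so if a team has no record for some year, the value is the change since its last recorded season rather than a strict year-over-year change. The following is only a sketch of one way to blank out such rows, written in base R so it has no dependencies. The data frame toy_gap and its values are made up for illustration.

    ```r
    # Hypothetical data: team A has no row for 2002.
    toy_gap <- data.frame(
      year = c(2000, 2001, 2003, 2004, 2000, 2001),
      team = c("A", "A", "A", "A", "B", "B"),
      ab   = c(100, 120, 90, 95, 50, 80)
    )

    # Row-to-row differences within each team, for both the value and the year.
    lag_diff <- function(x) c(NA, diff(x))
    change <- ave(toy_gap$ab,   toy_gap$team, FUN = lag_diff)
    step   <- ave(toy_gap$year, toy_gap$team, FUN = lag_diff)

    # Keep the change only where the previous row is exactly one year earlier;
    # first rows (step is NA) and rows after a gap become NA.
    toy_gap$yoy <- ifelse(step == 1, change, NA)
    print(toy_gap)
    ```

    Whether a gap should yield NA or the change since the last recorded season depends on what the analysis needs, so treat this as one option rather than a required step.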
    
