Beginner tips on using plyr to calculate year-over-year change across groups

猫巷女王i 2020-12-15 01:27

I am new to plyr (and R) and looking for a little help to get started. Using the baseball dataset as an example, how could I calculate the year-over-year (yoy) change in "a

2 Answers
  • 2020-12-15 02:10

    I know you asked for a plyr-specific solution, but for the sake of sharing, here is an alternative approach in base R. I find the base R approach just as readable, and, at least in this particular case, it's a lot faster!

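    # df1 holds total at-bats per year/league/team (baseball ships with plyr)
    library(plyr)
    df1 <- aggregate(ab ~ year + lg + team, FUN = sum, data = baseball)
    output <- within(df1, {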
      yoy <- ave(ab, team, lg, FUN = function(x) c(NA, diff(x)))
    })
    head(output)
    #   year lg team   ab  yoy
    # 1 1884 UA  ALT  108   NA
    # 2 1997 AL  ANA 1703   NA
    # 3 1998 AL  ANA 1502 -201
    # 4 1999 AL  ANA  660 -842
    # 5 2000 AL  ANA   85 -575
    # 6 2001 AL  ANA  219  134
    
    library(rbenchmark)
    
    benchmark(DDPLY = {
      ddply(df1, .(team, lg), mutate,
            yoy = c(NA, diff(ab)))
    }, WITHIN = {
      within(df1, {
        yoy <- ave(ab, team, lg, FUN = function(x) c(NA, diff(x)))
      })
    }, columns = c("test", "replications", "elapsed", 
                   "relative", "user.self"))
    #     test replications elapsed relative user.self
    # 1  DDPLY          100  10.675    4.974    10.609
    # 2 WITHIN          100   2.146    1.000     2.128
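
    To see what the ave() call is doing, here is a minimal sketch on made-up data. The toy data frame and the helper name yoy_change are hypothetical and only for illustration; the sketch assumes the rows are already sorted by year within each group.

    ```r
    # Toy data (hypothetical values): two teams, sorted by year within team.
    toy <- data.frame(
      team = c("A", "A", "A", "B", "B"),
      year = c(2000, 2001, 2002, 2000, 2001),
      ab   = c(10, 15, 12, 100, 90)
    )

    # diff() returns n - 1 differences, so pad with a leading NA
    # to keep the result the same length as the group.
    yoy_change <- function(x) c(NA, diff(x))

    # ave() applies the function within each team and returns the
    # results in the original row order.
    toy$yoy <- ave(toy$ab, toy$team, FUN = yoy_change)
    print(toy)
    ```

    The first year of each team has nothing to compare against, which is why it comes back as NA.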
    

    Update: data.table

    If your data are very large, check out data.table. Even with this example, you'll find a good speedup in relative terms. Plus the syntax is super compact and, in my opinion, easily readable.

    library(plyr)
    df1 <- aggregate(ab~year+lg+team, FUN=sum, data=baseball)
    library(data.table)
    DT <- data.table(df1)
    DT
    #       year lg team   ab
    #    1: 1884 UA  ALT  108
    #    2: 1997 AL  ANA 1703
    #    3: 1998 AL  ANA 1502
    #    4: 1999 AL  ANA  660
    #    5: 2000 AL  ANA   85
    #   ---                  
    # 2523: 1895 NL  WSN  839
    # 2524: 1896 NL  WSN  982
    # 2525: 1897 NL  WSN 1426
    # 2526: 1898 NL  WSN 1736
    # 2527: 1899 NL  WSN  787
    

    Now, look at this concise solution:

    DT[, yoy := c(NA, diff(ab)), by = "team,lg"]
    DT
    #       year lg team   ab  yoy
    #    1: 1884 UA  ALT  108   NA
    #    2: 1997 AL  ANA 1703   NA
    #    3: 1998 AL  ANA 1502 -201
    #    4: 1999 AL  ANA  660 -842
    #    5: 2000 AL  ANA   85 -575
    #   ---                       
    # 2523: 1895 NL  WSN  839  290
    # 2524: 1896 NL  WSN  982  143
    # 2525: 1897 NL  WSN 1426  444
    # 2526: 1898 NL  WSN 1736  310
    # 2527: 1899 NL  WSN  787 -949
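
    Note that diff() only compares each row with the row before it, so the result is a true year-over-year change only if each group is sorted by year. If you are not sure about the row order, a sketch along these lines may help. The toy table is made up, and using setorder() plus shift() here is my suggestion rather than part of the solution above.

    ```r
    library(data.table)

    # Toy data (hypothetical values) with rows deliberately out of year order.
    toy <- data.table(
      team = c("A", "A", "A", "B", "B"),
      year = c(2002, 2000, 2001, 2001, 2000),
      ab   = c(12, 10, 15, 90, 100)
    )

    # Sort by group and year first, so "previous row" means "previous year".
    setorder(toy, team, year)

    # shift() lags ab by one row within each team; the first row of each
    # group has no predecessor and gets NA.
    toy[, yoy := ab - shift(ab), by = team]
    print(toy)
    ```

    This still treats the previous row as the previous year, so a team with missing seasons would get a change across the gap rather than an NA.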
    
  • 2020-12-15 02:30

    How about using diff():

    df <- read.table(header = TRUE, text = '  year lg team   ab
      1884 UA  ALT  108
      1997 AL  ANA 1703
      1998 AL  ANA 1502
      1999 AL  ANA  660
      2000 AL  ANA   85
      2001 AL  ANA  219')
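    library(plyr)
    # Within each team/league group, diff() gives the change from the
    # previous row; the leading NA covers the first year of the group.
    ddply(df, .(team, lg), mutate,
          yoy = c(NA, diff(ab)))
    #   year lg team   ab  yoy
    # 1 1884 UA  ALT  108   NA
    # 2 1997 AL  ANA 1703   NA
    # 3 1998 AL  ANA 1502 -201
    # 4 1999 AL  ANA  660 -842
    # 5 2000 AL  ANA   85 -575
    # 6 2001 AL  ANA  219  134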
    