ddply for sum by group in R

后悔当初 2020-12-02 23:30

I have a sample data frame "data" as follows:

X            Y  Month   Year    income
2281205 228120  3   2011    1000
2281212 228121  9   2010    1100
22812         


        
2 Answers
  • 2020-12-02 23:56

    I think dplyr is both faster and more elegant than plyr::ddply for this.

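    library(dplyr)

    # Read the sample data from the clipboard (first row holds the column names)
    testData <- read.table(file = "clipboard", header = TRUE)

    testData %>%
      group_by(Y) %>%
      summarise(total = sum(income), freq = n()) %>%  # per-group sum and row count
      filter(freq > 3)                                 # keep groups with more than 3 rows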
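
    Because the snippet above depends on whatever is on the clipboard, here is a minimal self-contained sketch of the same pipeline. The data frame is hypothetical: only the column names Y and income come from the question, and the values are made up for illustration.

    ```r
    library(dplyr)

    # Hypothetical sample data (values invented; only the column names
    # Y and income are taken from the question)
    testData <- data.frame(
      Y      = c(228120, 228121, 228122, 228122, 228122, 228122),
      income = c(1000, 1100, 1500, 1600, 1700, 1800)
    )

    result <- testData %>%
      group_by(Y) %>%
      summarise(total = sum(income), freq = n()) %>%  # per-group sum and row count
      filter(freq > 3)                                 # keep groups with more than 3 rows

    print(result)
    ```

    With this made-up data only the group Y = 228122 has more than three rows, so it is the only one that survives the filter.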
    
  • 2020-12-03 00:11

    As pointed out in a comment, you can perform multiple operations inside a single summarize call.

    This reduces your code to one line of ddply() and one line of subsetting, which is easy enough with the [ operator:

    library(plyr)
    x <- ddply(data, .(Y), summarize, freq = length(Y), tot = sum(income))
    x[x$freq > 3, ]  # keep only groups with more than 3 rows
    
           Y freq  tot
    3 228122    4 6778
    

    This is also exceptionally easy with the data.table package:

    library(data.table)
    data.table(data)[, list(freq=length(income), tot=sum(income)), by=Y][freq > 3]
            Y freq  tot
    1: 228122    4 6778
    

    In fact, data.table has a built-in shortcut for the number of rows in each group, .N, so you do not need length() at all:

    data.table(data)[, list(freq=.N, tot=sum(income)), by=Y][freq > 3]
            Y freq  tot
    1: 228122    4 6778
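
    To make the equivalence between length() and .N concrete, here is a small self-contained sketch. The data is hypothetical (invented values, with only the column names Y and income taken from the question), and the column names freq_length and freq_N are chosen just for this illustration.

    ```r
    library(data.table)

    # Hypothetical sample data (values invented for illustration)
    dt <- data.table(
      Y      = c(228120, 228121, 228122, 228122, 228122, 228122),
      income = c(1000, 1100, 1500, 1600, 1700, 1800)
    )

    # Compute the group size both ways, plus the per-group income total
    counts <- dt[, list(freq_length = length(income),
                        freq_N      = .N,
                        tot         = sum(income)),
                 by = Y]

    # Keep only groups with more than 3 rows
    result <- counts[freq_N > 3]
    print(result)
    ```

    Both count columns should agree for every group, which is why .N can replace length(income) in the grouped call.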
    