I have a sample dataframe "data" as follows:
X Y Month Year income
2281205 228120 3 2011 1000
2281212 228121 9 2010 1100
22812
I think the package dplyr is faster than plyr::ddply and more elegant.
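library(dplyr)

# Read the sample table after copying it to the clipboard
testData <- read.table(file = "clipboard", header = TRUE)

# Sum income and count rows per Y, then keep groups with more than 3 rows
testData %>%
  group_by(Y) %>%
  summarise(total = sum(income), freq = n()) %>%
  filter(freq > 3)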
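For anyone who does not have the original table on the clipboard, here is a minimal self-contained sketch of the same pipeline. The data frame `toy` and its values are made up for illustration and are not the asker's data; only the column names `Y` and `income` follow the question.

```r
library(dplyr)

# Hypothetical toy data: Y = 1 has four rows, Y = 2 has only two
toy <- data.frame(
  Y      = c(1, 1, 1, 1, 2, 2),
  income = c(10, 20, 30, 40, 5, 5)
)

# Sum income and count rows per group, then keep groups with more than 3 rows
result <- toy %>%
  group_by(Y) %>%
  summarise(total = sum(income), freq = n()) %>%
  filter(freq > 3)

print(result)
```

Only the group with four rows should survive the filter, since the other group has just two.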
As pointed out in a comment, you can do multiple operations inside the summarize.
This reduces your code to one line of ddply() and one line of subsetting, which is easy enough with the [ operator:
library(plyr)
x <- ddply(data, .(Y), summarize, freq = length(Y), tot = sum(income))
x[x$freq > 3, ]
Y freq tot
3 228122 4 6778
This is also exceptionally easy with the data.table package:
library(data.table)
data.table(data)[, list(freq=length(income), tot=sum(income)), by=Y][freq > 3]
Y freq tot
1: 228122 4 6778
In fact, data.table has a built-in shortcut for counting the rows in each group, .N, so you don't need length() at all:
data.table(data)[, list(freq=.N, tot=sum(income)), by=Y][freq > 3]
Y freq tot
1: 228122 4 6778
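To check that `.N` and `length()` agree, here is a small sketch that computes both counts side by side. The table `toy` and its values are hypothetical, invented for this illustration; only the column names `Y` and `income` come from the question.

```r
library(data.table)

# Hypothetical toy data: Y = 1 has four rows, Y = 2 has only two
toy <- data.table(
  Y      = c(1, 1, 1, 1, 2, 2),
  income = c(10, 20, 30, 40, 5, 5)
)

# Count rows per group two ways, alongside the income total
counts <- toy[, list(freq_len = length(income), freq_n = .N, tot = sum(income)), by = Y]

# Keep only groups with more than 3 rows
res <- counts[freq_n > 3]

print(counts)
print(res)
```

Both count columns should match for every group, so `.N` can replace `length(income)` here.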