I have a sample data frame "data" as follows:
X Y Month Year income
2281205 228120 3 2011 1000
2281212 228121 9 2010 1100
22812
I think the dplyr package is faster than plyr::ddply and more elegant.
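library(dplyr)

# Read the sample data copied to the clipboard (file = "clipboard" is platform-dependent)
testData <- read.table(file = "clipboard", header = TRUE)

testData %>%
  group_by(Y) %>%                                 # one group per value of Y
  summarise(total = sum(income), freq = n()) %>%  # sum income and count rows per group
  filter(freq > 3)                                # keep groups with more than 3 rows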
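Because the pipeline above reads from the clipboard, it cannot be run as-is elsewhere. Here is a minimal self-contained sketch of the same steps. The data frame is a hypothetical stand-in (the question's full data is not shown), so the values and the result are illustrative only.

```r
library(dplyr)

# Hypothetical stand-in data: NOT the data from the question
testData <- data.frame(
  Y      = c(1, 1, 1, 1, 2, 2, 3),
  income = c(10, 20, 30, 40, 5, 5, 100)
)

result <- testData %>%
  group_by(Y) %>%
  summarise(total = sum(income), freq = n()) %>%  # n() counts rows in the current group
  filter(freq > 3)                                # only Y == 1 has more than 3 rows

print(result)
```

With this toy input only the group Y == 1 survives the filter, with total = 100 and freq = 4.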
As pointed out in a comment, you can do multiple operations inside the summarize call.
This reduces your code to one line of ddply() and one line of subsetting, which is easy enough with the [ operator:
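library(plyr)
# Per value of Y: count the rows and sum the income
x <- ddply(data, .(Y), summarize, freq = length(Y), tot = sum(income))
# Keep only the groups with more than 3 rows
x[x$freq > 3, ]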
Y freq tot
3 228122 4 6778
This is also exceptionally easy with the data.table package:
library(data.table)
data.table(data)[, list(freq=length(income), tot=sum(income)), by=Y][freq > 3]
Y freq tot
1: 228122 4 6778
In fact, data.table has its own shortcut for counting the rows in each group, so you don't need length() at all. Use .N:
data.table(data)[, list(freq=.N, tot=sum(income)), by=Y][freq > 3]
Y freq tot
1: 228122 4 6778
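To check that .N and length() give the same per-group counts, here is a small sketch on hypothetical data (the values are made up for illustration and are not the data from the question):

```r
library(data.table)

# Hypothetical stand-in data: NOT the data from the question
dt <- data.table(
  Y      = c(1, 1, 1, 1, 2, 2, 3),
  income = c(10, 20, 30, 40, 5, 5, 100)
)

# Count rows per group two ways: length() of a column vs. the .N shortcut
by_length <- dt[, list(freq = length(income), tot = sum(income)), by = Y]
by_N      <- dt[, list(freq = .N,             tot = sum(income)), by = Y]

# Both approaches yield the same counts for every group
stopifnot(all(by_length$freq == by_N$freq))

# Chain a second [ ] to filter on the computed column
big_groups <- by_N[freq > 3]
print(big_groups)
```

On this toy input, big_groups holds a single row for Y == 1 (freq 4, tot 100).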