How does one aggregate and summarize data quickly?

那年仲夏 提交于 2019-11-30 05:03:06
Ramnath

You should look at the package data.table for faster aggregation operations on large data frames. For your problem, the solution would look like:

library(data.table)
data_t = data.table(data_tab)
ans = data_t[,list(A = sum(count), B = mean(count)), by = 'PID,Time,Site']

Let's see how fast data.table is and compare to using dplyr. Thishis would be roughly the way to do it in dplyr.

data %>% group_by(PID, Time, Site, Rep) %>%
    summarise(totalCount = sum(Count)) %>%
    group_by(PID, Time, Site) %>% 
    summarise(mean(totalCount))

Or perhaps this, depending on exactly how the question is interpreted:

    data %>% group_by(PID, Time, Site) %>%
        summarise(totalCount = sum(Count), meanCount = mean(Count)  

Here is a full example of these alternatives versus @Ramnath proposed answer and the one @David Arenburg proposed in the comments , which I think is equivalent to the second dplyr statement.

nrow <- 510000
data <- data.frame(PID = sample(letters, nrow, replace = TRUE), 
                   Time = sample(letters, nrow, replace = TRUE),
                   Site = sample(letters, nrow, replace = TRUE),
                   Rep = rnorm(nrow),
                   Count = rpois(nrow, 100))


library(dplyr)
library(data.table)

Rprof(tf1 <- tempfile())
ans <- data %>% group_by(PID, Time, Site, Rep) %>%
    summarise(totalCount = sum(Count)) %>%
    group_by(PID, Time, Site) %>% 
    summarise(mean(totalCount))
Rprof()
summaryRprof(tf1)  #reports 1.68 sec sampling time

Rprof(tf2 <- tempfile())
ans <- data %>% group_by(PID, Time, Site, Rep) %>%
    summarise(total = sum(Count), meanCount = mean(Count)) 
Rprof()
summaryRprof(tf2)  # reports 1.60 seconds

Rprof(tf3 <- tempfile())
data_t = data.table(data)
ans = data_t[,list(A = sum(Count), B = mean(Count)), by = 'PID,Time,Site']
Rprof()
summaryRprof(tf3)  #reports 0.06 seconds

Rprof(tf4 <- tempfile())
ans <- setDT(data)[,.(A = sum(Count), B = mean(Count)), by = 'PID,Time,Site']
Rprof()
summaryRprof(tf4)  #reports 0.02 seconds

The data table method is much faster, and the setDT is even faster!

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!