Fast way to find min in groups after excluding observations using R

Question

I need to do something like the following on a very large data set (with many groups), and I have read that using .SD is slow. Is there a faster way to perform this operation?

To be more precise, I need to create a new column that contains the min value for each group after excluding a subset of observations in that group (something similar to MINIFS in Excel).

library(data.table)
dt <- data.table(valid = c(0,1,1,0,1),
                   a = c(1,1,2,3,4),
                   groups = c("A", "A", "A", "B", "B"))

dt[, valid_min := .SD[valid == 1, min(a, na.rm = TRUE)], by = groups]

With the output:

> dt
   valid a groups valid_min
1:     0 1      A         1
2:     1 1      A         1
3:     1 2      A         1
4:     0 3      B         4
5:     1 4      B         4

To make it even more complicated, a group could have no valid entries at all, or its valid entries could all be missing (NA). In both cases the result should be NA. My current code is similar to this:

dt <- data.table(valid = c(0,1,1,0,1,0,1,1),
                 a = c(1,1,2,3,4,3,NA,NA),
                 k = c("A", "A", "A", "B", "B", "C", "D", "D"))

dt[, valid_min := .SD[valid == 1, 
                      ifelse(all(is.na(a)), NA_real_, min(a, na.rm = TRUE))], by = k]

Output:

> dt
   valid  a k valid_min
1:     0  1 A         1
2:     1  1 A         1
3:     1  2 A         1
4:     0  3 B         4
5:     1  4 B         4
6:     0  3 C        NA
7:     1 NA D        NA
8:     1 NA D        NA

Answer 1:

There's...

dt[dt[valid == 1 & !is.na(a), min(a), by=k], on=.(k), the_min := i.V1]

This should be fast since the inner call to min is optimized for groups. (See ?GForce.)
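
As a rough, self-contained sketch of the join-update above, here it is run on the question's second data set. The only changes are assumptions made for illustration: the intermediate result is stored in a variable called mins, and the new column is named valid_min rather than the_min so that it matches the question.

```r
library(data.table)

# Second example from the question: group C has no valid rows,
# group D has only valid rows whose value is NA.
dt <- data.table(valid = c(0, 1, 1, 0, 1, 0, 1, 1),
                 a     = c(1, 1, 2, 3, 4, 3, NA, NA),
                 k     = c("A", "A", "A", "B", "B", "C", "D", "D"))

# Step 1: filter once, then take the grouped min.
# The unnamed result column is called V1 by data.table.
mins <- dt[valid == 1 & !is.na(a), min(a), by = k]

# Step 2: update join. Groups absent from `mins` (here C and D)
# find no match, so their valid_min stays NA.
dt[mins, on = .(k), valid_min := i.V1]

print(dt)
```

Groups with nothing left after the filter never appear in mins, which is why no explicit all(is.na(a)) check is needed. Passing verbose = TRUE to the inner call should report whether the GForce optimization was applied.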

Answer 2:

We can do the same using dplyr:

library(dplyr)

dt %>% 
  group_by(groups) %>% 
  mutate(valid_min = min(ifelse(valid == 1,
                                a, NA),
                         na.rm = TRUE))

Which gives:

  valid     a groups valid_min
  <dbl> <dbl>  <chr>     <dbl>
1     0     1      A         1
2     1     1      A         1
3     1     2      A         1
4     0     3      B         4
5     1     4      B         4
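
One caveat: when a group has no valid, non-missing values, min(..., na.rm = TRUE) returns Inf and emits a warning. A possible variant that returns NA for those groups is sketched below on the question's second data set. The helper name min_or_na and the tibble name dat are made up for this example.

```r
library(dplyr)

# Second example from the question, as a tibble.
dat <- tibble(
  valid = c(0, 1, 1, 0, 1, 0, 1, 1),
  a     = c(1, 1, 2, 3, 4, 3, NA, NA),
  k     = c("A", "A", "A", "B", "B", "C", "D", "D")
)

# Hypothetical helper: min of x, or NA when x is empty.
min_or_na <- function(x) if (length(x) > 0) min(x) else NA_real_

res <- dat %>%
  group_by(k) %>%
  # Keep only valid, non-missing values before taking the min.
  mutate(valid_min = min_or_na(a[valid == 1 & !is.na(a)])) %>%
  ungroup()

print(res)
```

Subsetting a before calling min avoids both the Inf result and the warning, so no suppressWarnings() or Inf-to-NA cleanup step is required.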

Alternatively, if you are not interested in keeping the 'non-valid' rows, we can do the following:

dt %>% 
  filter(valid == 1) %>% 
  group_by(groups) %>% 
  mutate(valid_min = min(a))

Looks like I provided the slowest approach. Comparing each approach (using a larger, replicated data frame called df) with a microbenchmark test:

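library(microbenchmark)
library(ggplot2)     # needed for autoplot()
library(dplyr)       # for the dplyr.test expression
library(data.table)  # for the data.table.test and GForce.test expressions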
mbm <- microbenchmark(
  dplyr.test = suppressWarnings(df %>% 
                                  group_by(k) %>% 
                                  mutate(valid_min = min(ifelse(valid == 1,
                                                                a, NA),
                                                         na.rm = TRUE),
                                         valid_min = ifelse(valid_min == Inf,
                                                            NA,
                                                            valid_min))),


  data.table.test = df[, valid_min := .SD[valid == 1, 
                                          ifelse(all(is.na(a)), NA_real_, min(a, na.rm = TRUE))], by = k],
  GForce.test = df[df[valid == 1 & !is.na(a), min(a), by=k], on=.(k), the_min := i.V1]
)

autoplot(mbm)
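
The construction of df is not shown in the answer. One way such a replicated data set might be built is sketched below. The replication count n_rep, the name base, and the suffixing of k to create many distinct groups are all assumptions, not the answerer's actual setup.

```r
library(data.table)

# Small base table from the question.
base <- data.table(valid = c(0, 1, 1, 0, 1, 0, 1, 1),
                   a     = c(1, 1, 2, 3, 4, 3, NA, NA),
                   k     = c("A", "A", "A", "B", "B", "C", "D", "D"))

n_rep <- 1000L  # assumed replication factor

# Stack n_rep copies of the base table.
df <- base[rep(seq_len(nrow(base)), times = n_rep)]

# Give each copy its own group labels so the number of groups
# grows with n_rep (the question mentions "many groups").
df[, k := paste0(k, "_", rep(seq_len(n_rep), each = nrow(base)))]

print(dim(df))
print(uniqueN(df$k))
```

Note that the two data.table expressions in the benchmark add columns to df by reference, so df changes between iterations. Benchmarking on copy(df) would be one way to keep the runs independent.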

...well, I tried...

Source: https://stackoverflow.com/questions/46713569/fast-way-to-find-min-in-groups-after-excluding-observations-using-r

Tags

performance

data.table