Question
I need to do something similar to the example below on a very large data set (with many groups), and I read somewhere that using .SD is slow. Is there a faster way to perform the following operation?
To be more precise, I need to create a new column that contains the min value for each group after having excluded a subset of observations in that group (something similar to minif in Excel).
library(data.table)
dt <- data.table(valid = c(0,1,1,0,1),
a = c(1,1,2,3,4),
groups = c("A", "A", "A", "B", "B"))
dt[, valid_min := .SD[valid == 1, min(a, na.rm = TRUE)], by = groups]
With the output:
> dt
   valid a groups valid_min
1:     0 1      A         1
2:     1 1      A         1
3:     1 2      A         1
4:     0 3      B         4
5:     1 4      B         4
To make it even more complicated, groups could have no valid entries or they could have multiple valid but missing entries. My current code is similar to this:
dt <- data.table(valid = c(0,1,1,0,1,0,1,1),
a = c(1,1,2,3,4,3,NA,NA),
k = c("A", "A", "A", "B", "B", "C", "D", "D"))
dt[, valid_min := .SD[valid == 1,
ifelse(all(is.na(a)), NA_real_, min(a, na.rm = TRUE))], by = k]
Output:
> dt
   valid  a k valid_min
1:     0  1 A         1
2:     1  1 A         1
3:     1  2 A         1
4:     0  3 B         4
5:     1  4 B         4
6:     0  3 C        NA
7:     1 NA D        NA
8:     1 NA D        NA
Answer 1:
There's...
dt[dt[valid == 1 & !is.na(a), min(a), by=k], on=.(k), the_min := i.V1]
This should be fast since the inner call to min is optimized for groups. (See ?GForce.)
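As a sketch of how that one-liner breaks down, here it is split into its two steps and run on the second example table from the question. The column name grp_min is only an illustrative choice (the original relies on the default V1), and the result column is called valid_min here to match the question.

```r
library(data.table)

dt <- data.table(valid = c(0, 1, 1, 0, 1, 0, 1, 1),
                 a     = c(1, 1, 2, 3, 4, 3, NA, NA),
                 k     = c("A", "A", "A", "B", "B", "C", "D", "D"))

# Step 1: aggregate only the valid, non-missing rows.
# Groups with no such rows (C and D here) simply do not appear in agg.
agg <- dt[valid == 1 & !is.na(a), .(grp_min = min(a)), by = k]

# Step 2: update join. Each row of dt picks up the minimum of its group;
# rows whose group is absent from agg are left as NA.
dt[agg, on = .(k), valid_min := i.grp_min]

print(dt)
```

Because groups without any usable value never reach min(), there is no need for the ifelse(all(is.na(a)), ...) guard from the question.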
Answer 2:
We can do the same using dplyr:
library(dplyr)
dt %>%
  group_by(groups) %>%
  mutate(valid_min = min(ifelse(valid == 1, a, NA),
                         na.rm = TRUE))
Which gives:
  valid     a groups valid_min
  <dbl> <dbl> <chr>      <dbl>
1     0     1 A              1
2     1     1 A              1
3     1     2 A              1
4     0     3 B              4
5     1     4 B              4
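One caveat with this version: for a group that has no valid, non-missing value, min(..., na.rm = TRUE) returns Inf and emits a warning, which is why the benchmark code further down wraps the call in suppressWarnings and maps Inf back to NA. A minimal sketch of a way to avoid that, using the second example table from the question, is shown here. The helper min_or_na and the name df2 are hypothetical, introduced only for illustration; they are not part of dplyr.

```r
library(dplyr)

df2 <- tibble(valid = c(0, 1, 1, 0, 1, 0, 1, 1),
              a     = c(1, 1, 2, 3, 4, 3, NA, NA),
              k     = c("A", "A", "A", "B", "B", "C", "D", "D"))

# Hypothetical helper: minimum of x, or NA when nothing is left to compare
min_or_na <- function(x) if (length(x) > 0) min(x) else NA_real_

res <- df2 %>%
  group_by(k) %>%
  # keep only valid, non-missing values before taking the minimum
  mutate(valid_min = min_or_na(a[valid == 1 & !is.na(a)])) %>%
  ungroup()

print(res)
```

Groups C and D end up with NA rather than Inf, matching the output the question asks for.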
Alternatively, if you are not interested in keeping the 'non-valid' rows, we can do the following:
dt %>%
filter(valid == 1) %>%
group_by(groups) %>%
mutate(valid_min = min(a))
Looks like I provided the slowest approach. Comparing each approach (using a larger, replicated data frame called df) with a microbenchmark test:
library(microbenchmark)
library(ggplot2)
mbm <- microbenchmark(
dplyr.test = suppressWarnings(df %>%
group_by(k) %>%
mutate(valid_min = min(ifelse(valid == 1,
a, NA),
na.rm = TRUE),
valid_min = ifelse(valid_min == Inf,
NA,
valid_min))),
data.table.test = df[, valid_min := .SD[valid == 1,
ifelse(all(is.na(a)), NA_real_, min(a, na.rm = TRUE))], by = k],
GForce.test = df[df[valid == 1 & !is.na(a), min(a), by=k], on=.(k), the_min := i.V1]
)
autoplot(mbm)
...well, I tried...
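The construction of df is not shown above, so the following is only a guess at a comparable setup: it stacks copies of the question's second table and gives each copy its own group ids. The names base, n_rep and copy_id are assumptions made for this sketch. It also checks that the two data.table approaches agree before any timing is attempted. Since := modifies a data.table by reference, each approach works on its own copy() so that one run does not see columns added by another.

```r
library(data.table)

base <- data.table(valid = c(0, 1, 1, 0, 1, 0, 1, 1),
                   a     = c(1, 1, 2, 3, 4, 3, NA, NA),
                   k     = c("A", "A", "A", "B", "B", "C", "D", "D"))

# Assumed setup: stack n_rep copies and make the group ids unique per copy
n_rep <- 200L
df <- rbindlist(replicate(n_rep, base, simplify = FALSE), idcol = "copy_id")
df[, k := paste0(k, "_", copy_id)]

# Approach from the question: subset .SD inside every group
sd_res <- copy(df)
sd_res[, valid_min := .SD[valid == 1,
                          ifelse(all(is.na(a)), NA_real_, min(a, na.rm = TRUE))],
       by = k]

# Approach from Answer 1: aggregate once, then update join
join_res <- copy(df)
join_res[join_res[valid == 1 & !is.na(a), .(grp_min = min(a)), by = k],
         on = .(k), valid_min := i.grp_min]

# Both approaches should produce the same column
stopifnot(isTRUE(all.equal(sd_res$valid_min, join_res$valid_min)))
```

Each expression can then be passed to microbenchmark() in the same way as in the code above.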
Source: https://stackoverflow.com/questions/46713569/fast-way-to-find-min-in-groups-after-excluding-observations-using-r