Quite often I need to calculate a new column for an R dataframe (in long form), whose value should depend on an aggregate function (e.g. sum) of a group. For instance, I might want to know what fraction of sales a product accounts for on any given day:
daily fraction = revenue for product i on day d / sum or revenue for all products on day d
My current strategy is to summarise and join:
library(dplyr)
join_summary <- function(data, ...) left_join(data, summarise(data, ...))
data = data.frame(
day = c(1,1,2,2,3,3),
product = rep(c("A", "B"), 3),
revenue = c(2, 4, 8, 7, 9, 2)
)
data2 <- data %>%
group_by(day) %>%
join_summary(daily_revenue = sum(revenue)) %>%
mutate(revenue_fraction = revenue / daily_revenue)
This works, but I am not sure if it is an anti-pattern. It seems a bit inefficient to repeat the same data (the daily revenue) on multiple rows, littering my data frame with aggregates. My questions are:
- Is my current way of doing it OK?
- Is there any better way of doing it, preferably using tools from
dplyr
or the wider Hadleyverse? - Do I really need my custom function
join_summary
, or can it be done with the existingdplyr
verbs? (I prefer to stay within the "pipe flow", though.)
Besides using summarise
to aggregate, you can use mutate
to assign summary stats to the full column:
data %>%
group_by(day) %>%
mutate(
daily_revenue = sum(revenue),
revenue_fraction = revenue / daily_revenue
)
which gives
Source: local data frame [6 x 5]
Groups: day [3]
day product revenue daily_revenue revenue_fraction
<dbl> <fctr> <dbl> <dbl> <dbl>
1 1 A 2 6 0.3333333
2 1 B 4 6 0.6666667
3 2 A 8 15 0.5333333
4 2 B 7 15 0.4666667
5 3 A 9 11 0.8181818
6 3 B 2 11 0.1818182
This works because the value sum(revenue)
is recycled to fill in all rows in the group.
来源:https://stackoverflow.com/questions/42213465/how-to-calculate-new-column-depending-on-aggregate-function-on-group-using-dplyr