In R, I want to summarize my data after grouping it by the runs of a variable x (i.e., each group is a maximal stretch of consecutive rows that share the same x value). For instance, consider a data frame dat with columns x and y, where I want to compute the average y value within each run of x.
In this example, the x variable has runs of length 3, then 2, then 1, and finally 1, taking values 1, 2, 1, and 2 in those four runs. The corresponding means of y in those groups are 2, 4.5, 6, and 7.
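A data frame consistent with that description can be built as follows. The x values follow directly from the run lengths and values given above; y = 1:7 is an assumption, chosen because it reproduces the stated group means.

```r
# Example data: x has runs of length 3, 2, 1, 1 with values 1, 2, 1, 2.
# y = 1:7 is an assumption that matches the means 2, 4.5, 6, 7.
dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# rle() reports the length and value of each run of x
r <- rle(dat$x)
r$lengths  # 3 2 1 1
r$values   # 1 2 1 2
```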
It is easy to carry out this grouped operation in base R using tapply, passing dat$y as the data, using rle to compute the run number from dat$x, and passing the desired summary function:
    tapply(dat$y, with(rle(dat$x), rep(seq_along(lengths), lengths)), mean)
    #   1   2   3   4 
    # 2.0 4.5 6.0 7.0 

I figured I would be able to carry this logic over to dplyr fairly directly, but my attempts so far have all ended in errors:
    library(dplyr)

    # First attempt
    dat %>%
      group_by(with(rle(x), rep(seq_along(lengths), lengths))) %>%
      summarize(mean(y))
    # Error: cannot coerce type 'closure' to vector of type 'integer'

    # Attempt 2 -- maybe "with" is the problem?
    dat %>%
      group_by(rep(seq_along(rle(x)$lengths), rle(x)$lengths)) %>%
      summarize(mean(y))
    # Error: invalid subscript type 'closure'

For completeness, I could reimplement the rle run id myself using cumsum, head, and tail to get around this, but it makes the grouping code tougher to read and involves a bit of reinventing the wheel:
    dat %>%
      group_by(run=cumsum(c(1, head(x, -1) != tail(x, -1)))) %>%
      summarize(mean(y))
    #     run mean(y)
    #   (dbl)   (dbl)
    # 1     1     2.0
    # 2     2     4.5
    # 3     3     6.0
    # 4     4     7.0

What is causing my rle-based grouping code to fail in dplyr, and is there any solution that enables me to keep using rle when grouping by run id?
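As a sanity check that the two run-id constructions in this question agree, they can be compared in base R without involving dplyr. This is only a sketch: the helper names run_id_rle and run_id_cumsum are hypothetical and introduced just for this comparison, and y = 1:7 is an assumption consistent with the means stated earlier.

```r
# Example data matching the description in the question
# (y = 1:7 is an assumption that reproduces the stated means)
dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# Hypothetical helper: run id built from rle(), as in the tapply() call
run_id_rle <- function(v) {
  r <- rle(v)
  rep(seq_along(r$lengths), r$lengths)
}

# Hypothetical helper: run id built from cumsum(), as in the dplyr workaround
run_id_cumsum <- function(v) {
  cumsum(c(1, head(v, -1) != tail(v, -1)))
}

ids_rle <- run_id_rle(dat$x)
ids_cumsum <- run_id_cumsum(dat$x)

# The ids differ in storage type (integer vs. double), so compare by value
same_ids <- all(ids_rle == ids_cumsum)
same_ids  # TRUE

# Group means computed from the rle-based ids
run_means <- tapply(dat$y, ids_rle, mean)
```

Both helpers give the same grouping on this data, so the failure appears to come from how the rle expression is evaluated inside group_by rather than from the run-id logic itself.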