dplyr, lubridate : how to aggregate a dataframe by week?

问题

Consider the following example

library(tidyverse)
library(lubridate)
time <- seq(from =ymd("2014-02-24"),to= ymd("2014-03-20"), by="days")
set.seed(123)
values <- sample(seq(from = 20, to = 50, by = 5), size = length(time), replace = TRUE)
df2 <- data_frame(time, values)
df2 <- df2 %>% mutate(day_of_week = wday(time, label = TRUE))

Source: local data frame [25 x 3]

         time values day_of_week
       <date>  <dbl>      <fctr>
1  2014-02-24     30         Mon
2  2014-02-25     45        Tues
3  2014-02-26     30         Wed
4  2014-02-27     50       Thurs
5  2014-02-28     50         Fri
6  2014-03-01     20         Sat
7  2014-03-02     35         Sun
8  2014-03-03     50         Mon
9  2014-03-04     35        Tues
10 2014-03-05     35         Wed

I would like to aggregate this dataframe by week.

That is, suppose I define a week as starting on Monday morning and ending on Sunday evening, which we will call a Monday to Monday cycle. (importantly, I want to be able to choose other conventions, such as Friday to Friday for instance).

Then, I would simply like to count the mean of values for each week.

For instance, in the example above, one would compute the average of values between Monday February 24th to Sunday March 2nd, and so on.

How can I do that?

Thanks!

EDIT: thanks to all of you who contributed an idea. Somewhat unusual, I think my late solution is probably more appropriate here. Thanks again!

回答1:

In the tidyverse,

df2 %>% group_by(week = week(time)) %>% summarise(value = mean(values))

## # A tibble: 5 × 2
##    week    value
##   <dbl>    <dbl>
## 1     8 37.50000
## 2     9 38.57143
## 3    10 38.57143
## 4    11 36.42857
## 5    12 45.00000

Or use isoweek instead:

df2 %>% group_by(week = isoweek(time)) %>% summarise(value = mean(values))

## # A tibble: 4 × 2
##    week    value
##   <int>    <dbl>
## 1     9 37.14286
## 2    10 40.71429
## 3    11 35.00000
## 4    12 42.50000

Or cut.Date:

df2 %>% group_by(week = cut(time, "week")) %>% summarise(value = mean(values))

## # A tibble: 4 × 2
##         week    value
##       <fctr>    <dbl>
## 1 2014-02-24 37.14286
## 2 2014-03-03 40.71429
## 3 2014-03-10 35.00000
## 4 2014-03-17 42.50000

which you can tell to start on Sunday, if you prefer:

df2 %>% group_by(week = cut(time, "week", start.on.monday = FALSE)) %>% 
    summarise(value = mean(values))

## # A tibble: 4 × 2
##         week    value
##       <fctr>    <dbl>
## 1 2014-02-23 37.50000
## 2 2014-03-02 40.00000
## 3 2014-03-09 33.57143
## 4 2014-03-16 44.00000

If you want to shift to, say, Tuesday start, add one to your dates:

df2 %>% group_by(week = cut(time + 1, "week")) %>% summarise(value = mean(values))

## # A tibble: 4 × 2
##         week    value
##       <fctr>    <dbl>
## 1 2014-02-24 37.50000
## 2 2014-03-03 40.00000
## 3 2014-03-10 33.57143
## 4 2014-03-17 44.00000

Labels will be off, though. If using cut, consider the implications of its include.lowest and right parameters, documented at ?cut.

回答2:

why not straight up use floor_date and an integer to adjust the start date of the week?

library(lubridate)
time <- seq(from =ymd("2014-02-24"),to= ymd("2014-03-20"), by="days")

set.seed(123)

values <- sample(seq(from = 20, to = 50, by = 5), size = length(time), replace = TRUE)  
df2 <- data_frame(time, values)
df2 <- df2 %>% mutate(day_of_week = weekdays(time))

# week wednesday to tuesday
df2 %>% group_by(Week = floor_date(time-3, unit="week")) %>% 
  summarize(WeeklyAveDist=mean(values), mean(values), min_date = min(time), max_date = max(time)) %>% mutate(weekdays(min_date), weekdays(max_date)))

        Week WeeklyAveDist mean.values.   min_date   max_date
1 2014-02-16      37.50000     37.50000 2014-02-24 2014-02-25
2 2014-02-23      38.57143     38.57143 2014-02-26 2014-03-04
3 2014-03-02      38.57143     38.57143 2014-03-05 2014-03-11
4 2014-03-09      36.42857     36.42857 2014-03-12 2014-03-18
5 2014-03-16      45.00000     45.00000 2014-03-19 2014-03-20
  weekdays.min_date. weekdays.max_date.
1             Monday            Tuesday
2          Wednesday            Tuesday
3          Wednesday            Tuesday
4          Wednesday            Tuesday
5          Wednesday           Thursday


# Week Thursday to Wednesday
df2 %>% group_by(Week = floor_date(time-4, unit="week")) %>% 
  summarize(WeeklyAveDist=mean(values), mean(values), min_date = min(time), max_date = max(time)) %>% mutate(weekdays(min_date), weekdays(max_date)))

        Week WeeklyAveDist mean.values.   min_date   max_date
1 2014-02-16      35.00000     35.00000 2014-02-24 2014-02-26
2 2014-02-23      39.28571     39.28571 2014-02-27 2014-03-05
3 2014-03-02      37.14286     37.14286 2014-03-06 2014-03-12
4 2014-03-09      40.00000     40.00000 2014-03-13 2014-03-19
5 2014-03-16      40.00000     40.00000 2014-03-20 2014-03-20
  weekdays.min_date. weekdays.max_date.
1             Monday          Wednesday
2           Thursday          Wednesday
3           Thursday          Wednesday
4           Thursday          Wednesday
5           Thursday           Thursday

回答3:

aggregate(df2$values,by=list(week(df2$time)),mean)

  Group.1        x
1       8 30.00000
2       9 40.00000
3      10 36.42857
4      11 37.85714
5      12 43.33333

This uses the week function of lubridate and gives the week number of the week in the year.

To control which day of the week is the starting day just refer to this thread on that topic:

Changing lubridate function to start on Monday rather than Sunday

The solution from that thread by nograpes suggests that if you want a custom version of the week() function using an arbitrary day of the week as the beginning of the week that you just construct it from base R like this:

start.of.week <- function(date)
  date - (setNames(c(6,0:5),0:6) [strftime(date,'%w')])

end.of.week <- function(date)
  date + (setNames(c(0,6:1),0:6) [strftime(date,'%w')])

start.of.week(as.Date(c('2014-01-05','2014-10-02','2014-09-22','2014-09-27')))
# "2013-12-30" "2014-09-29" "2014-09-22" "2014-09-22"
end.of.week(as.Date(c('2014-01-05','2014-10-02','2014-09-22','2014-09-27')))
# "2014-01-05" "2014-10-05" "2014-09-28" "2014-09-28"

In the future lubridate will have this option for an arbitrary start day for weeks, but Hadley hasn't got around to adding it yet (https://github.com/hadley/lubridate/issues/257).

回答4:

Just this once, after some research, I actually think I came up with a better solution that

gives the correct aggregation
gives the correct labels

Example below for weeks starting on a thursday. The weeks will be labeled by their first day a given cycle.

library(tidyverse)
library(lubridate)
options(tibble.print_min = 30)

time <- seq(from =ymd("2014-02-24"),to= ymd("2014-03-20"), by="days")
set.seed(123)
values <- sample(seq(from = 20, to = 50, by = 5), size = length(time), replace = TRUE)
df2 <- data_frame(time, values)

df2 <- df2 %>% mutate(day_of_week_label = wday(time, label = TRUE),
                      day_of_week = wday(time, label = FALSE))

df2 <- df2 %>% mutate(thursday_cycle = time - ((as.integer(day_of_week) - 5) %% 7),
                      tmp_1 = (as.integer(day_of_week) - 5),
                      tmp_2 = ((as.integer(day_of_week) - 5) %% 7))

which gives

> df2
# A tibble: 25 × 7
         time values day_of_week_label day_of_week thursday_cycle tmp_1 tmp_2
       <date>  <dbl>             <ord>       <dbl>         <date> <dbl> <dbl>
1  2014-02-24     30               Mon           2     2014-02-20    -3     4
2  2014-02-25     45              Tues           3     2014-02-20    -2     5
3  2014-02-26     30               Wed           4     2014-02-20    -1     6
4  2014-02-27     50             Thurs           5     2014-02-27     0     0
5  2014-02-28     50               Fri           6     2014-02-27     1     1
6  2014-03-01     20               Sat           7     2014-02-27     2     2
7  2014-03-02     35               Sun           1     2014-02-27    -4     3
8  2014-03-03     50               Mon           2     2014-02-27    -3     4
9  2014-03-04     35              Tues           3     2014-02-27    -2     5
10 2014-03-05     35               Wed           4     2014-02-27    -1     6
11 2014-03-06     50             Thurs           5     2014-03-06     0     0
12 2014-03-07     35               Fri           6     2014-03-06     1     1
13 2014-03-08     40               Sat           7     2014-03-06     2     2
14 2014-03-09     40               Sun           1     2014-03-06    -4     3
15 2014-03-10     20               Mon           2     2014-03-06    -3     4
16 2014-03-11     50              Tues           3     2014-03-06    -2     5
17 2014-03-12     25               Wed           4     2014-03-06    -1     6
18 2014-03-13     20             Thurs           5     2014-03-13     0     0
19 2014-03-14     30               Fri           6     2014-03-13     1     1
20 2014-03-15     50               Sat           7     2014-03-13     2     2
21 2014-03-16     50               Sun           1     2014-03-13    -4     3
22 2014-03-17     40               Mon           2     2014-03-13    -3     4
23 2014-03-18     40              Tues           3     2014-03-13    -2     5
24 2014-03-19     50               Wed           4     2014-03-13    -1     6
25 2014-03-20     40             Thurs           5     2014-03-20     0     0

来源：https://stackoverflow.com/questions/40554231/dplyr-lubridate-how-to-aggregate-a-dataframe-by-week

标签

dplyr

xts

lubridate