Use dplyr to summarize but preserve date of group row

北慕城南 提交于 2020-02-07 01:59:13

问题


I have a data frame like the following:

          Date Flare Painmed_Use
1   2015-12-01     0           0
2   2015-12-02     0           0
3   2015-12-03     0           0
4   2015-12-04     0           0
5   2015-12-05     0           0
6   2015-12-06     0           1
7   2015-12-07     1           4
8   2015-12-08     1           3
9   2015-12-09     1           1
10  2015-12-10     1           0
11  2015-12-11     0           0
12  2015-12-12     0           0
13  2015-12-13     1           2
14  2015-12-14     1           3
15  2015-12-15     1           1
16  2015-12-16     0           0

I'm trying to find the length of each flare as well as the total med use during each flare using dplyr. My current solution (inspired by Use rle to group by runs when using dplyr),

df %>% 
    group_by(yy = {yy = rle(Flare); rep(seq_along(yy$lengths), yy$lengths)}, Flare) %>%
    summarize(Painmed_UseCum = sum(Painmed_Use),FlareLength = n())

gives the following output:

     yy Flare Painmed_UseCum FlareLength
   <int> <int>          <dbl>       <int>
 1     1     0              1           6
 2     2     1              8           4
 3     3     0              0           2
 4     4     1              6           3
 5     5     0              0           1

This is almost exactly what I need. However, I can't figure out how to preserve other columns, the critical one being the date that corresponds to the last row of a particular flare. So, the output I'm seeking is the same as above but with the addition of the Dates, like so:

           Date      yy Flare Painmed_UseCum FlareLength
                  <int> <int>          <dbl>       <int>
 1   2015-12-06       1     0              1           6
 2   2015-12-10       2     1              8           4
 3   2015-12-12       3     0              0           2
 4   2015-12-15       4     1              6           3
 5   2015-12-16       5     0              0           1

Note: In some ways this is a follow up from a previous question of mine (R code to get max count of time series data by group) but my attempt to keep that question simpler, though perhaps useful to others, ended up necessitating this further question.


回答1:


You could either include Date in summarise

library(dplyr)

df %>% 
  group_by(yy = {yy = rle(Flare); rep(seq_along(yy$lengths),yy$lengths)}) %>%
  summarize(Painmed_UseCum = sum(Painmed_Use),FlareLength = n(), Date = max(Date))

# Groups:   yy, Flare [5]
#  Date       Flare Painmed_Use    yy
#  <date>     <int>       <int> <int>
#1 2015-12-06     0           1     1
#2 2015-12-10     1           0     2
#3 2015-12-12     0           0     3
#4 2015-12-15     1           1     4
#5 2015-12-16     0           0     5

Or if there are more columns to preserve better approach is to use mutate and select the last row in each group.

df %>% 
  group_by(yy = {yy = rle(Flare); rep(seq_along(yy$lengths), yy$lengths)}) %>%
  mutate(Painmed_UseCum = sum(Painmed_Use),FlareLength = n()) %>%
  slice(n())

To create groups, we can replace rle with rleid from data.table which would be simpler.

group_by(yy = data.table::rleid(Flare))


来源:https://stackoverflow.com/questions/59978973/use-dplyr-to-summarize-but-preserve-date-of-group-row

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!