How to Perform Consecutive Counts of Column by Group Conditionally Upon Another Column

问题

I'm trying to get consecutive counts from the Noshow column grouped by the PatientID column. The below code that I am using is very close to the results that I wish to attain. However, using the sum function returns the sum of the whole group. I would like the sum function to only sum the current row and only the rows that have a '1' above it. Basically, I'm trying to count the consecutive amount of times a patient noshows their appointment for each row and then reset to 0 when they do show. It seems like only some tweaks need to be made to my below code. However, I cannot seem to find the answer anywhere on this site.

transform(df, ConsecNoshows = ifelse(Noshow == 0, 0, ave(Noshow, PatientID, FUN = sum)))

The above code produces the below output:

#Source: local data frame [12 x 3]
#Groups: ID [2]
#
#   PatientID Noshow ConsecNoshows
#       <int>  <int>         <int>   
#1          1      0             0
#2          1      1             4
#3          1      0             0
#4          1      1             4
#5          1      1             4
#6          1      1             4
#7          2      0             0
#8          2      0             0
#9          2      1             3
#10         2      1             3
#11         2      0             0
#12         2      1             3

This is what I desire:

#Source: local data frame [12 x 3]
#Groups: ID [2]
#
#   PatientID Noshow ConsecNoshows
#       <int>  <int>         <int>   
#1          1      0             0
#2          1      1             0
#3          1      0             1
#4          1      1             0
#5          1      1             1
#6          1      1             2
#7          2      0             0
#8          2      0             0
#9          2      1             0
#10         2      1             1
#11         2      0             2
#12         2      1             0

[UPDATE] I would like the consecutive count to be offset by one row down.

Thank you for any help you can offer in advance!

回答1:

And here's another (similar) data.table approach

library(data.table)
setDT(df)[, ConsecNoshows := seq(.N) * Noshow, by = .(PatientID, rleid(Noshow))]
df
#     PatientID Noshow ConsecNoshows
#  1:         1      0             0
#  2:         1      1             1
#  3:         1      0             0
#  4:         1      1             1
#  5:         1      1             2
#  6:         1      1             3
#  7:         2      0             0
#  8:         2      0             0
#  9:         2      1             1
# 10:         2      1             2
# 11:         2      0             0
# 12:         2      1             1

This is basically groups by PatientID and "run-length-encoding" of Noshow and creates sequences using the group sizes while multiplying by Noshow in order to keep only the values when Noshow == 1

回答2:

We can use rle from base R (No packages used). Using ave, we group by 'PatientID', get the rle of 'Noshow', multiply the sequence of 'lengths' by the 'values' replicated by 'lengths' to get the expected output.

helperfn <- function(x) with(rle(x), sequence(lengths) * rep(values, lengths))
df$ConsecNoshows <- with(df, ave(Noshow, PatientID, FUN = helperfn))
df$ConsecNoshows 
#[1] 0 1 0 1 2 3 0 0 1 2 0 1

As the OP seems to be using 'tbl_df', a solution in dplyr would be

library(dplyr)
df %>%
   group_by(PatientID) %>%
   mutate(ConsecNoshows = helperfn(Noshow))
#     PatientID Noshow ConsecNoshows
#       <int>  <int>         <int>
#1          1      0             0
#2          1      1             1
#3          1      0             0
#4          1      1             1
#5          1      1             2
#6          1      1             3
#7          2      0             0
#8          2      0             0
#9          2      1             1
#10         2      1             2
#11         2      0             0
#12         2      1             1

回答3:

I would create a helper function to then use whatever implementation you're most comfortable with:

sum0 <- function(x) {x[x == 1]=sequence(with(rle(x), lengths[values == 1]));x}

#base R
transform(df1, Consec = ave(Noshow, PatientID, FUN=sum0))

#dplyr
library(dplyr)
df1 %>% group_by(PatientID) %>% mutate(Consec=sum0(Noshow))

#data.table
library(data.table)
setDT(df1)[, Consec := sum0(Noshow), by = PatientID]
  #    PatientID Noshow Consec
  #        <int>  <int>  <int>
  # 1          1      0      0
  # 2          1      1      1
  # 3          1      0      0
  # 4          1      1      1
  # 5          1      1      2
  # 6          1      1      3
  # 7          2      0      0
  # 8          2      0      0
  # 9          2      1      1
  # 10         2      1      2
  # 11         2      0      0
  # 12         2      1      1

回答4:

The most straight forward way to group consecutive values is to use rleid from data.table, here is an option from data.table package, where you group data by the PatientID as well as rleid of Noshow variable. And also you need the cumsum function to get a cumulative sum of the Noshow variable instead of sum:

library(data.table)
setDT(df)[, ConsecNoshows := ifelse(Noshow == 0, 0, cumsum(Noshow)), .(PatientID, rleid(Noshow))]
df
#    PatientID Noshow ConsecNoshows
# 1:         1      0             0
# 2:         1      1             1
# 3:         1      0             0
# 4:         1      1             1
# 5:         1      1             2
# 6:         1      1             3
# 7:         2      0             0
# 8:         2      0             0
# 9:         2      1             1
#10:         2      1             2
#11:         2      0             0
#12:         2      1             1

来源：https://stackoverflow.com/questions/38704103/how-to-perform-consecutive-counts-of-column-by-group-conditionally-upon-another

标签

sum

aggregate-functions

feature-extraction

feature-selection