问题
I'm trying to get consecutive counts from the Noshow column grouped by the PatientID column. The below code that I am using is very close to the results that I wish to attain. However, using the sum function returns the sum of the whole group. I would like the sum function to only sum the current row and only the rows that have a '1' above it. Basically, I'm trying to count the consecutive amount of times a patient noshows their appointment for each row and then reset to 0 when they do show. It seems like only some tweaks need to be made to my below code. However, I cannot seem to find the answer anywhere on this site.
transform(df, ConsecNoshows = ifelse(Noshow == 0, 0, ave(Noshow, PatientID, FUN = sum)))
The above code produces the below output:
#Source: local data frame [12 x 3]
#Groups: ID [2]
#
# PatientID Noshow ConsecNoshows
# <int> <int> <int>
#1 1 0 0
#2 1 1 4
#3 1 0 0
#4 1 1 4
#5 1 1 4
#6 1 1 4
#7 2 0 0
#8 2 0 0
#9 2 1 3
#10 2 1 3
#11 2 0 0
#12 2 1 3
This is what I desire:
#Source: local data frame [12 x 3]
#Groups: ID [2]
#
# PatientID Noshow ConsecNoshows
# <int> <int> <int>
#1 1 0 0
#2 1 1 0
#3 1 0 1
#4 1 1 0
#5 1 1 1
#6 1 1 2
#7 2 0 0
#8 2 0 0
#9 2 1 0
#10 2 1 1
#11 2 0 2
#12 2 1 0
[UPDATE] I would like the consecutive count to be offset by one row down.
Thank you for any help you can offer in advance!
回答1:
And here's another (similar) data.table
approach
library(data.table)
setDT(df)[, ConsecNoshows := seq(.N) * Noshow, by = .(PatientID, rleid(Noshow))]
df
# PatientID Noshow ConsecNoshows
# 1: 1 0 0
# 2: 1 1 1
# 3: 1 0 0
# 4: 1 1 1
# 5: 1 1 2
# 6: 1 1 3
# 7: 2 0 0
# 8: 2 0 0
# 9: 2 1 1
# 10: 2 1 2
# 11: 2 0 0
# 12: 2 1 1
This is basically groups by PatientID
and "run-length-encoding" of Noshow
and creates sequences using the group sizes while multiplying by Noshow
in order to keep only the values when Noshow == 1
回答2:
We can use rle
from base R
(No packages used). Using ave
, we group by 'PatientID', get the rle
of 'Noshow', multiply the sequence
of 'lengths' by the 'values' replicated by 'lengths' to get the expected output.
helperfn <- function(x) with(rle(x), sequence(lengths) * rep(values, lengths))
df$ConsecNoshows <- with(df, ave(Noshow, PatientID, FUN = helperfn))
df$ConsecNoshows
#[1] 0 1 0 1 2 3 0 0 1 2 0 1
As the OP seems to be using 'tbl_df', a solution in dplyr
would be
library(dplyr)
df %>%
group_by(PatientID) %>%
mutate(ConsecNoshows = helperfn(Noshow))
# PatientID Noshow ConsecNoshows
# <int> <int> <int>
#1 1 0 0
#2 1 1 1
#3 1 0 0
#4 1 1 1
#5 1 1 2
#6 1 1 3
#7 2 0 0
#8 2 0 0
#9 2 1 1
#10 2 1 2
#11 2 0 0
#12 2 1 1
回答3:
I would create a helper function to then use whatever implementation you're most comfortable with:
sum0 <- function(x) {x[x == 1]=sequence(with(rle(x), lengths[values == 1]));x}
#base R
transform(df1, Consec = ave(Noshow, PatientID, FUN=sum0))
#dplyr
library(dplyr)
df1 %>% group_by(PatientID) %>% mutate(Consec=sum0(Noshow))
#data.table
library(data.table)
setDT(df1)[, Consec := sum0(Noshow), by = PatientID]
# PatientID Noshow Consec
# <int> <int> <int>
# 1 1 0 0
# 2 1 1 1
# 3 1 0 0
# 4 1 1 1
# 5 1 1 2
# 6 1 1 3
# 7 2 0 0
# 8 2 0 0
# 9 2 1 1
# 10 2 1 2
# 11 2 0 0
# 12 2 1 1
回答4:
The most straight forward way to group consecutive values is to use rleid
from data.table
, here is an option from data.table
package, where you group data by the PatientID
as well as rleid
of Noshow
variable. And also you need the cumsum
function to get a cumulative sum of the Noshow
variable instead of sum
:
library(data.table)
setDT(df)[, ConsecNoshows := ifelse(Noshow == 0, 0, cumsum(Noshow)), .(PatientID, rleid(Noshow))]
df
# PatientID Noshow ConsecNoshows
# 1: 1 0 0
# 2: 1 1 1
# 3: 1 0 0
# 4: 1 1 1
# 5: 1 1 2
# 6: 1 1 3
# 7: 2 0 0
# 8: 2 0 0
# 9: 2 1 1
#10: 2 1 2
#11: 2 0 0
#12: 2 1 1
来源:https://stackoverflow.com/questions/38704103/how-to-perform-consecutive-counts-of-column-by-group-conditionally-upon-another