Question
This is a follow-up to a question asked previously (Filter for events that occur within a time range of event "A" in r). Since the original post was answered correctly, I decided to start a new question. If this is improper, let me know.
Quick recap: I have event data with a time value in seconds. I wanted to keep all B events that occur within the 5 seconds before an A event.
The issue I've run into is that the data is split into periods, and the seconds restart in each period. I didn't think this would be an issue because the data was sorted, so I didn't include a period column in my original question, but there have been some unexpected results.
Here is a sample of data with the addition of a period column.
library(dplyr)  # provides tibble(), sample_n(), mutate(), %>% and friends

set.seed(123)
event_df <- tibble(time_sec = c(1:120)) %>%
  sample_n(100) %>%
  mutate(period = sample(c(1,2,3),
                         size = 100,
                         replace = TRUE),
         event = sample(c("A","B"),
                        size = 100,
                        replace = TRUE,
                        prob = c(0.1,0.9))) %>%
  select(period, time_sec, event) %>%
  arrange(period, time_sec)
When using the solution that originally worked...
event_df %>%
  group_by(grp = lag(cumsum(event == 'A'), default = 0)) %>%
  filter((last(time_sec) - time_sec) <= 5)
... you'll notice that it works correctly, except that the first A event of each period grabs all the B events left over at the end of the prior period, regardless of their time. For example, grp 4 looks like this:
period  time_sec  event  grp
1       111       B      4
1       114       B      4
1       120       B      4
2       79        B      4
2       83        A      4
Expected output for grp 4 would be:
period  time_sec  event  grp
2       79        B      4
2       83        A      4
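For anyone unfamiliar with the grouping trick, here is a minimal sketch of how lag(cumsum(event == 'A')) labels rows. The events vector is made up for illustration and is not part of the sample data above. Each A row closes the group it belongs to, and any trailing B rows end up in a final group with no A.

```r
library(dplyr)

# Hypothetical event sequence, for illustration only
events <- c("B", "A", "B", "B", "A", "B")

# Running count of A events seen so far (including the current row)
running_a <- cumsum(events == "A")

# Shifting by one row makes each A share a group with the B rows before it
grp <- lag(running_a, default = 0)

data.frame(events, running_a, grp)
```

Because the label ignores period, a group that starts at the end of one period simply continues into the next one.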
I tried grouping by period thinking this would solve the issue, and while it filtered out most of the events, it still took the last event from the previous period.
event_df %>%
  group_by(period,
           grp = lag(cumsum(event == 'A'), default = 0)) %>%
  filter((last(time_sec) - time_sec) <= 5)
Results in:
period  time_sec  event  grp
1       120       B      4
2       79        B      4
2       83        A      4
Closer, but still grabbing the last event from the previous period.
Update: I realized that the extra rows were included because their time difference was a negative number. Requiring a non-negative difference solves it, except that there is a final grouping with no A event.
event_df %>%
  group_by(grp = lag(cumsum(event == 'A'), default = 0)) %>%
  filter((last(time_sec) - time_sec) <= 5 & (last(time_sec) - time_sec) >= 0)
Answer 1:
Since you added period to group_by(), your grp values cross period values, so a single grp can be split across two periods. If a period doesn't end in an event "A", the filter uses an event "B" value for last(time_sec) - time_sec, so it always returns the final value in the period and any other "B" events within 5 seconds of it. A simple solution (it works for the toy data; I'm not sure about the real data) is to modify the filter() command to make sure we're getting the true last value (which is the only event "A" in the grp):
event_df %>%
  group_by(grp = lag(cumsum(event == 'A'), default = 0), period) %>%
  filter((last(time_sec[event == 'A']) - time_sec) <= 5)
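This works because the value of last(time_sec[event=='A']) is NA if there are no event "A" observations in the (period, grp) pair. The comparison is then NA for every row in that group, and filter() drops rows where the condition is NA.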
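To see the NA behaviour in isolation, here is a minimal sketch on a small hand-made data frame. toy_df and its values are made up for illustration and are not taken from the question's data. Period 1 ends without an "A", so its rows should disappear, while period 2 keeps only the rows within 5 seconds of its "A".

```r
library(dplyr)

# Hypothetical toy data: period 1 has no "A"; period 2 has one "A" at 83
toy_df <- tibble(
  period   = c(1, 1, 2, 2, 2),
  time_sec = c(114, 120, 70, 79, 83),
  event    = c("B", "B", "B", "B", "A")
)

# last() on an empty vector returns NA instead of raising an error
empty_last <- last(integer(0))

result <- toy_df %>%
  group_by(grp = lag(cumsum(event == "A"), default = 0), period) %>%
  # In a group without an "A" the difference is NA for every row,
  # and filter() treats NA like FALSE, so the whole group is dropped.
  filter((last(time_sec[event == "A"]) - time_sec) <= 5) %>%
  ungroup()

result
```

Only the rows at 79 and 83 in period 2 should remain. The row at 70 is more than 5 seconds before the "A", and both period 1 rows fall in a group with no "A" at all.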
Answer 2:
This works too. A cheating version where I chop off anything after the last "A" event.
event_df %>%
  slice(1:max(which(event == "A"))) %>%
  group_by(grp = lag(cumsum(event == 'A'), default = 0)) %>%
  filter((last(time_sec) - time_sec) <= 5 & (last(time_sec) - time_sec) >= 0)
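One caveat that may be worth checking against real data: when grouping by grp alone, a B event left over from the previous period can still pass the 0-to-5 check if its time_sec happens to fall just below the time_sec of the A event in the next period. The sketch below uses made-up values (toy_df is hypothetical) to compare this version with one that also groups by period.

```r
library(dplyr)

# Hypothetical toy data: the B at 80 in period 1 is numerically
# within 5 seconds of the A at 83 in period 2
toy_df <- tibble(
  period   = c(1, 1, 2, 2),
  time_sec = c(80, 120, 79, 83),
  event    = c("B", "B", "B", "A")
)

# Grouping by grp only: the period 1 row at 80 slips through
without_period <- toy_df %>%
  slice(1:max(which(event == "A"))) %>%
  group_by(grp = lag(cumsum(event == "A"), default = 0)) %>%
  filter((last(time_sec) - time_sec) <= 5 & (last(time_sec) - time_sec) >= 0) %>%
  ungroup()

# Grouping by grp and period, comparing against the "A" row only
with_period <- toy_df %>%
  group_by(grp = lag(cumsum(event == "A"), default = 0), period) %>%
  filter((last(time_sec[event == "A"]) - time_sec) <= 5) %>%
  ungroup()

without_period
with_period
```

On this toy input, the first result keeps the period 1 row at 80 and the second does not. Whether this matters depends on how the times are distributed in the real data.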
Source: https://stackoverflow.com/questions/65222140/filter-for-events-that-occur-within-a-time-range-of-event-a-in-r-part-2