Filter for events that occur within a time range of event "A" in R, part 2

Posted by 孤人 on 2020-12-16 02:21:19

Question


This is a follow-up to a question asked previously (Filter for events that occur within a time range of event "A" in r). Since the original post was answered correctly, I decided to start a new question. If this is improper, let me know.

Quick recap: I have event data with a time value in seconds. I wanted to keep only the B events that occurred within the 5 seconds before each A event.

The issue I've run into is that the data is split into periods and the seconds restart in each period. I didn't think this would be an issue because the data was sorted, so I didn't include a period column in my original question, but there have been some unexpected results.

Here is a sample of data with the addition of a period column.

library(dplyr)  # provides tibble(), %>%, sample_n(), mutate(), select(), arrange()
set.seed(123)
event_df <- tibble(time_sec = c(1:120)) %>% 
  sample_n(100) %>%
  mutate(period = sample(c(1,2,3),
                       size = 100,
                       replace = TRUE),
         event = sample(c("A","B"), 
                        size = 100, 
                        replace = TRUE, 
                        prob = c(0.1,0.9))) %>% 
  select(period, time_sec, event) %>% 
  arrange(period, time_sec)

When using the solution that originally worked...

event_df %>%
  group_by(grp =  lag(cumsum(event == 'A'), default = 0)) %>% 
  filter((last(time_sec) - time_sec) <=5)

... you'll notice that it works correctly, except that the first A event of each period grabs all the B events from the prior period regardless of their time. For example, grp 4 looks like this:

period  time_sec  event  grp
1       111       B      4
1       114       B      4
1       120       B      4
2       79        B      4
2       83        A      4

Expected output for grp 4 would be:

period  time_sec  event  grp
2       79        B      4
2       83        A      4
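
To see why a group can cross a period boundary, it may help to look at how grp is built. The sketch below uses a small hand-made tibble (the values in toy are hypothetical, not taken from event_df) and shows that lag(cumsum(event == "A")) only counts A events and knows nothing about period, so the rows after the last A of one period share a grp with the first A of the next.

```r
library(dplyr)

# Hypothetical toy data: two periods, with the seconds restarting in period 2
toy <- tibble(
  period   = c(1, 1, 1, 2, 2),
  time_sec = c(10, 12, 118, 79, 83),
  event    = c("B", "A", "B", "B", "A")
)

toy_grp <- toy %>%
  # cumsum() counts the A events seen so far; lag() shifts the count down one
  # row so that each A closes its own group instead of opening the next one
  mutate(grp = lag(cumsum(event == "A"), default = 0)) %>%
  group_by(grp) %>%
  # time difference to the last row of the group, as used in the filter
  mutate(diff = last(time_sec) - time_sec) %>%
  ungroup()

toy_grp$grp   # 0 0 1 1 1
toy_grp$diff  # 2 0 -35 4 0
```

The B event at 118 seconds in period 1 lands in grp 1 together with the A event at 83 seconds in period 2. Its difference is -35, which passes a `<= 5` test even though the two events are in different periods.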

I tried grouping by period, thinking this would solve the issue. While it filtered out most of the events, it still kept the last event from the previous period.

event_df %>%
  group_by(period,
           grp =  lag(cumsum(event == 'A'), default = 0)) %>% 
  filter((last(time_sec) - time_sec) <=5)

Results in:

period  time_sec  event  grp
1       120       B      4
2       79        B      4
2       83        A      4

Closer, but still grabbing the last event from the previous period.

Update: I realized that those rows were included because the time difference was a negative number. The following solves it, except that there is a final group with no A event.

event_df %>%
  group_by(grp =  lag(cumsum(event == 'A'), default = 0)) %>% 
  filter((last(time_sec) - time_sec) <=5 & (last(time_sec) - time_sec) >= 0 )

Answer 1:


Once period is added to group_by(), a grp value can still span two periods, so the part of a grp that falls in the earlier period contains no event "A". In that (period, grp) group, last(time_sec) is the time of an event "B", so the filter always keeps the final event of the period plus any other "B" events within 5 seconds of it. A simple solution (it works for the toy data; I'm not sure about the real data) is to modify the filter() call so that it uses the time of the true last event, which is the only event "A" in the grp:

event_df %>%
  group_by(grp =  lag(cumsum(event == 'A'), default = 0), period) %>% 
  filter((last(time_sec[event=='A']) - time_sec) <=5)

This works because last(time_sec[event=='A']) is NA when a (period, grp) pair contains no event "A" observations. The comparison then evaluates to NA, and filter() drops rows for which the condition is NA.
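
The NA behaviour described above can be checked on a small example. This is only a sketch on hand-made data (the toy tibble and its values are hypothetical, not taken from event_df); it applies the same group_by() and filter() calls as the answer.

```r
library(dplyr)

# An empty selection has no last element, so dplyr::last() returns NA
no_a <- last(c(111, 114, 120)[c(FALSE, FALSE, FALSE)])
is.na(no_a)  # TRUE

# Hypothetical toy data: the B at 118 seconds is the tail of period 1
toy <- tibble(
  period   = c(1, 1, 1, 2, 2),
  time_sec = c(10, 12, 118, 79, 83),
  event    = c("B", "A", "B", "B", "A")
)

kept <- toy %>%
  group_by(grp = lag(cumsum(event == "A"), default = 0), period) %>%
  # In the (grp 1, period 1) group there is no A, so the condition is NA
  # and filter() drops that row
  filter((last(time_sec[event == "A"]) - time_sec) <= 5) %>%
  ungroup()

kept$time_sec  # 10 12 79 83
```

The row at 118 seconds is dropped because its group has no A event, while the B events at 10 and 79 seconds are kept because each is within 5 seconds of the A that follows it in the same period.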




Answer 2:


This works too. It is a bit of a cheat: I chop off everything after the last "A" event before grouping.

event_df %>% 
  slice(1:max(which(event=="A"))) %>% 
  group_by(grp = lag(cumsum(event == 'A'), default = 0)) %>% 
  filter((last(time_sec) - time_sec) <=5 & (last(time_sec) - time_sec) >= 0)
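
One case that may be worth checking on real data: this version does not group by period, so a B event at the end of one period could be numerically within 5 seconds of an A event at the start of the next. The sketch below uses hand-made data (the edge tibble and its values are hypothetical, not taken from event_df) to compare this approach with adding period to the grouping as in the first answer.

```r
library(dplyr)

# Hypothetical edge case: period 1 ends with a B at 80 seconds and
# period 2 starts with an A at 83 seconds
edge <- tibble(
  period   = c(1, 1, 2),
  time_sec = c(50, 80, 83),
  event    = c("A", "B", "A")
)

# Slice-based approach, grouped by grp only
ungrouped <- edge %>%
  slice(1:max(which(event == "A"))) %>%
  group_by(grp = lag(cumsum(event == "A"), default = 0)) %>%
  filter((last(time_sec) - time_sec) <= 5 & (last(time_sec) - time_sec) >= 0) %>%
  ungroup()

# Grouped by grp and period, comparing against the A event's time
by_period <- edge %>%
  group_by(grp = lag(cumsum(event == "A"), default = 0), period) %>%
  filter((last(time_sec[event == "A"]) - time_sec) <= 5) %>%
  ungroup()

ungrouped$time_sec  # 50 80 83
by_period$time_sec  # 50 83
```

In this constructed case the ungrouped version keeps the B event from period 1, because 83 - 80 is between 0 and 5, while the period-aware version drops it. Whether that situation can occur depends on the real data.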


Source: https://stackoverflow.com/questions/65222140/filter-for-events-that-occur-within-a-time-range-of-event-a-in-r-part-2
