I have a large data set that has unique IDs for individuals as well as dates, and each individual is capable of multiple encounters.
The below is code and an example of how this data might look:
```r
strDates <- c("09/09/16", "6/7/16", "5/6/16", "2/3/16", "2/1/16",
              "11/8/16", "6/8/16", "5/8/16", "2/3/16", "1/1/16")
Date <- as.Date(strDates, "%m/%d/%y")
ID <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B")
Event <- c(1, 0, 1, 0, 1, 0, 1, 1, 1, 0)
sample_df <- data.frame(Date, ID, Event)
sample_df
```

```
         Date ID Event
1  2016-09-09  A     1
2  2016-06-07  A     0
3  2016-05-06  A     1
4  2016-02-03  A     0
5  2016-02-01  A     1
6  2016-11-08  B     0
7  2016-06-08  B     1
8  2016-05-08  B     1
9  2016-02-03  B     1
10 2016-01-01  B     0
```
I want to keep all attached information per encounter, but also aggregate the following historical information by ID:
- Number of Previous Encounters
- Number of Previous Events
As an example, let's look at Row 2.
Row 2 is ID A, so I would reference Rows 3-5 (which occurred prior to the Row 2 encounter). Within this group of rows, we see that Rows 3 and 5 both had events.
Number of Previous Encounters for Row 2 = 3
Number of Previous Events for Row 2 = 2
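To make the target concrete, the counts for Row 2 can be checked directly with base R subsetting (a minimal sketch; it rebuilds the example data so it runs on its own):

```r
# Rebuild the example data so the snippet is self-contained
strDates <- c("09/09/16", "6/7/16", "5/6/16", "2/3/16", "2/1/16",
              "11/8/16", "6/8/16", "5/8/16", "2/3/16", "1/1/16")
sample_df <- data.frame(Date = as.Date(strDates, "%m/%d/%y"),
                        ID = c("A","A","A","A","A","B","B","B","B","B"),
                        Event = c(1, 0, 1, 0, 1, 0, 1, 1, 1, 0))

# Rows with the same ID as Row 2 and a strictly earlier Date
prev <- sample_df$ID == sample_df$ID[2] & sample_df$Date < sample_df$Date[2]
sum(prev)                   # 3 previous encounters
sum(sample_df$Event[prev])  # 2 previous events
```

This is the per-row definition the answers below vectorize per group.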
Ideally, I would get the following output:
```
         Date ID Event PrevEnc PrevEvent
1  2016-09-09  A     1       4         2
2  2016-06-07  A     0       3         2
3  2016-05-06  A     1       2         1
4  2016-02-03  A     0       1         1
5  2016-02-01  A     1       0         0
6  2016-11-08  B     0       4         3
7  2016-06-08  B     1       3         2
8  2016-05-08  B     1       2         1
9  2016-02-03  B     1       1         0
10 2016-01-01  B     0       0         0
```
So far, I have tried working this problem in `dplyr` with `mutate` as well as `summarise`, but neither has let me restrict the aggregation to events that occurred previously for a specific ID. I have also tried some messy for-loops with if-then statements, but I am really just wondering if a package or technique exists to simplify this process.
Thank you!
The biggest impediment is the current sort order. Here, I stored an original index, which I later used to re-sort the data (and then removed). Other than that, the basic idea is to count up from 0 for the encounters, and to use `cumsum` to count the events as they happen. To that end, `lag` is used to avoid counting the current event.
```r
library(dplyr)

sample_df %>%
  mutate(origIndex = 1:n()) %>%
  group_by(ID) %>%
  arrange(ID, Date) %>%
  mutate(PrevEncounters = 0:(n() - 1),
         PrevEvents = cumsum(lag(Event, default = 0))) %>%
  arrange(origIndex) %>%
  select(-origIndex)
```
Gives
```
         Date     ID Event PrevEncounters PrevEvents
       <date> <fctr> <dbl>          <int>      <dbl>
1  2016-09-09      A     1              4          2
2  2016-06-07      A     0              3          2
3  2016-05-06      A     1              2          1
4  2016-02-03      A     0              1          1
5  2016-02-01      A     1              0          0
6  2016-11-08      B     0              4          3
7  2016-06-08      B     1              3          2
8  2016-05-08      B     1              2          1
9  2016-02-03      B     1              1          0
10 2016-01-01      B     0              0          0
```
Or, if you want to try `data.table`, you can use this:
```r
library(data.table)

# Convert to data.table and sort by date
sample_dt <- as.data.table(sample_df)
sample_dt <- sample_dt[order(Date)]

# Count only the previous Events with 1
sample_dt[, prevEvent := ifelse(Event == 1, cumsum(Event) - 1, cumsum(Event)),
          by = "ID"]

# .I gives the row number, and .SD contains the Subset of the Data for each group
sample_dt[, prevEnc := .SD[, .I - 1], by = "ID"]

print(sample_dt)
```

```
          Date ID Event prevEvent prevEnc
 1: 2016-01-01  B     0         0       0
 2: 2016-02-01  A     1         0       0
 3: 2016-02-03  A     0         1       1
 4: 2016-02-03  B     1         0       1
 5: 2016-05-06  A     1         1       2
 6: 2016-05-08  B     1         1       2
 7: 2016-06-07  A     0         2       3
 8: 2016-06-08  B     1         2       3
 9: 2016-09-09  A     1         2       4
10: 2016-11-08  B     0         3       4
```
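Note that this result is left sorted by `Date` rather than in the original row order. If you need the original order back, one option is the same trick as the dplyr answer: keep a temporary index column (here called `origIndex`, my own name for illustration) and re-sort on it at the end. A sketch, which also uses `cumsum(Event) - Event` as an equivalent of the `ifelse` above:

```r
library(data.table)

# Self-contained rebuild of the example data as a data.table
strDates <- c("09/09/16", "6/7/16", "5/6/16", "2/3/16", "2/1/16",
              "11/8/16", "6/8/16", "5/8/16", "2/3/16", "1/1/16")
sample_dt <- data.table(Date = as.Date(strDates, "%m/%d/%y"),
                        ID = c("A","A","A","A","A","B","B","B","B","B"),
                        Event = c(1, 0, 1, 0, 1, 0, 1, 1, 1, 0))

sample_dt[, origIndex := .I]               # remember the original row order
setorder(sample_dt, Date)                  # sort chronologically
sample_dt[, prevEvent := cumsum(Event) - Event, by = ID]  # events strictly before each row
sample_dt[, prevEnc := seq_len(.N) - 1L, by = ID]         # earlier encounters per ID
setorder(sample_dt, origIndex)             # restore the original order
sample_dt[, origIndex := NULL]
```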
If you don't know this package yet, there is an excellent cheat sheet covering most of the operations.
As @Frank and @MarkPeterson point out, the biggest hurdle here is that the `Date` column is sorted in descending order. Here is another approach that does not require re-sorting the `Date` column:
```r
library(dplyr)

res <- sample_df %>%
  group_by(ID) %>%
  mutate(PrevEnc = n() - row_number(),
         PrevEvent = rev(cumsum(lag(rev(Event), default = 0))))
```
Here, we use `row_number()` to determine the row index and `n()` to determine the number of rows (grouped by `ID`). Since `Date` is sorted in descending order, the number of previous encounters is simply `n() - row_number()`. To compute the number of previous events, we again exploit the fact that the `Date` column is sorted in descending order: we use `rev` to reverse the order of the `Event` column, take the `cumsum` of the `lag` of this reversed column, and then use `rev` again to reverse the result back to the original order.
Using your data:
```
print(res)
##Source: local data frame [10 x 5]
##Groups: ID [2]
##
##         Date     ID Event PrevEnc PrevEvent
##       <date> <fctr> <dbl>   <int>     <dbl>
##1  2016-09-09      A     1       4         2
##2  2016-06-07      A     0       3         2
##3  2016-05-06      A     1       2         1
##4  2016-02-03      A     0       1         1
##5  2016-02-01      A     1       0         0
##6  2016-11-08      B     0       4         3
##7  2016-06-08      B     1       3         2
##8  2016-05-08      B     1       2         1
##9  2016-02-03      B     1       1         0
##10 2016-01-01      B     0       0         0
```