Question
I am working on an analysis of animal locations that requires locations for each animal to be 60 minutes or greater apart. Time differences in locations among animals does not matter. The data set has a list of animal IDs and date and time of each location, example below.
For example, for animal 6 below, starting at the 16:19 location, the code would iterate through locations until it finds a location that is 60+ minutes from 16:19. In this case it would be the 17:36 location. Then, the code would start from the 17:36 location to find the next location (18:52) 60+ minutes, and so on. Each of the locations 60+ minutes from each other would then be extracted to a separate dataframe.
I have written a loop in R to subset the data, but I am having an issue with the code not accounting for a change in date when calculating whether locations are 60 minutes or more apart.
I have been exploring the lubridate package, which seems like it may have an easier way to address subsetting my data. However, I have not yet found a solution to subsetting the data to my specifications using this package. Any suggestions for using lubridate or an alternative method would be greatly appreciated.
Thank you in advance for your consideration.
> data(locdata)
> View(locdata)
id date time
6 30-Jun-09 16:19
6 30-Jun-09 16:31
6 30-Jun-09 17:36
6 30-Jun-09 17:45
6 30-Jun-09 18:00
6 30-Jun-09 18:52
6 7-Aug-10 5:30
6 7-Aug-10 5:45
6 7-Aug-10 6:00
6 7-Aug-10 6:45
23 30-Jun-09 17:15
23 30-Jun-09 17:38
23 30-Jun-09 17:56
23 30-Jun-09 20:00
23 30-Jun-09 22:19
23 18-Jul-11 16:22
23 18-Jul-11 17:50
23 18-Jul-11 18:15
The output from the example data above would look like this:
id date time
6 30-Jun-09 16:19
6 30-Jun-09 17:36
6 30-Jun-09 18:52
6 7-Aug-10 5:30
6 7-Aug-10 6:45
23 30-Jun-09 17:15
23 30-Jun-09 20:00
23 30-Jun-09 22:19
23 18-Jul-11 16:22
23 18-Jul-11 17:50
Answer 1:
If I understood you correctly, I think you're looking for something along these lines:
library(dplyr)
library(lubridate)
locdata %>%
  mutate(timestamp = dmy_hm(paste(date, time))) %>%
  group_by(id, date) %>%
  mutate(delta = timestamp - lag(timestamp))
If you haven't used dplyr or magrittr before, the syntax above may be unclear. The %>% operator passes the results of each computation to the next function, so the above code does the following:
- Parse the date and time into a timestamp that R understands, using lubridate
- Group the data by id and unique dates
- Within each group, calculate the duration between observations
If you want to save the output, change the first line to something like results <- locdata %>%.
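As a side note on the original problem: the reason parsing helps with date changes is that dmy_hm() returns a true POSIXct timestamp, so subtraction works correctly across midnight. A minimal illustration (not part of the pipeline above):

```r
library(lubridate)

# dmy_hm() parses "day-month-year hour:minute" strings into POSIXct,
# so time differences account for date rollovers automatically.
t1 <- dmy_hm("30-Jun-09 23:30")
t2 <- dmy_hm("1-Jul-09 0:45")
difftime(t2, t1, units = "mins")  # Time difference of 75 mins
```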
Based on your updated question and revised data, I believe this works:
locdata %>%
  mutate(timestamp = dmy_hm(paste(date, time))) %>%
  group_by(id, date) %>%
  mutate(delta = timestamp - first(timestamp),
         steps = as.numeric(floor(delta / 3600)),
         change = ifelse(is.na(steps - lag(steps)), 1, steps - lag(steps))) %>%
  filter(change > 0) %>%
  select(id, date, timestamp)
Output:
Source: local data frame [10 x 3]
Groups: id, date
id date timestamp
1 6 30-Jun-09 2009-06-30 16:19:00
2 6 30-Jun-09 2009-06-30 17:36:00
3 6 30-Jun-09 2009-06-30 18:52:00
4 6 7-Aug-10 2010-08-07 05:30:00
5 6 7-Aug-10 2010-08-07 06:45:00
6 23 30-Jun-09 2009-06-30 17:15:00
7 23 30-Jun-09 2009-06-30 20:00:00
8 23 30-Jun-09 2009-06-30 22:19:00
9 23 18-Jul-11 2011-07-18 16:22:00
10 23 18-Jul-11 2011-07-18 17:50:00
How it works:
- Create timestamp as before
- Group the data by id and date
- Compute the delta in seconds between the first timestamp in each group (i.e. the first observation of one animal on a given day) and each subsequent observation in that group; store that in a new column delta
- Determine which observations (if any) are more than 3600 seconds from the first one, in increments of 3600 seconds; store that in a new column steps
- Determine which observations are one or more steps from the first observation (and keep the first observation as well); store that in a new column change
- Keep only observations where change is 1 or more -- i.e. where the observation is one or more hours from the previous observation and from the first observation in the group
- Keep only the columns of interest
To get comfortable with how it works, drop the filter and select from the end and inspect the output:
Source: local data frame [18 x 7]
Groups: id, date
id date time timestamp delta steps change
1 6 30-Jun-09 16:19 2009-06-30 16:19:00 0 secs 0 1
2 6 30-Jun-09 16:31 2009-06-30 16:31:00 720 secs 0 0
3 6 30-Jun-09 17:36 2009-06-30 17:36:00 4620 secs 1 1
4 6 30-Jun-09 17:45 2009-06-30 17:45:00 5160 secs 1 0
5 6 30-Jun-09 18:00 2009-06-30 18:00:00 6060 secs 1 0
6 6 30-Jun-09 18:52 2009-06-30 18:52:00 9180 secs 2 1
7 6 7-Aug-10 5:30 2010-08-07 05:30:00 0 secs 0 1
8 6 7-Aug-10 5:45 2010-08-07 05:45:00 900 secs 0 0
9 6 7-Aug-10 6:00 2010-08-07 06:00:00 1800 secs 0 0
10 6 7-Aug-10 6:45 2010-08-07 06:45:00 4500 secs 1 1
11 23 30-Jun-09 17:15 2009-06-30 17:15:00 0 secs 0 1
12 23 30-Jun-09 17:38 2009-06-30 17:38:00 1380 secs 0 0
13 23 30-Jun-09 17:56 2009-06-30 17:56:00 2460 secs 0 0
14 23 30-Jun-09 20:00 2009-06-30 20:00:00 9900 secs 2 2
15 23 30-Jun-09 22:19 2009-06-30 22:19:00 18240 secs 5 3
16 23 18-Jul-11 16:22 2011-07-18 16:22:00 0 secs 0 1
17 23 18-Jul-11 17:50 2011-07-18 17:50:00 5280 secs 1 1
18 23 18-Jul-11 18:15 2011-07-18 18:15:00 6780 secs 1 0
Answer 2:
I managed to build a function using tapply that selects the proper times and have unpacked it in a couple of different versions, although I've not yet assembled it in a form that matches your suggested output. Thinking about it, I'm wondering if it may be easier to get the proper form using lapply/split:
tapply(dat$d_time, list(dat$id, dat$date),
       function(dt) {
         Reduce(function(x, y) {
                  if (as.numeric(y) - as.numeric(tail(x, 1)) < 60*60) {
                    x
                  } else {
                    c(x, y)
                  }
                },
                dt,
                init = dt[1])
       })
#------------
18-Jul-11 30-Jun-09 7-Aug-10
6 NULL Numeric,3 Numeric,2
23 Numeric,2 Numeric,3 NULL
# c( ) removes the dimensions and unfortunately the INDEX items
c(tapply(dat$d_time, list(dat$id, dat$date),
         function(dt) Reduce(function(x, y)
             if (as.numeric(y) - as.numeric(tail(x, 1)) < 60*60) x else c(x, y),
           dt, init = dt[1])))
[[1]]
NULL
[[2]]
[1] "2011-07-18 16:22:00 PDT" "2011-07-18 17:50:00 PDT"
[[3]]
[1] "2009-06-30 16:19:00 PDT" "2009-06-30 17:36:00 PDT"
[3] "2009-06-30 18:52:00 PDT"
[[4]]
[1] "2009-06-30 17:15:00 PDT" "2009-06-30 20:00:00 PDT"
[3] "2009-06-30 22:19:00 PDT"
[[5]]
[1] "2010-08-07 05:30:00 PDT" "2010-08-07 06:45:00 PDT"
[[6]]
NULL
# unlist does something similar
unlist(tapply(dat$d_time, list(dat$id, dat$date),
              function(dt) Reduce(function(x, y)
                  if (as.numeric(y) - as.numeric(tail(x, 1)) < 60*60) x else c(x, y),
                dt, init = dt[1])))
[1] 1311031320 1311036600 1246403940 1246408560 1246413120 1246407300
[7] 1246417200 1246425540 1281184200 1281188700
# It's possible to restore the date-time class.
as.POSIXct(unlist(tapply(dat$d_time, list(dat$id, dat$date),
                         function(dt) Reduce(function(x, y)
                             if (as.numeric(y) - as.numeric(tail(x, 1)) < 60*60) x else c(x, y),
                           dt, init = dt[1]))),
           origin = "1970-01-01")
[1] "2011-07-18 16:22:00 PDT" "2011-07-18 17:50:00 PDT"
[3] "2009-06-30 16:19:00 PDT" "2009-06-30 17:36:00 PDT"
[5] "2009-06-30 18:52:00 PDT" "2009-06-30 17:15:00 PDT"
[7] "2009-06-30 20:00:00 PDT" "2009-06-30 22:19:00 PDT"
[9] "2010-08-07 05:30:00 PDT" "2010-08-07 06:45:00 PDT"
# This keeps the INDEX values as row and column names
as.data.frame( tapply(dat$d_time, list(dat$id, dat$date), function(dt) Reduce( function(x,y) if(as.numeric(y)-as.numeric(tail(x,1)) < 60*60){ x } else {c(x,y)} , dt, init=dt[1])) )
18-Jul-11 30-Jun-09
6 NULL 1246403940, 1246408560, 1246413120
23 1311031320, 1311036600 1246407300, 1246417200, 1246425540
7-Aug-10
6 1281184200, 1281188700
23 NULL
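The lapply/split idea mentioned at the top of this answer could be sketched as follows. This is an untested arrangement, assuming dat has the id and date columns and the POSIXct d_time column used in the tapply calls above; unlike those calls, it keeps whole rows, so id and date survive into the result:

```r
# Split the data frame by id and date, keep qualifying rows within each piece
# (each kept row 60+ minutes after the previously kept one), then recombine.
pieces <- split(dat, list(dat$id, dat$date), drop = TRUE)
selected <- lapply(pieces, function(g) {
  kept <- 1L                      # always keep the first row of each piece
  for (i in seq_len(nrow(g))[-1]) {
    if (as.numeric(g$d_time[i]) - as.numeric(g$d_time[tail(kept, 1)]) >= 60*60)
      kept <- c(kept, i)
  }
  g[kept, ]
})
do.call(rbind, selected)
```

Note that, like the tapply versions, this resets the 60-minute rule at each id/date combination rather than carrying it across days.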
Source: https://stackoverflow.com/questions/29903994/how-to-subset-and-extract-time-series-by-time-interval-in-row