Question
I am working on an analysis of animal locations that requires locations for each animal to be 60 minutes or greater apart. Time differences in locations among animals does not matter. The data set has a list of animal IDs and date and time of each location, example below.
For example, for animal 6 below, starting at the 16:19 location, the code would iterate through locations until it finds a location that is 60+ minutes from 16:19. In this case it would be the 17:36 location. Then, the code would start from the 17:36 location to find the next location (18:52) 60+ minutes, and so on. Each of the locations 60+ minutes from each other would then be extracted to a separate dataframe.
I have written a loop in R to subset the data, but I am having an issue with the code not accounting for a change in date when calculating whether locations are 60 minutes or more apart.
I have been exploring the lubridate package, which seems like it may have an easier way to address subsetting my data. However, I have not yet found a solution to subsetting the data to my specifications using this package. Any suggestions for using lubridate or an alternative method would be greatly appreciated.
Thank you in advance for your consideration.
> data(locdata)
> View(locdata)
id date time
6 30-Jun-09 16:19
6 30-Jun-09 16:31
6 30-Jun-09 17:36
6 30-Jun-09 17:45
6 30-Jun-09 18:00
6 30-Jun-09 18:52
6 7-Aug-10 5:30
6 7-Aug-10 5:45
6 7-Aug-10 6:00
6 7-Aug-10 6:45
23 30-Jun-09 17:15
23 30-Jun-09 17:38
23 30-Jun-09 17:56
23 30-Jun-09 20:00
23 30-Jun-09 22:19
23 18-Jul-11 16:22
23 18-Jul-11 17:50
23 18-Jul-11 18:15
The output from the example data above would look like this:
id date time
6 30-Jun-09 16:19
6 30-Jun-09 17:36
6 30-Jun-09 18:52
6 7-Aug-10 5:30
6 7-Aug-10 6:45
23 30-Jun-09 17:15
23 30-Jun-09 20:00
23 30-Jun-09 22:19
23 18-Jul-11 16:22
23 18-Jul-11 17:50
Answer 1:
If I understood you correctly, I think you're looking for something along these lines:
library(dplyr)
library(lubridate)
locdata %>%
  mutate(timestamp = dmy_hm(paste(date, time))) %>%
  group_by(id, date) %>%
  mutate(delta = timestamp - lag(timestamp))
If you haven't used dplyr or magrittr before, the syntax above may be unclear. The %>% operator passes the results of each computation to the next function, so the above code does the following:
- Parse the date and time into a timestamp that R understands, using lubridate
- Group the data by id and unique dates
- Within each group, calculate the duration between observations
If you want to save the output, change the first line to something like results <- locdata %>%.
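As a side note on the original problem: the reason parsing helps with date changes is that dmy_hm() returns a true POSIXct timestamp, so subtraction works correctly across midnight. A minimal illustration (not part of the pipeline above):

```r
library(lubridate)

# dmy_hm() parses "day-month-year hour:minute" strings into POSIXct,
# so time differences account for date rollovers automatically.
t1 <- dmy_hm("30-Jun-09 23:30")
t2 <- dmy_hm("1-Jul-09 0:45")
difftime(t2, t1, units = "mins")  # Time difference of 75 mins
```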
Based on your updated question and revised data, I believe this works:
locdata %>%
  mutate(timestamp = dmy_hm(paste(date, time))) %>%
  group_by(id, date) %>%
  mutate(delta = timestamp - first(timestamp),
         steps = as.numeric(floor(delta / 3600)),
         change = ifelse(is.na(steps - lag(steps)), 1, steps - lag(steps))) %>%
  filter(change > 0) %>%
  select(id, date, timestamp)
Output:
Source: local data frame [10 x 3]
Groups: id, date
id date timestamp
1 6 30-Jun-09 2009-06-30 16:19:00
2 6 30-Jun-09 2009-06-30 17:36:00
3 6 30-Jun-09 2009-06-30 18:52:00
4 6 7-Aug-10 2010-08-07 05:30:00
5 6 7-Aug-10 2010-08-07 06:45:00
6 23 30-Jun-09 2009-06-30 17:15:00
7 23 30-Jun-09 2009-06-30 20:00:00
8 23 30-Jun-09 2009-06-30 22:19:00
9 23 18-Jul-11 2011-07-18 16:22:00
10 23 18-Jul-11 2011-07-18 17:50:00
How it works:
- Create timestamp as before
- Group the data by id and date
- Compute the delta in seconds between the first timestamp in each group (i.e. the first observation of one animal on a given day) and each subsequent observation in that group; store that in a new column delta
- Determine which observations (if any) are more than 3600 seconds from the first one, in increments of 3600 seconds; store that in a new column steps
- Determine which observations are one or more steps from the first observation (and keep the first observation as well); store that in a new column change
- Keep only observations where change is 1 or more -- i.e. where the observation is one or more hours from the previous observation and from the first observation in the group
- Keep only the columns of interest
To get comfortable with how it works, drop the filter and select from the end and inspect the output:
Source: local data frame [18 x 7]
Groups: id, date
id date time timestamp delta steps change
1 6 30-Jun-09 16:19 2009-06-30 16:19:00 0 secs 0 1
2 6 30-Jun-09 16:31 2009-06-30 16:31:00 720 secs 0 0
3 6 30-Jun-09 17:36 2009-06-30 17:36:00 4620 secs 1 1
4 6 30-Jun-09 17:45 2009-06-30 17:45:00 5160 secs 1 0
5 6 30-Jun-09 18:00 2009-06-30 18:00:00 6060 secs 1 0
6 6 30-Jun-09 18:52 2009-06-30 18:52:00 9180 secs 2 1
7 6 7-Aug-10 5:30 2010-08-07 05:30:00 0 secs 0 1
8 6 7-Aug-10 5:45 2010-08-07 05:45:00 900 secs 0 0
9 6 7-Aug-10 6:00 2010-08-07 06:00:00 1800 secs 0 0
10 6 7-Aug-10 6:45 2010-08-07 06:45:00 4500 secs 1 1
11 23 30-Jun-09 17:15 2009-06-30 17:15:00 0 secs 0 1
12 23 30-Jun-09 17:38 2009-06-30 17:38:00 1380 secs 0 0
13 23 30-Jun-09 17:56 2009-06-30 17:56:00 2460 secs 0 0
14 23 30-Jun-09 20:00 2009-06-30 20:00:00 9900 secs 2 2
15 23 30-Jun-09 22:19 2009-06-30 22:19:00 18240 secs 5 3
16 23 18-Jul-11 16:22 2011-07-18 16:22:00 0 secs 0 1
17 23 18-Jul-11 17:50 2011-07-18 17:50:00 5280 secs 1 1
18 23 18-Jul-11 18:15 2011-07-18 18:15:00 6780 secs 1 0
Answer 2:
I managed to build a function using tapply that selects the proper times and have unpacked it in a couple of different versions, although I've not yet assembled it in a form that matches your suggested output. Thinking about it, I'm wondering if it may be easier to get the proper form using lapply/split:
tapply(dat$d_time, list(dat$id, dat$date),
       function(dt) {
         Reduce(function(x, y) {
                  if (as.numeric(y) - as.numeric(tail(x, 1)) < 60*60) {
                    x
                  } else {
                    c(x, y)
                  }
                },
                dt,
                init = dt[1])
       })
#------------
18-Jul-11 30-Jun-09 7-Aug-10
6 NULL Numeric,3 Numeric,2
23 Numeric,2 Numeric,3 NULL
# c( ) removes the dimensions and unfortunately the INDEX items
c(tapply(dat$d_time, list(dat$id, dat$date),
         function(dt) Reduce(function(x, y)
             if (as.numeric(y) - as.numeric(tail(x, 1)) < 60*60) x else c(x, y),
           dt, init = dt[1])))
[[1]]
NULL
[[2]]
[1] "2011-07-18 16:22:00 PDT" "2011-07-18 17:50:00 PDT"
[[3]]
[1] "2009-06-30 16:19:00 PDT" "2009-06-30 17:36:00 PDT"
[3] "2009-06-30 18:52:00 PDT"
[[4]]
[1] "2009-06-30 17:15:00 PDT" "2009-06-30 20:00:00 PDT"
[3] "2009-06-30 22:19:00 PDT"
[[5]]
[1] "2010-08-07 05:30:00 PDT" "2010-08-07 06:45:00 PDT"
[[6]]
NULL
# unlist does something similar
unlist(tapply(dat$d_time, list(dat$id, dat$date),
              function(dt) Reduce(function(x, y)
                  if (as.numeric(y) - as.numeric(tail(x, 1)) < 60*60) x else c(x, y),
                dt, init = dt[1])))
[1] 1311031320 1311036600 1246403940 1246408560 1246413120 1246407300
[7] 1246417200 1246425540 1281184200 1281188700
# It's possible to restore the date-time class.
as.POSIXct(unlist(tapply(dat$d_time, list(dat$id, dat$date),
                         function(dt) Reduce(function(x, y)
                             if (as.numeric(y) - as.numeric(tail(x, 1)) < 60*60) x else c(x, y),
                           dt, init = dt[1]))),
           origin = "1970-01-01")
[1] "2011-07-18 16:22:00 PDT" "2011-07-18 17:50:00 PDT"
[3] "2009-06-30 16:19:00 PDT" "2009-06-30 17:36:00 PDT"
[5] "2009-06-30 18:52:00 PDT" "2009-06-30 17:15:00 PDT"
[7] "2009-06-30 20:00:00 PDT" "2009-06-30 22:19:00 PDT"
[9] "2010-08-07 05:30:00 PDT" "2010-08-07 06:45:00 PDT"
# This keeps the INDEX values as row and column names
as.data.frame( tapply(dat$d_time, list(dat$id, dat$date), function(dt) Reduce( function(x,y) if(as.numeric(y)-as.numeric(tail(x,1)) < 60*60){ x } else {c(x,y)} , dt, init=dt[1])) )
18-Jul-11 30-Jun-09
6 NULL 1246403940, 1246408560, 1246413120
23 1311031320, 1311036600 1246407300, 1246417200, 1246425540
7-Aug-10
6 1281184200, 1281188700
23 NULL
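The lapply/split idea mentioned at the top of this answer could be sketched as follows. This is an untested arrangement, assuming dat has the id and date columns and the POSIXct d_time column used in the tapply calls above; unlike those calls, it keeps whole rows, so id and date survive into the result:

```r
# Split the data frame by id and date, keep qualifying rows within each piece
# (each kept row 60+ minutes after the previously kept one), then recombine.
pieces <- split(dat, list(dat$id, dat$date), drop = TRUE)
selected <- lapply(pieces, function(g) {
  kept <- 1L                      # always keep the first row of each piece
  for (i in seq_len(nrow(g))[-1]) {
    if (as.numeric(g$d_time[i]) - as.numeric(g$d_time[tail(kept, 1)]) >= 60*60)
      kept <- c(kept, i)
  }
  g[kept, ]
})
do.call(rbind, selected)
```

Note that, like the tapply versions, this resets the 60-minute rule at each id/date combination rather than carrying it across days.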
Source: https://stackoverflow.com/questions/29903994/how-to-subset-and-extract-time-series-by-time-interval-in-row