Filter rows by a time threshold

Question

I have a dataset organized this way:

ID   Species       DateTime
P1   A             2015-03-16 18:42:00
P2   A             2015-03-16 19:34:00
P3   A             2015-03-16 19:58:00
P4   A             2015-03-16 21:02:00
P5   B             2015-03-16 21:18:00
P6   A             2015-03-16 21:19:00
P7   A             2015-03-16 21:33:00
P8   B             2015-03-16 21:35:00
P9   B             2015-03-16 23:43:00

For each species, I want to keep only independent pictures, that is, pictures taken at least 1 h apart, using R.

In this example, for species A, I would only want to keep P1, P3 and P4. P2 is dropped because it falls within the 1 h period that started with P1 (18:42 to 19:42). P3 is kept because its DateTime (19:58) falls after 19:42, and it starts the next 1 h period, which lasts until 20:58. For species B, only P5 and P9 are kept.

Therefore, after this filter, my dataset would look like this:

ID   Species       DateTime
P1   A             2015-03-16 18:42:00
P3   A             2015-03-16 19:58:00
P4   A             2015-03-16 21:02:00
P5   B             2015-03-16 21:18:00
P9   B             2015-03-16 23:43:00

Does someone know how to perform this in R?

Answer 1:

There may be a more elegant way to do it, but this works:

library(dplyr)

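# Returns TRUE for each timestamp that is at least one hour after the
# last kept timestamp. Assumes dt is POSIXct and sorted in ascending order.
isHourApart <- function(dt) {
    secs <- as.numeric(dt)        # seconds since the epoch
    last_kept <- -Inf             # avoids shadowing base::min()
    keeps <- logical(length(secs))
    for (i in seq_along(secs)) {
        if (secs[i] >= last_kept + 60 * 60) {
            last_kept <- secs[i]
            keeps[i] <- TRUE
        }
    }
    keeps
}

df <- df %>% 
    arrange(Species, DateTime) %>% 
    group_by(Species) %>% 
    filter(isHourApart(DateTime))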

> df
# A tibble: 5 x 3
# Groups:   Species [2]
  ID    Species DateTime           
  <chr> <fct>   <dttm>             
1 P1    A       2015-03-16 18:42:00
2 P3    A       2015-03-16 19:58:00
3 P4    A       2015-03-16 21:02:00
4 P5    B       2015-03-16 21:18:00
5 P9    B       2015-03-16 23:43:00

Note that the DateTime column is of class POSIXct.
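
If DateTime is still stored as text, it needs converting first, and the loop also relies on each species' rows being in chronological order. A minimal base R sketch of both steps on the question's data follows. The helper name `keep_hour_apart` and the "UTC" time zone are assumptions made for illustration, not part of the original answer.

```r
# Sample data from the question, with DateTime still stored as text.
df <- data.frame(
  ID = paste0("P", 1:9),
  Species = c("A", "A", "A", "A", "B", "A", "A", "B", "B"),
  DateTime = c("2015-03-16 18:42:00", "2015-03-16 19:34:00",
               "2015-03-16 19:58:00", "2015-03-16 21:02:00",
               "2015-03-16 21:18:00", "2015-03-16 21:19:00",
               "2015-03-16 21:33:00", "2015-03-16 21:35:00",
               "2015-03-16 23:43:00"),
  stringsAsFactors = FALSE
)

# Convert to POSIXct; the "UTC" time zone is an assumption.
df$DateTime <- as.POSIXct(df$DateTime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Sort so that each species' rows are in chronological order.
df <- df[order(df$Species, df$DateTime), ]

# Hypothetical helper: TRUE where a timestamp (in seconds) is at least
# `gap` seconds after the last kept timestamp.
keep_hour_apart <- function(secs, gap = 60 * 60) {
  keep <- logical(length(secs))
  last_kept <- -Inf
  for (i in seq_along(secs)) {
    if (secs[i] >= last_kept + gap) {
      keep[i] <- TRUE
      last_kept <- secs[i]
    }
  }
  keep
}

# ave() applies the helper within each species; it returns 0/1 values
# here, so convert them back to logical before subsetting.
keep <- as.logical(ave(as.numeric(df$DateTime), df$Species, FUN = keep_hour_apart))
result <- df[keep, ]
result$ID
```

On this data the kept IDs should be P1, P3, P4, P5 and P9, matching the expected output in the question.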

Answer 2:

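Here is a dplyr solution. It assigns each row to a 60-minute bucket counted from the first record of its species and keeps the first row of each bucket:

library(dplyr)
df %>%
    arrange(Species, DateTime) %>%
    group_by(Species) %>%
    mutate(
        DateTime = as.POSIXct(DateTime),
        # minutes since the previous row, with explicit units
        diff = abs(as.numeric(difftime(DateTime, lag(DateTime), units = "mins"))),
        diff = ifelse(is.na(diff), 0, diff),
        # 60-minute bucket index, counted from the first row of the group
        cumdiff = cumsum(diff) %/% 60,
        x = abs(lag(cumdiff) - cumdiff)) %>%
    filter(is.na(x) | x > 0) %>%
    select(ID, Species, DateTime) %>%
    ungroup() %>%
    as.data.frame()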
#  ID Species            DateTime
#1 P1       A 2015-03-16 18:42:00
#2 P3       A 2015-03-16 19:58:00
#3 P4       A 2015-03-16 21:02:00
#4 P5       B 2015-03-16 21:18:00
#5 P9       B 2015-03-16 23:43:00

Sample data

df <- read.table(text = "ID   Species       DateTime
P1   A             '2015-03-16 18:42:00'
P2   A             '2015-03-16 19:34:00'
P3   A             '2015-03-16 19:58:00'
P4   A             '2015-03-16 21:02:00'
P5   B             '2015-03-16 21:18:00'
P6   A             '2015-03-16 21:19:00'
P7   A             '2015-03-16 21:33:00'
P8   B             '2015-03-16 21:35:00'
P9   B             '2015-03-16 23:43:00'", header = T);
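
The `%/% 60` step in the pipeline only makes sense if the time differences are measured in minutes. Subtracting two POSIXct values lets R choose the units automatically, so it may be worth pinning them down with `difftime(..., units = "mins")`. A small sketch using three timestamps from the sample data (the "UTC" time zone is an assumption):

```r
t0 <- as.POSIXct("2015-03-16 18:42:00", tz = "UTC")
t1 <- as.POSIXct("2015-03-16 19:34:00", tz = "UTC")  # 52 minutes after t0
t2 <- as.POSIXct("2015-03-16 21:02:00", tz = "UTC")  # 140 minutes after t0

# Plain subtraction picks units from the size of the difference.
short_gap <- t1 - t0
long_gap  <- t2 - t0
units(short_gap)  # "mins"
units(long_gap)   # "hours"

# Explicit units give numbers that are always in minutes.
gap_mins <- as.numeric(difftime(c(t1, t2), t0, units = "mins"))
gap_mins          # 52 140
```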

Answer 3:

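Here's one way of doing it using data.table. Note that df1 below is read in with only the five expected rows, and the non-equi self-join also matches on ID, so each row is matched only to itself: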

library(data.table)
library(lubridate)

df1 <- read.table(text = "ID   Species       DateTime
P1   A             '2015-03-16 18:42:00'
P3   A             '2015-03-16 19:58:00'
P4   A             '2015-03-16 21:02:00'
P5   B             '2015-03-16 21:18:00'
P9   B             '2015-03-16 23:43:00'",
                  header = TRUE, stringsAsFactors = FALSE)

setDT(df1)
df1[, DateTime := ymd_hms(DateTime)]
df1[, date_range := DateTime + 60 * 60]
df2 <- copy(df1)
df2[, date := DateTime]
df2[, DateTime := NULL]
df <- df2[df1,
          .(ID, Species, date = x.date, DateTime, date_range),
          on = .(ID, Species, date >= DateTime, date <= date_range),
          nomatch = 0L, allow.cartesian = TRUE]
df[, c("date", "date_range") := NULL]

   ID Species            DateTime
1: P1       A 2015-03-16 18:42:00
2: P3       A 2015-03-16 19:58:00
3: P4       A 2015-03-16 21:02:00
4: P5       B 2015-03-16 21:18:00
5: P9       B 2015-03-16 23:43:00

Answer 4:

We can simply create a new column of 60-minute intervals and then keep the first occurrence for each Species within each interval.

df %>%
  mutate(by60 = cut(DateTime, "60 min")) %>%
  group_by(Species, by60) %>%
  slice(1)

Output1

# A tibble: 5 x 4
# Groups:   Species, by60 [5]
  ID    Species DateTime            by60               
  <chr> <chr>   <dttm>              <fct>              
1 P1    A       2015-03-16 18:42:00 2015-03-16 18:42:00
2 P3    A       2015-03-16 19:58:00 2015-03-16 19:42:00
3 P4    A       2015-03-16 21:02:00 2015-03-16 20:42:00
4 P5    B       2015-03-16 21:18:00 2015-03-16 20:42:00
5 P9    B       2015-03-16 23:43:00 2015-03-16 23:42:00

If we'd like to drop that dummy column:

df %>%
  mutate(by60 = cut(DateTime, "60 min")) %>%
  group_by(Species, by60) %>%
  slice(1) %>% 
  ungroup() %>% 
  select(-by60)

Output2

# A tibble: 5 x 3
  ID    Species DateTime           
  <chr> <chr>   <dttm>             
1 P1    A       2015-03-16 18:42:00
2 P3    A       2015-03-16 19:58:00
3 P4    A       2015-03-16 21:02:00
4 P5    B       2015-03-16 21:18:00
5 P9    B       2015-03-16 23:43:00
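
One thing worth checking before relying on fixed intervals: `cut()` places its breaks at fixed 60-minute marks counted from the earliest timestamp, whereas the question restarts the hour at each kept picture. The two agree on the sample data but can differ elsewhere. A minimal sketch of the difference, using made-up timestamps (not from the question) and a hypothetical helper `rolling_keep()`:

```r
# Hypothetical timestamps for a single species (not from the question's data).
times <- as.POSIXct(c("2015-03-16 10:00:00", "2015-03-16 10:50:00",
                      "2015-03-16 11:10:00", "2015-03-16 12:05:00"),
                    tz = "UTC")

# Fixed bins: keep the first record in each 60-minute interval,
# which mirrors cut() followed by slice(1).
bins <- cut(times, "60 min")
fixed_keep <- !duplicated(bins)

# Rolling rule from the question: a record is kept only if it is at
# least `gap` seconds after the last kept record.
rolling_keep <- function(t, gap = 60 * 60) {
  secs <- as.numeric(t)
  keep <- logical(length(secs))
  last_kept <- -Inf
  for (i in seq_along(secs)) {
    if (secs[i] >= last_kept + gap) {
      keep[i] <- TRUE
      last_kept <- secs[i]
    }
  }
  keep
}
roll_keep <- rolling_keep(times)

format(times[fixed_keep], "%H:%M")  # "10:00" "11:10" "12:05"
format(times[roll_keep], "%H:%M")   # "10:00" "11:10"
```

Here 12:05 opens a new fixed bin (12:00 to 13:00) but is only 55 minutes after the kept 11:10 record, so the rolling rule drops it.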

Source: https://stackoverflow.com/questions/49017493/filter-rows-by-a-time-threshold

Tags

dataframe

time

filtering