Filter rows by a time threshold

Submitted by 被刻印的时光 ゝ on 2020-01-13 07:28:28

Question


I have a dataset organized this way:

ID   Species       DateTime
P1   A             2015-03-16 18:42:00
P2   A             2015-03-16 19:34:00
P3   A             2015-03-16 19:58:00
P4   A             2015-03-16 21:02:00
P5   B             2015-03-16 21:18:00
P6   A             2015-03-16 21:19:00
P7   A             2015-03-16 21:33:00
P8   B             2015-03-16 21:35:00
P9   B             2015-03-16 23:43:00

I want to select independent pictures for each species (that is, pictures separated from each other by at least 1h) from this dataset using R.

In this example, for species A, I would only want to keep P1, P3 and P4. P2 is dropped because it falls within the 1h period that started with P1 (18h42 to 19h42). P3 is kept since its DateTime (19h58) falls after 19h42, and it starts the next 1h period, which lasts until 20h58. P4 (21h02) falls after that, so it is kept too, while P6 and P7 fall within the 1h period that started with P4. For species B, only P5 and P9 are kept.

Therefore, after this filter, my dataset would look like this:

ID   Species       DateTime
P1   A             2015-03-16 18:42:00
P3   A             2015-03-16 19:58:00
P4   A             2015-03-16 21:02:00
P5   B             2015-03-16 21:18:00
P9   B             2015-03-16 23:43:00

Does someone know how to perform this in R?


Answer 1:


There may be a more elegant way to do it, but this works:

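library(dplyr)

# Return TRUE for each timestamp that is at least one hour after the
# most recently kept timestamp. Expects dt sorted in ascending order.
isHourApart <- function(dt) {
    secs <- as.numeric(dt)            # POSIXct -> seconds since the epoch
    keeps <- logical(length(secs))    # preallocated, all FALSE
    last_kept <- -Inf                 # avoids shadowing base::min
    for (i in seq_along(secs)) {
        if (secs[i] >= last_kept + 60 * 60) {
            last_kept <- secs[i]
            keeps[i] <- TRUE
        }
    }
    keeps
}

# Sort first so that each group is processed in time order
df <- df %>% 
    arrange(Species, DateTime) %>% 
    group_by(Species) %>% 
    filter(isHourApart(DateTime))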

> df
# A tibble: 5 x 3
# Groups:   Species [2]
  ID    Species DateTime           
  <chr> <fct>   <dttm>             
1 P1    A       2015-03-16 18:42:00
2 P3    A       2015-03-16 19:58:00
3 P4    A       2015-03-16 21:02:00
4 P5    B       2015-03-16 21:18:00
5 P9    B       2015-03-16 23:43:00

Note that this assumes the DateTime column is of class POSIXct and that the rows are in ascending DateTime order within each species.
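For readers without dplyr, the same greedy rule can be sketched in base R. This is only an illustration under assumptions: the helper name `keep_hour_apart`, its `gap` argument, and the UTC time zone are not from the original answer, and the data frame is rebuilt from the question's table.

```r
# Sample data from the question (UTC is an assumption; the question gives no time zone)
df <- data.frame(
    ID = paste0("P", 1:9),
    Species = c("A", "A", "A", "A", "B", "A", "A", "B", "B"),
    DateTime = as.POSIXct(c(
        "2015-03-16 18:42:00", "2015-03-16 19:34:00", "2015-03-16 19:58:00",
        "2015-03-16 21:02:00", "2015-03-16 21:18:00", "2015-03-16 21:19:00",
        "2015-03-16 21:33:00", "2015-03-16 21:35:00", "2015-03-16 23:43:00"
    ), tz = "UTC"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a timestamp is at least `gap` seconds
# after the most recently kept one. Expects ascending input.
keep_hour_apart <- function(dt, gap = 60 * 60) {
    secs <- as.numeric(dt)
    keep <- logical(length(secs))
    last_kept <- -Inf
    for (i in seq_along(secs)) {
        if (secs[i] - last_kept >= gap) {
            keep[i] <- TRUE
            last_kept <- secs[i]
        }
    }
    keep
}

# Sort, then apply the helper within each species.
# ave() returns numeric 0/1 here, so convert back to logical.
df <- df[order(df$Species, df$DateTime), ]
keep <- as.logical(ave(as.numeric(df$DateTime), df$Species, FUN = keep_hour_apart))
result <- df[keep, ]
result$ID
```

The loop is sequential on purpose: whether a picture is kept depends on which earlier picture was kept last, so a purely vectorised comparison with the previous row is not enough.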




Answer 2:


Here is a dplyr solution:

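library(dplyr)
df %>%
    # convert before sorting so that rows are ordered by time, not by text
    mutate(DateTime = as.POSIXct(DateTime)) %>%
    arrange(Species, DateTime) %>%
    group_by(Species) %>%
    mutate(
        # minutes since the previous picture; units are fixed explicitly because
        # difftime otherwise picks secs/mins/hours depending on the data
        diff = as.numeric(difftime(DateTime, lag(DateTime), units = "mins")),
        diff = ifelse(is.na(diff), 0, diff),
        # index of the 60-minute bin, counted from the first picture of the species
        cumdiff = cumsum(diff) %/% 60,
        x = abs(lag(cumdiff) - cumdiff)) %>%
    filter(is.na(x) | x > 0) %>%
    select(ID, Species, DateTime) %>%
    ungroup() %>%
    as.data.frame()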
#  ID Species            DateTime
#1 P1       A 2015-03-16 18:42:00
#2 P3       A 2015-03-16 19:58:00
#3 P4       A 2015-03-16 21:02:00
#4 P5       B 2015-03-16 21:18:00
#5 P9       B 2015-03-16 23:43:00
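
One caveat worth checking: dividing the cumulative elapsed minutes by 60 produces fixed 60-minute windows counted from the first picture of each species, whereas the question restarts the window at each kept picture. The base R sketch below uses made-up timestamps (not from the question's data) to show a case where the two rules disagree; all variable names are illustrative.

```r
# Hypothetical timestamps for a single species (not from the question's data)
times <- as.POSIXct(c("2015-03-16 18:42:00",
                      "2015-03-16 19:58:00",
                      "2015-03-16 20:50:00"), tz = "UTC")

# Fixed-bin rule: minutes since the first picture, integer-divided by 60,
# keeping the first picture that lands in each bin
mins <- as.numeric(difftime(times, times[1], units = "mins"))
bin <- mins %/% 60
keep_fixed <- !duplicated(bin)

# Greedy rule from the question: keep a picture only if it is at least
# 60 minutes after the most recently kept one
keep_greedy <- logical(length(times))
last_kept <- -Inf
for (i in seq_along(mins)) {
    if (mins[i] - last_kept >= 60) {
        keep_greedy[i] <- TRUE
        last_kept <- mins[i]
    }
}

keep_fixed   # TRUE TRUE TRUE: 20:50 opens a new fixed bin
keep_greedy  # TRUE TRUE FALSE: 20:50 is only 52 minutes after 19:58
```

On the question's sample data both rules give the same five rows, so the difference only matters for other inputs.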

Sample data

df <- read.table(text = "ID   Species       DateTime
P1   A             '2015-03-16 18:42:00'
P2   A             '2015-03-16 19:34:00'
P3   A             '2015-03-16 19:58:00'
P4   A             '2015-03-16 21:02:00'
P5   B             '2015-03-16 21:18:00'
P6   A             '2015-03-16 21:19:00'
P7   A             '2015-03-16 21:33:00'
P8   B             '2015-03-16 21:35:00'
P9   B             '2015-03-16 23:43:00'", header = TRUE)



Answer 3:


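Here's one way of doing it using data.table. Note that the sample input below already contains only the five expected rows, and the join is keyed on ID, so each row can only match itself; as written, the code returns its input unchanged rather than filtering the full dataset: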

library(data.table)
library(lubridate)

df1 <- read.table(text = "ID   Species       DateTime
P1   A             '2015-03-16 18:42:00'
                 P3   A             '2015-03-16 19:58:00'
                 P4   A             '2015-03-16 21:02:00'
                 P5   B             '2015-03-16 21:18:00'
                 P9   B             '2015-03-16 23:43:00'", 
                 header = TRUE, stringsAsFactors = FALSE)

setDT(df1)
df1[, DateTime := ymd_hms(DateTime)]
df1[, date_range := DateTime + 60 * 60]
df2 <- copy(df1)
df2[, date := DateTime]
df2[, DateTime := NULL]
df <- df2[df1,
          .(ID, Species, date = x.date, DateTime, date_range),
          on = .(ID, Species, date >= DateTime, date <= date_range),
          nomatch = 0L, allow.cartesian = TRUE]
df[, c("date", "date_range") := NULL]

   ID Species            DateTime
1: P1       A 2015-03-16 18:42:00
2: P3       A 2015-03-16 19:58:00
3: P4       A 2015-03-16 21:02:00
4: P5       B 2015-03-16 21:18:00
5: P9       B 2015-03-16 23:43:00
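
To apply the question's rule to the full nine-row dataset with data.table, one option is a grouped helper rather than a join. This is a minimal sketch under assumptions: the helper `keep_hour_apart`, the temporary `keep` column, and the UTC time zone are introduced here for illustration and are not part of the original answer.

```r
library(data.table)

# Full sample data from the question (UTC is an assumption)
dt <- data.table(
    ID = paste0("P", 1:9),
    Species = c("A", "A", "A", "A", "B", "A", "A", "B", "B"),
    DateTime = as.POSIXct(c(
        "2015-03-16 18:42:00", "2015-03-16 19:34:00", "2015-03-16 19:58:00",
        "2015-03-16 21:02:00", "2015-03-16 21:18:00", "2015-03-16 21:19:00",
        "2015-03-16 21:33:00", "2015-03-16 21:35:00", "2015-03-16 23:43:00"
    ), tz = "UTC")
)

# Hypothetical helper: TRUE where a timestamp is at least `gap` seconds
# after the most recently kept one. Expects ascending input.
keep_hour_apart <- function(x, gap = 60 * 60) {
    secs <- as.numeric(x)
    keep <- logical(length(secs))
    last_kept <- -Inf
    for (i in seq_along(secs)) {
        if (secs[i] - last_kept >= gap) {
            keep[i] <- TRUE
            last_kept <- secs[i]
        }
    }
    keep
}

# Sort by reference, flag rows within each species, then keep flagged rows
setorder(dt, Species, DateTime)
dt[, keep := keep_hour_apart(DateTime), by = Species]
res <- dt[keep == TRUE, .(ID, Species, DateTime)]
res
```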



Answer 4:


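We can simply create a new column with 60-minute intervals and then keep the first occurrence for each Species within each interval. Note that cut() builds fixed intervals starting at the earliest DateTime in the whole dataset (truncated to the minute), so this reproduces the expected output here, but two kept pictures can still be less than 1h apart if they fall on either side of an interval boundary.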

df %>%
  mutate(by60 = cut(DateTime, "60 min")) %>%
  group_by(Species, by60) %>%
  slice(1)

Output1

# A tibble: 5 x 4
# Groups:   Species, by60 [5]
  ID    Species DateTime            by60               
  <chr> <chr>   <dttm>              <fct>              
1 P1    A       2015-03-16 18:42:00 2015-03-16 18:42:00
2 P3    A       2015-03-16 19:58:00 2015-03-16 19:42:00
3 P4    A       2015-03-16 21:02:00 2015-03-16 20:42:00
4 P5    B       2015-03-16 21:18:00 2015-03-16 20:42:00
5 P9    B       2015-03-16 23:43:00 2015-03-16 23:42:00

If we'd like to drop that dummy column:

df %>%
  mutate(by60 = cut(DateTime, "60 min")) %>%
  group_by(Species, by60) %>%
  slice(1) %>% 
  ungroup() %>% 
  select(-by60)

Output2

# A tibble: 5 x 3
  ID    Species DateTime           
  <chr> <chr>   <dttm>             
1 P1    A       2015-03-16 18:42:00
2 P3    A       2015-03-16 19:58:00
3 P4    A       2015-03-16 21:02:00
4 P5    B       2015-03-16 21:18:00
5 P9    B       2015-03-16 23:43:00


Source: https://stackoverflow.com/questions/49017493/filter-rows-by-a-time-threshold
