Question
I have a dataset organized this way:
ID Species DateTime
P1 A 2015-03-16 18:42:00
P2 A 2015-03-16 19:34:00
P3 A 2015-03-16 19:58:00
P4 A 2015-03-16 21:02:00
P5 B 2015-03-16 21:18:00
P6 A 2015-03-16 21:19:00
P7 A 2015-03-16 21:33:00
P8 B 2015-03-16 21:35:00
P9 B 2015-03-16 23:43:00
I want to select independent pictures for each species in this dataset with R, that is, pictures separated from each other by at least 1 h.
In this example, for species A I would only want to keep P1, P3 and P4. P2 wouldn't be considered because it falls within the 1 h period that started with P1. P3 is considered since its DateTime (19:58) falls after 19:42, and the next 1 h period would then last until 20:58. For species B, I would keep only P5 and P9.
Therefore, after this filter, my dataset would look like this:
ID Species DateTime
P1 A 2015-03-16 18:42:00
P3 A 2015-03-16 19:58:00
P4 A 2015-03-16 21:02:00
P5 B 2015-03-16 21:18:00
P9 B 2015-03-16 23:43:00
Does someone know how to perform this in R?
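For reference, the example dataset can be rebuilt in R roughly like this (a sketch based on the table above; DateTime is kept as character strings for now):

df <- data.frame(
  ID       = c("P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9"),
  Species  = c("A", "A", "A", "A", "B", "A", "A", "B", "B"),
  DateTime = c("2015-03-16 18:42:00", "2015-03-16 19:34:00", "2015-03-16 19:58:00",
               "2015-03-16 21:02:00", "2015-03-16 21:18:00", "2015-03-16 21:19:00",
               "2015-03-16 21:33:00", "2015-03-16 21:35:00", "2015-03-16 23:43:00"),
  stringsAsFactors = FALSE
)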
Answer 1:
There may be a more elegant way to do it, but this works:
library(dplyr)

# Returns a logical vector: TRUE for the pictures to keep, i.e. those taken at
# least one hour after the previously kept picture.
isHourApart <- function(dt) {
  min <- 0          # time of the last picture that was kept
  keeps <- c()
  for (d in dt) {
    if (d >= min + 60 * 60) {
      min <- d
      keeps <- c(keeps, TRUE)
    } else {
      keeps <- c(keeps, FALSE)
    }
  }
  keeps
}

df %>%
  group_by(Species) %>%
  filter(isHourApart(DateTime))
which gives:
# A tibble: 5 x 3
# Groups: Species [2]
ID Species DateTime
<chr> <fct> <dttm>
1 P1 A 2015-03-16 18:42:00
2 P3 A 2015-03-16 19:58:00
3 P4 A 2015-03-16 21:02:00
4 P5 B 2015-03-16 21:18:00
5 P9 B 2015-03-16 23:43:00
Note that the DateTime column is of class POSIXct.
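If DateTime is stored as character strings (as in the data frame sketched above), a conversion along these lines, for example, turns it into POSIXct before filtering:

# Hypothetical conversion step, assuming DateTime was read in as character
df$DateTime <- as.POSIXct(df$DateTime, format = "%Y-%m-%d %H:%M:%S")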
Answer 2:
Here is a dplyr solution:
require(dplyr)

df %>%
  arrange(Species, DateTime) %>%
  group_by(Species) %>%
  mutate(
    DateTime = as.POSIXct(DateTime),
    diff = abs(lag(DateTime) - DateTime),       # gap to the previous picture
    diff = ifelse(is.na(diff), 0, diff),
    cumdiff = cumsum(as.numeric(diff)) %/% 60,  # cumulative gap in whole hours
    x = abs(lag(cumdiff) - cumdiff)) %>%        # changes when a new hour is reached
  filter(is.na(x) | x > 0) %>%
  select(ID, Species, DateTime) %>%
  ungroup() %>%
  as.data.frame()
# ID Species DateTime
#1 P1 A 2015-03-16 18:42:00
#2 P3 A 2015-03-16 19:58:00
#3 P4 A 2015-03-16 21:02:00
#4 P5 B 2015-03-16 21:18:00
#5 P9 B 2015-03-16 23:43:00
Sample data
df <- read.table(text = "ID Species DateTime
P1 A '2015-03-16 18:42:00'
P2 A '2015-03-16 19:34:00'
P3 A '2015-03-16 19:58:00'
P4 A '2015-03-16 21:02:00'
P5 B '2015-03-16 21:18:00'
P6 A '2015-03-16 21:19:00'
P7 A '2015-03-16 21:33:00'
P8 B '2015-03-16 21:35:00'
P9 B '2015-03-16 23:43:00'", header = TRUE)
Answer 3:
Here's one way of doing it using data.table:
library(data.table)
library(lubridate)
df1 <- read.table(text = "ID Species DateTime
P1 A '2015-03-16 18:42:00'
P3 A '2015-03-16 19:58:00'
P4 A '2015-03-16 21:02:00'
P5 B '2015-03-16 21:18:00'
P9 B '2015-03-16 23:43:00'",
header = TRUE, stringsAsFactors = FALSE)
setDT(df1)
df1[, DateTime := ymd_hms(DateTime)]
df1[, date_range := DateTime + 60 * 60]
df2 <- copy(df1)
df2[, date := DateTime]
df2[, DateTime := NULL]
df <- df2[df1,
          .(ID, Species, date = x.date, DateTime, date_range),
          on = .(ID, Species, date >= DateTime, date <= date_range),
          nomatch = 0L, allow.cartesian = TRUE]
df[, c("date", "date_range") := NULL]
ID Species DateTime
1: P1 A 2015-03-16 18:42:00
2: P3 A 2015-03-16 19:58:00
3: P4 A 2015-03-16 21:02:00
4: P5 B 2015-03-16 21:18:00
5: P9 B 2015-03-16 23:43:00
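For comparison, here is a minimal data.table sketch of the rolling one-hour rule, assuming the full nine-row dataset df and the isHourApart() helper from Answer 1 are available:

library(data.table)

dt <- as.data.table(df)                  # df: the full nine-row dataset from the question
dt[, DateTime := as.POSIXct(DateTime)]   # make sure times are POSIXct
setorder(dt, Species, DateTime)          # the rolling rule needs chronological order
dt[, .SD[isHourApart(DateTime)], by = Species]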
Answer 4:
We can simply create a new column with 60-minute intervals and then keep the first occurrence for each Species.
df %>%
  mutate(by60 = cut(DateTime, "60 min")) %>%
  group_by(Species, by60) %>%
  slice(1)
Output 1
# A tibble: 5 x 4
# Groups: Species, by60 [5]
ID Species DateTime by60
<chr> <chr> <dttm> <fct>
1 P1 A 2015-03-16 18:42:00 2015-03-16 18:42:00
2 P3 A 2015-03-16 19:58:00 2015-03-16 19:42:00
3 P4 A 2015-03-16 21:02:00 2015-03-16 20:42:00
4 P5 B 2015-03-16 21:18:00 2015-03-16 20:42:00
5 P9 B 2015-03-16 23:43:00 2015-03-16 23:42:00
If we'd like to drop that dummy column:
df %>%
  mutate(by60 = cut(DateTime, "60 min")) %>%
  group_by(Species, by60) %>%
  slice(1) %>%
  ungroup() %>%
  select(-by60)
Output 2
# A tibble: 5 x 3
ID Species DateTime
<chr> <chr> <dttm>
1 P1 A 2015-03-16 18:42:00
2 P3 A 2015-03-16 19:58:00
3 P4 A 2015-03-16 21:02:00
4 P5 B 2015-03-16 21:18:00
5 P9 B 2015-03-16 23:43:00
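As a quick sanity check (a sketch, assuming df is the full nine-row dataset with DateTime as POSIXct), the IDs kept by this approach can be compared with the ones listed in the question:

kept <- df %>%
  mutate(by60 = cut(DateTime, "60 min")) %>%
  group_by(Species, by60) %>%
  slice(1) %>%
  ungroup()

all(kept$ID == c("P1", "P3", "P4", "P5", "P9"))   # should print TRUE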
Source: https://stackoverflow.com/questions/49017493/filter-rows-by-a-time-threshold