Question
I have a time series with multiple entries for some hours:
date wd ws temp sol octa pg mh daterep
1 2007-01-01 00:00:00 100 1.5 9.0 0 8 D 100 FALSE
2 2007-01-01 01:00:00 90 2.6 9.0 0 7 E 50 TRUE
3 2007-01-01 01:00:00 90 2.6 9.0 0 8 D 100 TRUE
4 2007-01-01 02:00:00 40 1.0 8.8 0 7 F 50 FALSE
5 2007-01-01 03:00:00 20 2.1 8.0 0 8 D 100 FALSE
6 2007-01-01 04:00:00 30 1.0 8.0 0 8 D 100 FALSE
I need to reduce this to a time series with one entry per hour, keeping the entry with the minimum mh value where an hour has multiple entries. (So in the data above, row 2 should be kept as my second entry and row 3 should be removed.) I've tried both approaches: extracting the rows I want into a new data frame, and removing the rows I don't want from the existing one, but I'm not getting anywhere. Thanks for your help.
Answer 1:
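You could sort your data by date and mh using plyr::arrange, then remove duplicates. Because duplicated() flags only the second and later occurrences of a value, the row kept for each date is the one with the smallest mh: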
df <- read.table(textConnection("
date wd ws temp sol octa pg mh daterep
'2007-01-01 00:00:00' 100 1.5 9.0 0 8 D 100 FALSE
'2007-01-01 01:00:00' 90 2.6 9.0 0 7 E 50 TRUE
'2007-01-01 01:00:00' 90 2.6 9.0 0 8 D 100 TRUE
'2007-01-01 02:00:00' 40 1.0 8.8 0 7 F 50 FALSE
'2007-01-01 03:00:00' 20 2.1 8.0 0 8 D 100 FALSE
'2007-01-01 04:00:00' 30 1.0 8.0 0 8 D 100 FALSE
"), header = TRUE)
library(plyr)
df <- arrange(df, date, mh)
df <- df[!duplicated(df$date), ]
df
# date wd ws temp sol octa pg mh daterep
# 1 2007-01-01 00:00:00 100 1.5 9.0 0 8 D 100 FALSE
# 2 2007-01-01 01:00:00 90 2.6 9.0 0 7 E 50 TRUE
# 4 2007-01-01 02:00:00 40 1.0 8.8 0 7 F 50 FALSE
# 5 2007-01-01 03:00:00 20 2.1 8.0 0 8 D 100 FALSE
# 6 2007-01-01 04:00:00 30 1.0 8.0 0 8 D 100 FALSE
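To see why the sort has to come before the duplicate removal, here is a minimal base R sketch on made-up toy data. The data frame x and its values are illustrative assumptions, not taken from the question, and order() stands in for plyr::arrange:

```r
# Toy data (hypothetical): two rows share the same hour,
# and the row with the larger mh happens to be listed first.
x <- data.frame(
  date = c("2007-01-01 01:00:00", "2007-01-01 01:00:00", "2007-01-01 02:00:00"),
  mh   = c(100, 50, 50),
  stringsAsFactors = FALSE
)

# Without sorting, duplicated() keeps whichever row comes first in the data,
# here the one with mh = 100.
unsorted <- x[!duplicated(x$date), ]

# Sorting by date, then mh, puts the minimum-mh row first within each hour,
# so that is the row that survives.
x_sorted <- x[order(x$date, x$mh), ]
deduped  <- x_sorted[!duplicated(x_sorted$date), ]
```

In this sketch, unsorted keeps the mh = 100 row for 01:00, while deduped keeps the mh = 50 row.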
Answer 2:
Similar to flodel's answer, but using base R and ensuring that date is a real date-time class (POSIXct):
d <- read.table(text = "
date wd ws temp sol octa pg mh daterep
'2007-01-01 00:00:00' 100 1.5 9.0 0 8 D 100 FALSE
'2007-01-01 01:00:00' 90 2.6 9.0 0 7 E 50 TRUE
'2007-01-01 01:00:00' 90 2.6 9.0 0 8 D 100 TRUE
'2007-01-01 02:00:00' 40 1.0 8.8 0 7 F 50 FALSE
'2007-01-01 03:00:00' 20 2.1 8.0 0 8 D 100 FALSE
'2007-01-01 04:00:00' 30 1.0 8.0 0 8 D 100 FALSE
", header = TRUE)
d$date <- as.POSIXct(d$date)
d <- d[order(d$date, d$mh), ]
d[!duplicated(d$date), ]
# date wd ws temp sol octa pg mh daterep
# 1 2007-01-01 00:00:00 100 1.5 9.0 0 8 D 100 FALSE
# 2 2007-01-01 01:00:00 90 2.6 9.0 0 7 E 50 TRUE
# 4 2007-01-01 02:00:00 40 1.0 8.8 0 7 F 50 FALSE
# 5 2007-01-01 03:00:00 20 2.1 8.0 0 8 D 100 FALSE
# 6 2007-01-01 04:00:00 30 1.0 8.0 0 8 D 100 FALSE
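If you want to confirm that the result really has one row per hour with the minimum mh, a quick check along these lines may help. This is only a sketch: the small data frame chk, the helper names, and the explicit tz = "UTC" are assumptions made for illustration, not part of the answer above.

```r
# Hypothetical cut-down version of the data, with date as POSIXct.
chk <- data.frame(
  date = as.POSIXct(c("2007-01-01 00:00:00", "2007-01-01 01:00:00",
                      "2007-01-01 01:00:00", "2007-01-01 02:00:00"),
                    tz = "UTC"),
  mh = c(100, 100, 50, 50)
)

# Same technique as above: sort by date and mh, then drop repeated dates.
chk <- chk[order(chk$date, chk$mh), ]
res <- chk[!duplicated(chk$date), ]

# Check 1: no date appears more than once in the result.
one_per_hour <- anyDuplicated(res$date) == 0

# Check 2: each kept mh equals the per-hour minimum of the full data.
# Dates are formatted as strings so they can be used as group labels.
fmt      <- "%Y-%m-%d %H:%M:%S"
hour_min <- tapply(chk$mh, format(chk$date, fmt, tz = "UTC"), min)
min_kept <- all(res$mh == unname(hour_min[format(res$date, fmt, tz = "UTC")]))
```

Both one_per_hour and min_kept should be TRUE if the deduplication behaved as intended.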
Source: https://stackoverflow.com/questions/10544128/removing-duplicate-dates-based-on-another-column-in-r