Removing duplicate dates based on another column in R

Submitted by 北城以北 on 2020-01-14 10:20:13

Question


I have a timeseries with multiple entries for some hours.

                 date  wd  ws temp sol octa pg  mh daterep
1 2007-01-01 00:00:00 100 1.5  9.0   0    8  D 100   FALSE
2 2007-01-01 01:00:00  90 2.6  9.0   0    7  E  50    TRUE
3 2007-01-01 01:00:00  90 2.6  9.0   0    8  D 100    TRUE
4 2007-01-01 02:00:00  40 1.0  8.8   0    7  F  50   FALSE
5 2007-01-01 03:00:00  20 2.1  8.0   0    8  D 100   FALSE
6 2007-01-01 04:00:00  30 1.0  8.0   0    8  D 100   FALSE

I need to get to a time series with one entry per hour, taking the entry with the minimum mh value where there are multiple entries. (So in the data above, row 2 should be kept as the entry for 01:00 and row 3 should be removed.) I've tried both approaches: picking out the rows I want into a new data frame, and removing the rows I don't want from the existing one, but I'm not getting anywhere. Thanks for your help.


Answer 1:


You could sort your data by date and mh using plyr::arrange, then remove duplicates:

df <- read.table(textConnection("

               date    wd  ws temp sol octa pg  mh daterep
'2007-01-01 00:00:00' 100 1.5  9.0   0    8  D 100   FALSE
'2007-01-01 01:00:00'  90 2.6  9.0   0    7  E  50    TRUE
'2007-01-01 01:00:00'  90 2.6  9.0   0    8  D 100    TRUE
'2007-01-01 02:00:00'  40 1.0  8.8   0    7  F  50   FALSE
'2007-01-01 03:00:00'  20 2.1  8.0   0    8  D 100   FALSE
'2007-01-01 04:00:00'  30 1.0  8.0   0    8  D 100   FALSE

"), header = TRUE)

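library(plyr)
# Sort by date, then by mh, so the smallest mh comes first within each date
df <- arrange(df, date, mh)
# duplicated() flags every repeat after the first, so the minimum-mh row is kept
df <- df[!duplicated(df$date), ]
df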
#                  date  wd  ws temp sol octa pg  mh daterep
# 1 2007-01-01 00:00:00 100 1.5  9.0   0    8  D 100   FALSE
# 2 2007-01-01 01:00:00  90 2.6  9.0   0    7  E  50    TRUE
# 4 2007-01-01 02:00:00  40 1.0  8.8   0    7  F  50   FALSE
# 5 2007-01-01 03:00:00  20 2.1  8.0   0    8  D 100   FALSE
# 6 2007-01-01 04:00:00  30 1.0  8.0   0    8  D 100   FALSE
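
The same sort-then-deduplicate idea can be wrapped in a small function for reuse. This is only a sketch in base R: the helper name `keep_min_per_key` and the toy data frame are hypothetical and not part of the original answer.

```r
# Hypothetical helper (not from the original answer): keep one row per key,
# choosing the row with the smallest value in another column.
keep_min_per_key <- function(data, key, value) {
  # Sort by key, then by value, so the minimum comes first within each key
  sorted <- data[order(data[[key]], data[[value]]), , drop = FALSE]
  # duplicated() is FALSE for the first occurrence of a key and TRUE afterwards
  sorted[!duplicated(sorted[[key]]), , drop = FALSE]
}

# Toy data (illustrative only): the 01:00 hour appears twice, larger mh first
toy <- data.frame(
  date = c("2007-01-01 01:00:00", "2007-01-01 00:00:00", "2007-01-01 01:00:00"),
  mh   = c(100, 100, 50),
  stringsAsFactors = FALSE
)

result <- keep_min_per_key(toy, "date", "mh")
result
```

Because the sort happens before `duplicated()` is applied, the input does not need to be ordered in advance. Note that the result comes back sorted by the key column rather than in the original row order.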



Answer 2:


Similar to flodel's answer, but using base R and ensuring that date is a real date-time class (POSIXct):

d <- read.table(text = "
               date    wd  ws temp sol octa pg  mh daterep
'2007-01-01 00:00:00' 100 1.5  9.0   0    8  D 100   FALSE
'2007-01-01 01:00:00'  90 2.6  9.0   0    7  E  50    TRUE
'2007-01-01 01:00:00'  90 2.6  9.0   0    8  D 100    TRUE
'2007-01-01 02:00:00'  40 1.0  8.8   0    7  F  50   FALSE
'2007-01-01 03:00:00'  20 2.1  8.0   0    8  D 100   FALSE
'2007-01-01 04:00:00'  30 1.0  8.0   0    8  D 100   FALSE
", header = TRUE)

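# Convert to POSIXct so dates are compared as date-times, not as strings
d$date <- as.POSIXct(d$date)

# Order by date, then mh, and keep the first (minimum-mh) row for each date
d <- d[order(d$date, d$mh), ]
d[!duplicated(d$date), ]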

                 date  wd  ws temp sol octa pg  mh daterep
1 2007-01-01 00:00:00 100 1.5  9.0   0    8  D 100   FALSE
2 2007-01-01 01:00:00  90 2.6  9.0   0    7  E  50    TRUE
4 2007-01-01 02:00:00  40 1.0  8.8   0    7  F  50   FALSE
5 2007-01-01 03:00:00  20 2.1  8.0   0    8  D 100   FALSE
6 2007-01-01 04:00:00  30 1.0  8.0   0    8  D 100   FALSE
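
If reordering the data frame is undesirable, one possible alternative is to rank the rows within each hour and keep the rank-1 row in place. The sketch below uses base R's `ave()` and `rank()`. The toy data and the variable names `rank_in_hour` and `kept` are illustrative assumptions, and `tz = "UTC"` is set explicitly here only to make the example independent of the local time zone.

```r
# Toy data (illustrative only): the 01:00 hour appears twice, larger mh first
d <- data.frame(
  date = as.POSIXct(c("2007-01-01 00:00:00", "2007-01-01 01:00:00",
                      "2007-01-01 01:00:00", "2007-01-01 02:00:00"),
                    tz = "UTC"),
  mh = c(100, 100, 50, 50)
)

# Group by a character key built from the date-time
hour_key <- format(d$date, "%Y-%m-%d %H:%M:%S")

# Rank mh within each hour; ties are broken by original row position
rank_in_hour <- ave(d$mh, hour_key,
                    FUN = function(x) rank(x, ties.method = "first"))

# Keep only the minimum-mh row of each hour, in the original row order
kept <- d[rank_in_hour == 1, ]
kept
```

Unlike the sort-based versions, this leaves the surviving rows in their original positions, which may matter if the data frame carries an ordering that is not captured by date and mh alone.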


Source: https://stackoverflow.com/questions/10544128/removing-duplicate-dates-based-on-another-column-in-r
