Calculate the mean between several columns of df2 that can vary according to the variable `var1` of df1 and add the value to a new variable in df1

Submitted by 匆匆过客 on 2019-12-06 14:53:11

This question poses two interesting challenges:

  1. The OP is asking for a "partial anti-join", i.e., the OP wants to aggregate current data in df2 where Datetime and Site match but the depth layer does not.
  2. The current data df2 are given in a look-up table where each value is associated with a depth range (depth layer) and a time range of 3 hours. So, the measured Depth and Datetime in df1 need to be mapped onto the respective ranges.

I have tried different approaches but ended up with the one below, which makes no assumptions about the aggregate function, so mean() can be called directly.
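The code below assumes the question's df1 and df2. For a self-contained run, they can be reconstructed from the sample values shown in the last answer (here the m* column names are set directly via check.names = FALSE):

```r
# Reconstruct the question's sample data (values as in the last answer below)
df1 <- data.frame(
  Datetime = as.POSIXct(c("2016-08-01 15:34:07", "2016-08-01 16:25:16",
                          "2016-08-01 17:29:16", "2016-08-01 18:33:16",
                          "2016-08-01 20:54:16", "2016-08-01 22:48:16"), tz = "UTC"),
  Site  = c("BD", "HG", "BD", "BD", "BD", "BD"),
  Ind   = c(16, 17, 19, 16, 17, 16),
  Depth = c(5.3, 24, 36.4, 42, NA, 22.1))

df2 <- data.frame(
  Datetime = as.POSIXct(c("2016-08-01 12:00:00", "2016-08-01 15:00:00",
                          "2016-08-01 18:00:00", "2016-08-01 21:00:00",
                          "2016-08-02 00:00:00"), tz = "UTC"),
  Site     = "BD",
  "m0-7"   = c(2.75, 4.00, 6.75, 2.25, 4.3),
  "m8-15"  = c(3.00, 4.00, 4.75, 3.00, 2.1),
  "m16-23" = c(2.75, 4.00, 5.75, 2.25, 1.4),
  "m24-31" = c(3.25, 3.00, 6.50, 2.75, 3.4),
  "m32-39" = c(3.00, 4.00, 4.75, 3.00, 1.7),
  check.names = FALSE)
```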

library(data.table)
library(magrittr)

# reshape df2 from wide to long format
currents <- melt(setDT(df2), id.vars = c("Datetime", "Site"),
                 variable.name = "layer", value.name = "current")

# create columns to join on
labels <- names(df2) %>% stringr::str_subset("^m")
breaks <- c(seq(0, 32, 8), Inf)
setDT(df1)[, layer := cut(Depth, breaks = breaks, labels = labels)]
df1[, current.dt := df2[df1, on = .(Site, Datetime), 
                      roll = "nearest", x.Datetime]]

# "partial anti-join" to compute mean of other layers
currents_other_layers <- 
  currents[df1, on = .(Site, Datetime = current.dt)][
    layer != i.layer, mean(current), by = .(i.Datetime, Site)]

# append result column
df1[currents_other_layers, on = .(Site, Datetime = i.Datetime), current.mean := i.V1]
df1
               Datetime Site Ind Depth  layer          current.dt current.mean
1: 2016-08-01 15:34:07   BD  16   5.3   m0-7 2016-08-01 15:00:00       3.7500
2: 2016-08-01 16:25:16   HG  17  24.0 m16-23                <NA>           NA
3: 2016-08-01 17:29:16   BD  19  36.4 m32-39 2016-08-01 18:00:00       5.9375
4: 2016-08-01 18:33:16   BD  16  42.0 m32-39 2016-08-01 18:00:00       5.9375
5: 2016-08-01 20:54:16   BD  17    NA   <NA> 2016-08-01 21:00:00           NA
6: 2016-08-01 22:48:16   BD  16  22.1 m16-23 2016-08-02 00:00:00       2.8750

This reproduces the OP's expected result.

Explanation

df2 is reshaped from wide to long format. This allows for joining / anti-joining on the layer column.

currents
               Datetime Site  layer current
 1: 2016-08-01 12:00:00   BD   m0-7    2.75
 2: 2016-08-01 15:00:00   BD   m0-7    4.00
 3: 2016-08-01 18:00:00   BD   m0-7    6.75
 4: 2016-08-01 21:00:00   BD   m0-7    2.25
 5: 2016-08-02 00:00:00   BD   m0-7    4.30
 6: 2016-08-01 12:00:00   BD  m8-15    3.00
 7: 2016-08-01 15:00:00   BD  m8-15    4.00
 8: 2016-08-01 18:00:00   BD  m8-15    4.75
 9: 2016-08-01 21:00:00   BD  m8-15    3.00
10: 2016-08-02 00:00:00   BD  m8-15    2.10
11: 2016-08-01 12:00:00   BD m16-23    2.75
12: 2016-08-01 15:00:00   BD m16-23    4.00
13: 2016-08-01 18:00:00   BD m16-23    5.75
14: 2016-08-01 21:00:00   BD m16-23    2.25
15: 2016-08-02 00:00:00   BD m16-23    1.40
16: 2016-08-01 12:00:00   BD m24-31    3.25
17: 2016-08-01 15:00:00   BD m24-31    3.00
18: 2016-08-01 18:00:00   BD m24-31    6.50
19: 2016-08-01 21:00:00   BD m24-31    2.75
20: 2016-08-02 00:00:00   BD m24-31    3.40
21: 2016-08-01 12:00:00   BD m32-39    3.00
22: 2016-08-01 15:00:00   BD m32-39    4.00
23: 2016-08-01 18:00:00   BD m32-39    4.75
24: 2016-08-01 21:00:00   BD m32-39    3.00
25: 2016-08-02 00:00:00   BD m32-39    1.70
               Datetime Site  layer current

Now, df1 has to be amended to include columns which correspond to layer and Datetime in currents.

For Depth, the cut() function is used. The last layer level m32-39 is extended to Inf so that all depths greater than 32 m are included in this level, as requested by the OP.
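For instance, applying cut() to the depths from df1 with the same breaks and labels as above:

```r
breaks <- c(seq(0, 32, 8), Inf)   # 0, 8, 16, 24, 32, Inf; intervals are (lo, hi]
labels <- c("m0-7", "m8-15", "m16-23", "m24-31", "m32-39")
lyr <- cut(c(5.3, 24, 36.4, 42, NA, 22.1), breaks = breaks, labels = labels)
as.character(lyr)
# [1] "m0-7"   "m16-23" "m32-39" "m32-39" NA       "m16-23"
```

Note that 24 falls in the interval (16, 24] and so is labelled m16-23, matching the HG row above.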

For Datetime, a rolling join to the nearest Datetime in df2 is used. This is possible because df2$Datetime denotes the mid-point of the 3 hour time range.
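A minimal sketch of roll = "nearest", using hypothetical numeric times (hours) in place of POSIXct:

```r
library(data.table)
ref <- data.table(t = c(12, 15, 18, 21, 24))   # 3-hour grid mid-points
qry <- data.table(t = c(15.57, 17.49, 22.80))  # observation times
res <- ref[qry, on = .(t), roll = "nearest", x.t]  # x.t = matched grid point
res
# [1] 15 18 24
```

Each observation is matched to the closest grid mid-point, just as each df1$Datetime is matched to the closest df2$Datetime above.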

After df1 has been prepared, we can do the "partial anti-join". Unfortunately, data.table's non-equi joins do not accept the != operator, so we cannot write

currents[df1, on = .(Datetime = current.dt, Site, layer != layer)]

directly but have to use a work-around where we first pick the rows where we expect matches and then do an anti-join:

 currents[df1, on = .(Datetime = current.dt, Site)][
    !df1, on = .(Datetime = current.dt, Site, layer)]
               Datetime Site  layer current          i.Datetime Ind Depth i.layer
 1: 2016-08-01 15:00:00   BD  m8-15    4.00 2016-08-01 15:34:07  16   5.3    m0-7
 2: 2016-08-01 15:00:00   BD m16-23    4.00 2016-08-01 15:34:07  16   5.3    m0-7
 3: 2016-08-01 15:00:00   BD m24-31    3.00 2016-08-01 15:34:07  16   5.3    m0-7
 4: 2016-08-01 15:00:00   BD m32-39    4.00 2016-08-01 15:34:07  16   5.3    m0-7
 5: 2016-08-01 18:00:00   BD   m0-7    6.75 2016-08-01 17:29:16  19  36.4  m32-39
 6: 2016-08-01 18:00:00   BD  m8-15    4.75 2016-08-01 17:29:16  19  36.4  m32-39
 7: 2016-08-01 18:00:00   BD m16-23    5.75 2016-08-01 17:29:16  19  36.4  m32-39
 8: 2016-08-01 18:00:00   BD m24-31    6.50 2016-08-01 17:29:16  19  36.4  m32-39
 9: 2016-08-01 18:00:00   BD   m0-7    6.75 2016-08-01 18:33:16  16  42.0  m32-39
10: 2016-08-01 18:00:00   BD  m8-15    4.75 2016-08-01 18:33:16  16  42.0  m32-39
11: 2016-08-01 18:00:00   BD m16-23    5.75 2016-08-01 18:33:16  16  42.0  m32-39
12: 2016-08-01 18:00:00   BD m24-31    6.50 2016-08-01 18:33:16  16  42.0  m32-39
13: 2016-08-01 21:00:00   BD   m0-7    2.25 2016-08-01 20:54:16  17    NA    <NA>
14: 2016-08-01 21:00:00   BD  m8-15    3.00 2016-08-01 20:54:16  17    NA    <NA>
15: 2016-08-01 21:00:00   BD m16-23    2.25 2016-08-01 20:54:16  17    NA    <NA>
16: 2016-08-01 21:00:00   BD m24-31    2.75 2016-08-01 20:54:16  17    NA    <NA>
17: 2016-08-01 21:00:00   BD m32-39    3.00 2016-08-01 20:54:16  17    NA    <NA>
18: 2016-08-02 00:00:00   BD   m0-7    4.30 2016-08-01 22:48:16  16  22.1  m16-23
19: 2016-08-02 00:00:00   BD  m8-15    2.10 2016-08-01 22:48:16  16  22.1  m16-23
20: 2016-08-02 00:00:00   BD m24-31    3.40 2016-08-01 22:48:16  16  22.1  m16-23
21: 2016-08-02 00:00:00   BD m32-39    1.70 2016-08-01 22:48:16  16  22.1  m16-23
22:                <NA>   HG   <NA>      NA 2016-08-01 16:25:16  17  24.0  m16-23
               Datetime Site  layer current          i.Datetime Ind Depth i.layer
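The `!` prefix on i in the second join is data.table's anti-join: it keeps only the rows of x that have no match in i. A minimal sketch:

```r
library(data.table)
X <- data.table(id = c(1L, 2L, 3L), v = c("a", "b", "c"))
Y <- data.table(id = 2L)
res <- X[!Y, on = .(id)]   # rows of X whose id does not occur in Y
res$id
# [1] 1 3
```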

This can be aggregated as desired with an arbitrary aggregation function (no need to manually select and combine individual columns):

currents_other_layers <- 
  currents[df1, on = .(Datetime = current.dt, Site)][
    !df1, on = .(Datetime = current.dt, Site, layer)][
      !is.na(Depth), mean(current), by = .(i.Datetime, Site)]

currents_other_layers
            i.Datetime Site     V1
1: 2016-08-01 15:34:07   BD 3.7500
2: 2016-08-01 17:29:16   BD 5.9375
3: 2016-08-01 18:33:16   BD 5.9375
4: 2016-08-01 22:48:16   BD 2.8750
5: 2016-08-01 16:25:16   HG     NA

This result contains the average current of all other layers except the observed layer. Note that grouping is by i.Datetime (which refers to df1$Datetime) and Site. Rows where Depth is missing in df1 are omitted to match the OP's expected result.

A final update join appends the result column to df1.
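An update join of the form x[i, on = ..., col := i.val] adds (or overwrites) a column in x by reference, filling only the matching rows; a hypothetical mini example:

```r
library(data.table)
dt <- data.table(id = c(1L, 2L, 3L))
lk <- data.table(id = c(2L, 3L), val = c(10, 20))
dt[lk, on = .(id), val := i.val]   # unmatched rows (id == 1) get NA
dt$val
# [1] NA 10 20
```

This is exactly why the HG row of df1, which has no aggregate in currents_other_layers, ends up with current.mean = NA.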

I'd approach this in two steps:

  1. Make a lookup table with avg_speed_elsewhere for each Datetime, Site, and Depth in df2.
  2. Join to df1.

Here's a lookup table:

library(tidyverse)
df2_long <- df2 %>%
  gather(depth_rng, speed, `m0-7`:`m32-39`) %>%
  separate(depth_rng, c("min_depth", "max_depth")) %>%
  mutate_at(vars(matches("depth")), parse_number) %>%
  # EDIT -- added to make deep category cover >39 too
  mutate(max_depth = if_else(max_depth == 39, 10000, max_depth)) %>%
  group_by(Datetime, Site) %>%
  # Avg Speed elsewhere is the sum of all speeds, minus this speed, all divided by 4.
  mutate(avg_speed_elsewhere = (sum(speed) - speed) / 4)

> df2_long
# A tibble: 25 x 6
# Groups:   Datetime, Site [5]
   Datetime            Site  min_depth max_depth speed avg_speed_elsewhere
   <dttm>              <fct>     <dbl>     <dbl> <dbl>               <dbl>
 1 2016-08-18 12:00:00 BD            0         7  2.75                3   
 2 2016-08-18 15:00:00 BD            0         7  4                   3.75
 3 2016-08-18 18:00:00 BD            0         7  6.75                5.44
 4 2016-08-18 21:00:00 BD            0         7  2.25                2.75
 5 2016-08-19 00:00:00 BD            0         7  4.3                 2.15
 6 2016-08-18 12:00:00 BD            8        15  3                   2.94
 7 2016-08-18 15:00:00 BD            8        15  4                   3.75
 8 2016-08-18 18:00:00 BD            8        15  4.75                5.94
 9 2016-08-18 21:00:00 BD            8        15  3                   2.56
10 2016-08-19 00:00:00 BD            8        15  2.1                 2.7 
# ... with 15 more rows
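The (sum(speed) - speed) / 4 step is the generic leave-one-out mean, (sum(x) - x) / (length(x) - 1); a quick check against the five 12:00 currents at site BD:

```r
x <- c(2.75, 3, 2.75, 3.25, 3)         # the five 12:00 currents at BD
loo <- (sum(x) - x) / (length(x) - 1)  # leave-one-out means
loo
# [1] 3.0000 2.9375 3.0000 2.8750 2.9375
stopifnot(all.equal(loo, sapply(seq_along(x), function(i) mean(x[-i]))))
```

The first two values (3 and 2.9375) match rows 1 and 6 of the tibble above.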

I expect this will work, but your provided data doesn't overlap so I'm not sure:

df1 %>%
  # EDIT - replaced floor_date with round_date
  mutate(Datetime_3hr = lubridate::round_date(Datetime, "3 hour")) %>%
  left_join(df2_long, by = c("Site", "Datetime_3hr" = "Datetime")) %>%
  filter((Depth >= min_depth & Depth < max_depth + 1) | is.na(Depth))

Using data.table you can do a rolling join between your two tables to associate your depth variables with your current variables even though the times don't match. A rolling join matches each row of one table to the row of the other with the closest time (according to your chosen options). I changed your data a bit so that the days match.

library(data.table)

df1 <- data.frame(
  Datetime = c("2016-08-01 15:34:07", "2016-08-01 16:25:16", "2016-08-01 17:29:16",
               "2016-08-01 18:33:16", "2016-08-01 20:54:16", "2016-08-01 22:48:16"),
  Site  = c("BD", "HG", "BD", "BD", "BD", "BD"),
  Ind   = c(16, 17, 19, 16, 17, 16),
  Depth = c(5.3, 24, 36.4, 42, NA, 22.1))
df1$Datetime<-as.POSIXct(df1$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")

df2 <- data.frame(
  Datetime = c("2016-08-01 12:00:00", "2016-08-01 15:00:00", "2016-08-01 18:00:00",
               "2016-08-01 21:00:00", "2016-08-02 00:00:00"),
  Site = c("BD", "BD", "BD", "BD", "BD"),
  var1 = c(2.75, 4, 6.75, 2.25, 4.3),
  var2 = c(3, 4, 4.75, 3, 2.1),
  var3 = c(2.75, 4, 5.75, 2.25, 1.4),
  var4 = c(3.25, 3, 6.5, 2.75, 3.4),
  var5 = c(3, 4, 4.75, 3, 1.7))
df2$Datetime<-as.POSIXct(df2$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")
colnames(df2)<-c("Datetime","Site","m0-7","m8-15","m16-23","m24-31","m32-39")

setDT(df1)
setDT(df2)

setkey(df1, Site, Datetime)
setkey(df2, Site, Datetime)

df_merge = df2[df1, roll = Inf]
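Note that roll = Inf carries the last (earlier) df2 observation forward to each df1 time, whereas the first answer used roll = "nearest"; that is why 22:48 matches 21:00 here but 00:00 there, and the last row's mean differs between the two answers. A minimal contrast with hypothetical numeric times:

```r
library(data.table)
ref <- data.table(t = c(21, 24))
qry <- data.table(t = 22.8)
locf <- ref[qry, on = .(t), roll = Inf, x.t]        # last observation carried forward: 21
near <- ref[qry, on = .(t), roll = "nearest", x.t]  # nearest neighbour: 24
```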

Then I use dplyr's case_when() to compute the mean current over the other depth layers:

library(dplyr)

df_merge[, current_elsewhere := case_when(
  is.na(Depth) ~ NA_real_,
  Depth < 7 ~ (`m8-15` + `m16-23` + `m24-31` + `m32-39`)/4,
  Depth < 15 ~ (`m0-7` + `m16-23` + `m24-31` + `m32-39`)/4,
  Depth < 23 ~ (`m0-7` + `m8-15` + `m24-31` + `m32-39`)/4,
  Depth < 31 ~ (`m0-7` + `m8-15` + `m16-23` + `m32-39`)/4,
  T ~ (`m0-7` + `m8-15` + `m16-23` + `m24-31`)/4)]

df_merge
              Datetime Site m0-7 m8-15 m16-23 m24-31 m32-39 Ind Depth current_elsewhere
1: 2016-08-01 15:34:07   BD 4.00  4.00   4.00   3.00   4.00  16   5.3            3.7500
2: 2016-08-01 17:29:16   BD 4.00  4.00   4.00   3.00   4.00  19  36.4            3.7500
3: 2016-08-01 18:33:16   BD 6.75  4.75   5.75   6.50   4.75  16  42.0            5.9375
4: 2016-08-01 20:54:16   BD 6.75  4.75   5.75   6.50   4.75  17    NA                NA
5: 2016-08-01 22:48:16   BD 2.25  3.00   2.25   2.75   3.00  16  22.1            2.7500
6: 2016-08-01 16:25:16   HG   NA    NA     NA     NA     NA  17  24.0                NA