Assign intermediate output to temp variable as part of dplyr pipeline

情话喂你 2020-11-30 12:10

Q: In an R dplyr pipeline, how can I assign some intermediate output to a temp variable for use further down the pipeline?

My approach below works. But it assigns in

5 Answers
  • 2020-11-30 12:43

    pipeR is a package that extends the capabilities of the pipe without adding different pipes (as magrittr does). To assign, pass the variable name prefixed with ~ and wrapped in parentheses, e.g. (~tmp), as a step in your pipeline:

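    library(dplyr)
    library(pipeR)
    
    # same example data as used in the other answers
    df <- data.frame(a = LETTERS[1:3], b = 1:3)
    
    df %>>%
      filter(b < 3) %>>%
      (~tmp) %>>%              # side effect: save the filtered rows as tmp
      mutate(b = b*2) %>>%
      bind_rows(tmp)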
    ##   a b
    ## 1 A 2
    ## 2 B 4
    ## 3 A 1
    ## 4 B 2
    
    tmp
    ##   a b
    ## 1 A 1
    ## 2 B 2
    

    While the syntax is not terribly descriptive, pipeR is very well documented.

  • 2020-11-30 12:51

    I was interested in this question for the sake of debugging: I wanted to save intermediate results so that I could inspect and manipulate them from the console without having to split the pipeline into two pieces, which is cumbersome. So, for my purposes, the only problem with the OP's original solution was that it was slightly verbose.

    This can be fixed by defining a helper function:

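    to_var <- function(., ..., env = .GlobalEnv) {
      # quos() and quo_name() are rlang functions (available once dplyr is attached);
      # together they turn the first unquoted argument into a string
      var_name <- quo_name(quos(...)[[1]])
      assign(var_name, ., envir = env)
      .  # return the input unchanged so the pipeline can continue
    }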
    

    Which can then be used as follows:

    df <- data.frame(a = LETTERS[1:3], b=1:3)
    df %>%
      filter(b < 3) %>%
      to_var(tmp) %>%
      mutate(b = b*2) %>%
      bind_rows(tmp)
    # tmp still exists here
    

    That still uses the global environment, but you can also explicitly pass a more local environment as in the following example:

    f <- function() {
        df <- data.frame(a = LETTERS[1:3], b=1:3)
        env = environment()
        df %>%
          filter(b < 3) %>%
          to_var(tmp, env=env) %>%
          mutate(b = b*2) %>%
          bind_rows(tmp)
    }
    f()
    # tmp does not exist here
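
    The name capture inside to_var can also be sketched with base R alone, which may help when rlang is not attached. This is only an illustration: to_var_base and the scratch environment are hypothetical names, not part of the helper above.

    ```r
    # Hypothetical base-R variant of to_var(); a sketch, not the original helper.
    to_var_base <- function(value, name, env = globalenv()) {
      # substitute() captures the unevaluated argument; deparse() turns it into a string
      var_name <- deparse(substitute(name))
      assign(var_name, value, envir = env)
      value  # return the input unchanged so a pipeline can continue
    }

    # Store into a throwaway environment instead of the global one
    scratch <- new.env()
    df <- data.frame(a = LETTERS[1:3], b = 1:3)
    out <- to_var_base(df[df$b < 3, ], tmp, env = scratch)

    ls(scratch)                  # "tmp"
    identical(out, scratch$tmp)  # TRUE
    ```

    Because the value is the first argument, this variant should slot into a pipeline the same way to_var does, for example as a to_var_base(tmp, env = scratch) step.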
    

    The problem with the accepted solution is that it didn't seem to work out of the box with tidyverse pipes. G. Grothendieck's solution doesn't work for the debugging use case at all. (update: see G. Grothendieck's comment below and his updated answer!)

    Finally, assign("tmp", .) %>% doesn't work because the default 'envir' argument of assign() is the "current environment" (see the documentation for assign), which is different at each stage of the pipeline. To see this, try inserting { print(environment()); . } %>% into the pipeline at various points: a different address is printed each time. (It is probably possible to tweak the definition of to_var so that the default is the grandparent environment instead.)
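
    The effect of assign()'s default 'envir' can be seen without any pipe. In this minimal base-R sketch (assign_default, assign_explicit and target are hypothetical names), the first helper writes into its own call frame, so the variable is not visible to the caller, while the second writes into an environment supplied by the caller. It assumes no variable named tmp_default already exists where the script is run.

    ```r
    caller_env <- environment()  # wherever this script is run from
    target <- new.env()

    assign_default <- function(value) {
      assign("tmp_default", value)             # default envir: this function's own frame
      exists("tmp_default", inherits = FALSE)  # visible here, inside the call
    }

    assign_explicit <- function(value, env) {
      assign("tmp_explicit", value, envir = env)  # the caller chooses the destination
      value
    }

    seen_inside <- assign_default(1:3)
    kept <- assign_explicit(1:3, env = target)

    seen_inside                                                  # TRUE
    exists("tmp_default", envir = caller_env, inherits = FALSE)  # FALSE: it lived only in the function's frame
    exists("tmp_explicit", envir = target, inherits = FALSE)     # TRUE
    ```

    This is why to_var takes an explicit env argument instead of relying on the default.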

  • 2020-11-30 12:54

    You can generate the desired object at the location in the pipeline where it's needed. For example:

    df %>% filter(b < 3) %>% mutate(b = b*2) %>%
      bind_rows(df %>% filter(b < 3))
    

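    This method avoids having to filter twice (note that the unmodified rows come first here; swap the two arguments of bind_rows() to get the same row order as above):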

    df %>%
      filter(b < 3) %>%
      bind_rows(., mutate(., b = b*2))
    
  • 2020-11-30 13:05

    This does not create an object in the global environment:

    df %>% 
       filter(b < 3) %>% 
       { 
         { . -> tmp } %>% 
         mutate(b = b*2) %>% 
         bind_rows(tmp) 
       }
    

    For debugging, you can either use . ->> tmp instead of . -> tmp, so that tmp is still available after the pipeline finishes, or insert this step into the pipeline:

    { browser(); . } %>% 
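
    The scoping idea behind the nested braces, keeping tmp confined to a block, can be illustrated without a pipe using base R's local(). This is a swapped-in technique for illustration only, with subset(), transform() and rbind() standing in for the dplyr verbs; it assumes no variable named tmp already exists in the calling environment.

    ```r
    df <- data.frame(a = LETTERS[1:3], b = 1:3)

    result <- local({
      tmp <- subset(df, b < 3)               # intermediate result, visible only in this block
      rbind(transform(tmp, b = b * 2), tmp)  # doubled rows first, then the originals
    })

    result$b                         # 2 4 1 2
    exists("tmp", inherits = FALSE)  # FALSE: tmp did not leak out of local()
    ```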
    
  • 2020-11-30 13:06

    I often find the need to save an intermediate product in a pipeline. While my use case is typically to avoid duplicating filters for later splitting, manipulation and reassembly, the technique can work well here:

    df %>%
      filter(b < 3) %>%
      {. ->> intermediateResult} %>%  # save the intermediate result
      mutate(b = b*2) %>%
      bind_rows(intermediateResult)    
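
    One thing to keep in mind with ->> is where the variable ends up: the operator searches the enclosing environments for an existing binding and only falls back to the global environment if none is found. A minimal base-R sketch of that rule (outer_fn, inner_fn and saved are hypothetical names):

    ```r
    outer_fn <- function() {
      saved <- NULL          # existing binding in outer_fn's frame
      inner_fn <- function(x) {
        x ->> saved          # rebinds outer_fn's `saved`, not a global variable
        x
      }
      inner_fn(42)
      saved
    }

    outer_fn()                                              # 42
    exists("saved", envir = globalenv(), inherits = FALSE)  # FALSE, assuming no global `saved` existed before
    ```

    So when the pipeline runs inside a function, {. ->> intermediateResult} may update a variable in an enclosing scope rather than create a global one.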
    