Using pmap with a to apply different regular expressions to different variables in a tibble?

Question

This question is very similar to "Using pmap to apply different regular expressions to different variables in a tibble?", but differs because I realized my examples there were not sufficient to describe my problem.

I'm trying to apply different regular expressions to different variables in a tibble. For example, I've made a tibble listing 1) the variable name I want to modify, 2) the regex I want to match, and 3) the replacement string. I'd like to apply the regex/replacement to the variable in a different data frame. Note that there may be variables in the target tibble that I don't want to modify, and the row order in my "configuration" tibble may not correspond to the column/variable order in my "target" tibble.

So my "configuration" tibble could look like this:

test_config <-  dplyr::tibble(
  string_col = c("col1", "col2", "col4", "col3"),
  pattern = c("^\\.$", "^NA$", "^$", "^NULL$"),
  replacement = c("","","", "")
)

I'd like to apply this to a target tibble:

test_target <- dplyr::tibble(
  col1 = c("Foo", "bar", ".", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "NA", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", "NULL"),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)

So the goal is to replace a column-specific pattern with an empty string in the user-specified columns/variables of test_target, leaving every other column untouched.

The result should be like this:

result <- dplyr::tibble(
  col1 = c("Foo", "bar", "", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", ""),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)

I can do what I want with a for loop, like this:

for (i in seq(nrow(test_config))) {
  test_target <- dplyr::mutate_at(test_target,
                   .vars = dplyr::vars(
                     tidyselect::matches(test_config$string_col[[i]])),
                   .funs = dplyr::funs(
                     stringr::str_replace_all(
                       ., test_config$pattern[[i]], 
                       test_config$replacement[[i]]))
  )
}

Instead, is there a more tidy way to do what I want? So far, thinking that purrr::pmap was the tool for the job, I've made a function that takes a data frame, variable name, regular expression, and replacement value and returns the data frame with a single variable modified. It behaves as expected:

testFun <- function(df, colName, regex, repVal){
  colName <- dplyr::enquo(colName)
  df <- dplyr::mutate_at(df,
                         .vars = dplyr::vars(
                           tidyselect::matches(!!colName)),
                         .funs = dplyr::funs(
                           stringr::str_replace_all(., regex, repVal))
  )
}

# try with example
out <- testFun(test_target, 
               test_config$string_col[[1]], 
               test_config$pattern[[1]], 
               "")

However, when I try to use that function with pmap, I run into a couple problems: 1) is there a better way to build the list for the pmap call than this?

purrr::pmap(
    list(test_target, 
         test_config$string_col, 
         test_config$pattern, 
         test_config$replacement),
    testFun
)

2) When I call pmap, I get an error:

Error: Element 2 has length 4, not 1 or 5.

So pmap isn't happy that I'm trying to pass a tibble of length 5 as an element of a list whose other elements are of length 4 (I thought it would recycle the tibble).
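
To illustrate where the two lengths in that message come from, here is a minimal sketch on toy data (the names `toy` and `labels` are made up for illustration and are not the tibbles above). A tibble is a list of its columns, so pmap sees its column count as its length; wrapping it in list() gives a length-1 element, which pmap does recycle.

```r
library(tibble)
library(purrr)

# Hypothetical toy data: 3 columns, 2 rows
toy <- tibble(a = 1:2, b = 3:4, c = 5:6)
labels <- c("x", "y")

# A tibble is a list of columns, so its length is the column count
n_elements <- length(toy)

# list(toy) has length 1, so pmap recycles it against the 2 labels
# and hands the whole tibble to the function on every call
rows_seen <- pmap_int(list(list(toy), labels), function(df, label) nrow(df))
```

Here `n_elements` is the number of columns, and `rows_seen` holds one row count per label, because the function received the full tibble each time.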

Note also that previously, when I called pmap with a 4-row tibble, I got a different error,

Error in UseMethod("tbl_vars") : 
  no applicable method for 'tbl_vars' applied to an object of class "character"
Called from: tbl_vars(tbl)

Can any of you suggest a way to use pmap to do what I want, or is there a different or better tidyverse approach to the problem?

Thanks!

Answer 1:

Here are two tidyverse ways. One is similar to the data.table answer, in that it involves reshaping the data, joining it with the configs, and reshaping back to wide. The other is purrr-based and, in my opinion, a little bit of a weird approach. I'd recommend the first, since it feels more intuitive.

Use tidyr::gather to make the data long-shaped, then dplyr::left_join so that every text value from test_target has a corresponding pattern and replacement. Because it is a left join, the columns without patterns (col5) are retained, with NA for pattern and replacement.

library(tidyverse)
...

test_target %>%
  gather(key = col, value = text) %>%
  left_join(test_config, by = c("col" = "string_col"))
#> # A tibble: 25 x 4
#>    col   text  pattern replacement
#>    <chr> <chr> <chr>   <chr>      
#>  1 col1  Foo   "^\\.$" ""         
#>  2 col1  bar   "^\\.$" ""         
#>  3 col1  .     "^\\.$" ""         
#>  4 col1  NA    "^\\.$" ""         
#>  5 col1  NULL  "^\\.$" ""         
#>  6 col2  Foo   ^NA$    ""         
#>  7 col2  bar   ^NA$    ""         
#>  8 col2  .     ^NA$    ""         
#>  9 col2  NA    ^NA$    ""         
#> 10 col2  NULL  ^NA$    ""         
#> # ... with 15 more rows

Use ifelse to apply the replacement where a pattern exists, or keep the original text where it doesn't. Keep just the necessary columns, add a row number within each column because spread needs unique IDs, and make the data wide again.

test_target %>%
  gather(key = col, value = text) %>%
  left_join(test_config, by = c("col" = "string_col")) %>% 
  mutate(new_text = ifelse(is.na(pattern), text, str_replace(text, pattern, replacement))) %>%
  select(col, new_text) %>%
  group_by(col) %>%
  mutate(row = row_number()) %>%
  spread(key = col, value = new_text) %>%
  select(-row)
#> # A tibble: 5 x 5
#>   col1  col2  col3  col4  col5    
#>   <chr> <chr> <chr> <chr> <chr>   
#> 1 Foo   Foo   Foo   NULL  I       
#> 2 bar   bar   bar   NA    am      
#> 3 ""    .     .     Foo   not     
#> 4 NA    ""    NA    .     changing
#> 5 NULL  NULL  ""    bar   .
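
tidyr 1.0.0 and later mark gather() and spread() as superseded in favour of pivot_longer() and pivot_wider(). As a rough sketch of the same reshape, join, reshape steps with those functions (this assumes a newer tidyr than the answer used, and `reshaped` is just an illustrative name), adding the row id before reshaping replaces the group_by/row_number step:

```r
library(dplyr)
library(tidyr)
library(stringr)

test_config <- tibble(
  string_col = c("col1", "col2", "col4", "col3"),
  pattern = c("^\\.$", "^NA$", "^$", "^NULL$"),
  replacement = c("", "", "", "")
)
test_target <- tibble(
  col1 = c("Foo", "bar", ".", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "NA", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", "NULL"),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)

reshaped <- test_target %>%
  mutate(row = row_number()) %>%                       # id for reshaping back
  pivot_longer(cols = -row, names_to = "col", values_to = "text") %>%
  left_join(test_config, by = c("col" = "string_col")) %>%
  # keep the text where no pattern was joined, otherwise replace
  mutate(text = if_else(is.na(pattern), text,
                        str_replace(text, pattern, replacement))) %>%
  select(row, col, text) %>%
  pivot_wider(names_from = col, values_from = text) %>%
  select(-row)
```

The result has the same five columns as test_target, in the same order.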

The second way is to make a tiny tibble of just the column names, join that with the configs, and split into a list of lists. Then purrr::map2_dfc maps over both this list you've created and the columns of test_target, and returns a data frame by cbinding. The reason this works is that data frames are technically lists of columns, so if you map over a data frame, you're treating each column like a list item. Note that it relies on split() returning its groups sorted by name, which here happens to match the column order of test_target. I couldn't get an ifelse to work right here: something in the logic had only single strings coming back instead of the whole vector.

tibble(all_cols = names(test_target)) %>%
  left_join(test_config, by = c("all_cols" = "string_col")) %>%
  split(.$all_cols) %>%
  map(as.list) %>%
  map2_dfc(test_target, function(info, text) {
    if (is.na(info$pattern)) {
      text
    } else {
      str_replace(text, info$pattern, info$replacement)
    }
  })
#> # A tibble: 5 x 5
#>   col1  col2  col3  col4  col5    
#>   <chr> <chr> <chr> <chr> <chr>   
#> 1 Foo   Foo   Foo   NULL  I       
#> 2 bar   bar   bar   NA    am      
#> 3 ""    .     .     Foo   not     
#> 4 NA    ""    NA    .     changing
#> 5 NULL  NULL  ""    bar   .

Created on 2018-10-30 by the reprex package (v0.2.1)
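
Two details of the purrr approach can be sketched separately. First, ifelse() returns a value shaped like its test, so a length-1 test such as is.na(info$pattern) gives back a single string, which is a likely reason it did not work there. Second, looking the rule up by column name removes the dependence on list order. The following is only a sketch under those assumptions; `first_only`, `config_by_col` and `by_name` are illustrative names, and it assumes at most one rule per column.

```r
library(dplyr)
library(purrr)
library(stringr)

test_config <- tibble(
  string_col = c("col1", "col2", "col4", "col3"),
  pattern = c("^\\.$", "^NA$", "^$", "^NULL$"),
  replacement = c("", "", "", "")
)
test_target <- tibble(
  col1 = c("Foo", "bar", ".", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "NA", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", "NULL"),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)

# ifelse() returns something shaped like its test, so a length-1 test
# yields a length-1 result even when the chosen branch is a longer vector
first_only <- ifelse(FALSE, "unused", c("a", "b", "c"))

# Index the configuration by column name so lookups don't depend on order
config_by_col <- split(test_config, test_config$string_col)

by_name <- test_target %>%
  imap(function(text, col_name) {
    rule <- config_by_col[[col_name]]
    if (is.null(rule)) {
      text   # no rule for this column: leave it untouched
    } else {
      str_replace(text, rule$pattern, rule$replacement)
    }
  }) %>%
  as_tibble()
```

Because imap() passes each column together with its name, col5 simply finds no rule and is returned unchanged.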

Answer 2:

I'm not experienced with purrr and dplyr, but here is an approach with data.table. The approach can be ported to dplyr with a bit of googling :)

In terms of interpretability, the approach with the loop is arguably better, as it's simpler.

Edit: I pushed some changes to the code; it wasn't using purrr in the end.

# alternative with data.table
library(data.table)
library(dplyr)

# objects
test_config <-  dplyr::tibble(
  string_col = c("col1", "col2", "col4", "col3"),
  pattern = c("^\\.$", "^NA$", "^$", "^NULL$"),
  replacement = c("","","", "")
)
test_target <- dplyr::tibble(
  col1 = c("Foo", "bar", ".", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "NA", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", "NULL"),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)

multiColStringReplace <- function(test_target, test_config){

  # data.table conversion
  test_target <- as.data.table(test_target)
  test_config <- as.data.table(test_config)

  # adding an id column, as I'm reshaping the data, helps for identification of rows
  # throughout the process
  test_target[,id:=1:.N]

  # wide to long format
  test_target2 <- melt(test_target, id.vars="id")
  head(test_target2)

  # pull in the configuration, can join up on one column now
  test_target2 <- merge(test_target2, test_config, by.x="variable",
                        by.y="string_col", all.x=TRUE)

  # Copy the original values, then apply gsub only where a pattern exists.
  # Grouping by id and variable makes every group a single row, so gsub
  # receives one pattern and one replacement per call.
  test_target2[, result := value]
  test_target2[!is.na(pattern), result := gsub(pattern, replacement, value),
               by = .(id, variable)]

  # cast from long back to the original wide format, and drop the id
  output <- dcast(test_target2, id~variable,
                  value.var = "result")
  output[, id := NULL]

  # back to tibble
  output <- as_tibble(output)

  return(output)

}

output <- multiColStringReplace(test_target, test_config)
output

result <- dplyr::tibble(
  col1 = c("Foo", "bar", "", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", ""),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)
output == result

# compare with old method
old <- test_target
for (i in seq(nrow(test_config))) {
  old <- dplyr::mutate_at(old,
                          .vars = dplyr::vars(
                            tidyselect::matches(test_config$string_col[[i]])),
                          .funs = dplyr::funs(
                            stringr::str_replace_all(
                              ., test_config$pattern[[i]], 
                              test_config$replacement[[i]]))
  )
}
old == result

# speed improves, but complexity rises
microbenchmark::microbenchmark("old" = {
  old <- test_target
  for (i in seq(nrow(test_config))) {
    old <- dplyr::mutate_at(old,
                            .vars = dplyr::vars(
                              tidyselect::matches(test_config$string_col[[i]])),
                            .funs = dplyr::funs(
                              stringr::str_replace_all(
                                ., test_config$pattern[[i]], 
                                test_config$replacement[[i]]))
    )
  }
},
"data.table" = {
  multiColStringReplace(test_target, test_config)
}, times = 20)

Answer 3:

For posterity's sake, pmap_dfr will also run if I pass the test_target tibble wrapped in a list (but it's not a good solution):

purrr::pmap_dfr(
  list(list(test_target),
       test_config$string_col,
       test_config$pattern,
       test_config$replacement),
  testFun
) %>% dplyr::distinct()

Although it runs, this isn't a good solution: it recycles the one-element list holding test_target, effectively making a copy of the tibble for each row of test_config as it advances through the arguments, then binds the rows of the resulting 4 tibbles together into one big output tibble (which I'm filtering back down with distinct()). Each copy carries only one column's replacement, so distinct() can drop exact duplicate rows but cannot merge the per-column changes into single rows.

There may be some way to use a <<- style approach to avoid duplicating the target tibble, but that's even more weird and bad.
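
One way to avoid both the duplication and <<- might be a fold rather than a map. The sketch below is an assumption on my part and not something tried in this thread: purrr::reduce carries the tibble along as the accumulator while stepping through the rules, and pmap(test_config, list) builds the rule list straight from the configuration tibble. The names `rules` and `folded` are illustrative, and columns are selected by exact name rather than with matches().

```r
library(dplyr)
library(purrr)
library(stringr)

test_config <- tibble(
  string_col = c("col1", "col2", "col4", "col3"),
  pattern = c("^\\.$", "^NA$", "^$", "^NULL$"),
  replacement = c("", "", "", "")
)
test_target <- tibble(
  col1 = c("Foo", "bar", ".", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "NA", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", "NULL"),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)

# pmap() over the configuration tibble itself: each row becomes one
# named list(string_col = , pattern = , replacement = )
rules <- pmap(test_config, list)

# Fold the rules over the target: the tibble is the accumulator, so every
# step sees the changes made by the previous ones and nothing is row-bound
folded <- reduce(rules, function(df, rule) {
  df[[rule$string_col]] <- str_replace_all(
    df[[rule$string_col]], rule$pattern, rule$replacement
  )
  df
}, .init = test_target)
```

Since the accumulator is a single tibble, the output keeps the 5 rows of test_target and no distinct() step is needed.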

Answer 4:

FYI, benchmarking results - the "awkward tidy" approach @camille suggested is the winner on my hardware!

Unit: milliseconds
          expr       min        lq      mean    median        uq      max neval
          loop 14.808278 16.098818 17.937283 16.811716 20.438360 24.38021    20
 pmap_function  9.486146 10.157526 10.978879 10.628205 11.112485 15.39436    20
     nice_tidy  8.313868  8.633266  9.597485  8.986735  9.870532 14.32946    20
  awkward_tidy  1.535919  1.639706  1.772211  1.712177  1.783465  2.87615    20
    data.table  5.611538  5.652635  8.323122  5.784507  6.359332 51.63031    20

Source: https://stackoverflow.com/questions/53071578/using-pmap-with-a-to-apply-different-regular-expressions-to-different-variables

Tags

purrr