问题
This question is very similar to Using pmap to apply different regular expressions to different variables in a tibble?, but differs because I realized my examples were not sufficient to describe my problem.
I'm trying to apply different regular expressions to different variables in a tibble. For example, I've made a tibble listing 1) the variable name I want to modify, 2) the regex I want to match, and 3) the replacement string. I'd like to apply the regex/replacement to the variable in a different data frame. Note that there may be variables in the target tibble that I don't want to modify, and the row order in my "configuration" tibble may not correspond to the column/variable order in my "target" tibble.
So my "configuration" tibble could look like this:
test_config <- dplyr::tibble(
string_col = c("col1", "col2", "col4", "col3"),
pattern = c("^\\.$", "^NA$", "^$", "^NULL$"),
replacement = c("","","", "")
)
I'd like to apply this to a target tibble:
test_target <- dplyr::tibble(
col1 = c("Foo", "bar", ".", "NA", "NULL"),
col2 = c("Foo", "bar", ".", "NA", "NULL"),
col3 = c("Foo", "bar", ".", "NA", "NULL"),
col4 = c("NULL", "NA", "Foo", ".", "bar"),
col5 = c("I", "am", "not", "changing", ".")
)
So the goal is to replace a different string with an empty string in user-specified column/variables of the test_target.
The result should be like this:
result <- dplyr::tibble(
col1 = c("Foo", "bar", "", "NA", "NULL"),
col2 = c("Foo", "bar", ".", "", "NULL"),
col3 = c("Foo", "bar", ".", "NA", ""),
col4 = c("NULL", "NA", "Foo", ".", "bar"),
col5 = c("I", "am", "not", "changing", ".")
)
I can do what I want with a for loop, like this:
for (i in seq(nrow(test_config))) {
test_target <- dplyr::mutate_at(test_target,
.vars = dplyr::vars(
tidyselect::matches(test_config$string_col[[i]])),
.funs = dplyr::funs(
stringr::str_replace_all(
., test_config$pattern[[i]],
test_config$replacement[[i]]))
)
}
Instead, is there a more tidy way to do what I want?
So far, thinking that purrr::pmap
was the tool for the job, I've made a function that takes a data frame, variable name, regular expression, and replacement value and returns the data frame with a single variable modified. It behaves as expected:
testFun <- function(df, colName, regex, repVal){
colName <- dplyr::enquo(colName)
df <- dplyr::mutate_at(df,
.vars = dplyr::vars(
tidyselect::matches(!!colName)),
.funs = dplyr::funs(
stringr::str_replace_all(., regex, repVal))
)
}
# try with example
out <- testFun(test_target,
test_config$string_col[[1]],
test_config$pattern[[1]],
"")
However, when I try to use that function with pmap
, I run into a couple problems:
1) is there a better way to build the list for the pmap call than this?
purrr::pmap(
list(test_target,
test_config$string_col,
test_config$pattern,
test_config$replacement),
testFun
)
2) When I call pmap
, I get an error:
Error: Element 2 has length 4, not 1 or 5.
So pmap
isn't happy that I'm trying to pass a tibble of length 5 as an element of a list whose other elements are of length 4 (I thought it would recycle the tibble).
Note also that previously, when I called pmap
with a 4-row tibble, I got a different error,
Error in UseMethod("tbl_vars") :
no applicable method for 'tbl_vars' applied to an object of class "character"
Called from: tbl_vars(tbl)
Can any of you suggest a way to use pmap to do what I want, or is there a different or better tidyverse approach to the problem?
Thanks!
回答1:
Here are two tidyverse
ways. One is similar to the data.table
answer, in that it involves reshaping the data, joining it with the configs, and reshaping back to wide. The other is purrr
-based and, in my opinion, a little bit of a weird approach. I'd recommend the first, since it feels more intuitive.
Use tidyr::gather
to make the data long-shaped, then dplyr::left_join
to make sure that every text value from test_target
has a corresponding pattern & replacement—even the cases (col5) without patterns will be retained by using a left join.
library(tidyverse)
...
test_target %>%
gather(key = col, value = text) %>%
left_join(test_config, by = c("col" = "string_col"))
#> # A tibble: 25 x 4
#> col text pattern replacement
#> <chr> <chr> <chr> <chr>
#> 1 col1 Foo "^\\.$" ""
#> 2 col1 bar "^\\.$" ""
#> 3 col1 . "^\\.$" ""
#> 4 col1 NA "^\\.$" ""
#> 5 col1 NULL "^\\.$" ""
#> 6 col2 Foo ^NA$ ""
#> 7 col2 bar ^NA$ ""
#> 8 col2 . ^NA$ ""
#> 9 col2 NA ^NA$ ""
#> 10 col2 NULL ^NA$ ""
#> # ... with 15 more rows
Using an ifelse
replace the pattern where a pattern exists, or keep the original text if the pattern doesn't. Keep just the necessary patterns, add a row number because spread
needs unique IDs, and make the data wide again.
test_target %>%
gather(key = col, value = text) %>%
left_join(test_config, by = c("col" = "string_col")) %>%
mutate(new_text = ifelse(is.na(pattern), text, str_replace(text, pattern, replacement))) %>%
select(col, new_text) %>%
group_by(col) %>%
mutate(row = row_number()) %>%
spread(key = col, value = new_text) %>%
select(-row)
#> # A tibble: 5 x 5
#> col1 col2 col3 col4 col5
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Foo Foo Foo NULL I
#> 2 bar bar bar NA am
#> 3 "" . . Foo not
#> 4 NA "" NA . changing
#> 5 NULL NULL "" bar .
The second way is to make a tiny tibble of just the column names, join that with the configs, and split into a list of lists. Then purrr::map2_dfc
maps over both this list you've created and the columns of test_target
, and returns a data frame by cbind
ing. The reason this works is that data frames are technically lists of columns, so if you map over a data frame, you're treating each column like a list item. I couldn't get a ifelse
to work right here—something in the logic had only single strings coming back instead of the whole vector.
tibble(all_cols = names(test_target)) %>%
left_join(test_config, by = c("all_cols" = "string_col")) %>%
split(.$all_cols) %>%
map(as.list) %>%
map2_dfc(test_target, function(info, text) {
if (is.na(info$pattern)) {
text
} else {
str_replace(text, info$pattern, info$replacement)
}
})
#> # A tibble: 5 x 5
#> col1 col2 col3 col4 col5
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Foo Foo Foo NULL I
#> 2 bar bar bar NA am
#> 3 "" . . Foo not
#> 4 NA "" NA . changing
#> 5 NULL NULL "" bar .
Created on 2018-10-30 by the reprex package (v0.2.1)
回答2:
I'm not experienced with purrr
and dplyr
, but here is an approach with data.table
. The approach can be moved in to dplyr with a bit of googling :)
In terms of interpretability, the approach with the loop is arguably better as its simpler.
edit: pushed some changes to code, wasn't using purrr in the end
# alternative with data.table
library(data.table)
library(dplyr)
# objects
test_config <- dplyr::tibble(
string_col = c("col1", "col2", "col4", "col3"),
pattern = c("^\\.$", "^NA$", "^$", "^NULL$"),
replacement = c("","","", "")
)
test_target <- dplyr::tibble(
col1 = c("Foo", "bar", ".", "NA", "NULL"),
col2 = c("Foo", "bar", ".", "NA", "NULL"),
col3 = c("Foo", "bar", ".", "NA", "NULL"),
col4 = c("NULL", "NA", "Foo", ".", "bar"),
col5 = c("I", "am", "not", "changing", ".")
)
multiColStringReplace <- function(test_target, test_config){
# data.table conversion
test_target <- as.data.table(test_target)
test_config <- as.data.table(test_config)
# adding an id column, as I'm reshaping the data, helps for identification of rows
# throughout the process
test_target[,id:=1:.N]
# wide to long format
test_target2 <- melt(test_target, id.vars="id")
head(test_target2)
# pull in the configuration, can join up on one column now
test_target2 <- merge(test_target2, test_config, by.x="variable",
by.y="string_col", all.x=TRUE)
# this bit still looks messy to me, haven't used pmap before.
# I've had to subset the data to the required format, run the pmap with gsub
# to complete the task, then assign the unlisted vector back in to the original
# data. Would like to see a better option too!
test_target2[, result := value]
test_target2[!is.na(pattern), result := gsub(pattern, replacement, value),
by = .(id, variable)]
# case from long to original format, and drop the id
output <- dcast(test_target2, id~variable,
value.var = "result")
output[, id := NULL]
# back to tibble
output <- as_tibble(output)
return(output)
}
output <- multiColStringReplace(test_target, test_config)
output
result <- dplyr::tibble(
col1 = c("Foo", "bar", "", "NA", "NULL"),
col2 = c("Foo", "bar", ".", "", "NULL"),
col3 = c("Foo", "bar", ".", "NA", ""),
col4 = c("NULL", "NA", "Foo", ".", "bar"),
col5 = c("I", "am", "not", "changing", ".")
)
output == result
# compare with old method
old <- test_target
for (i in seq(nrow(test_config))) {
old <- dplyr::mutate_at(old,
.vars = dplyr::vars(
tidyselect::matches(test_config$string_col[[i]])),
.funs = dplyr::funs(
stringr::str_replace_all(
., test_config$pattern[[i]],
test_config$replacement[[i]]))
)
}
old == result
# speed improves, but complexity rises
microbenchmark::microbenchmark("old" = {
old <- test_target
for (i in seq(nrow(test_config))) {
old <- dplyr::mutate_at(old,
.vars = dplyr::vars(
tidyselect::matches(test_config$string_col[[i]])),
.funs = dplyr::funs(
stringr::str_replace_all(
., test_config$pattern[[i]],
test_config$replacement[[i]]))
)
}
},
"data.table" = {
multiColStringReplace(test_target, test_config)
}, times = 20)
回答3:
For posterity's sake, I can also accomplish this task if I pass the test_target
tibble to pmap_dfr
as a list (but it's not a good solution):
purrr::pmap_dfr(
list(list(test_target),
test_config$string_col,
test_config$pattern,
test_config$replacement),
testFun
) %>% dplyr::distinct()
Although it works, this isn't a good solution because it recycles the elements of the test_target
list, effectively making a copies of test_target tibble for each line of test_config as it advances though the arguments, then binds the rows of the resulting 4 tibbles together to make a big final output tibble (which I'm filtering back down with the distinct()
.
There may be some way to do something like a <<-
-like approach to avoid duplicating the target tibble, but that's even more weird and bad.
回答4:
FYI, benchmarking results - the "awkward tidy" approach @camille suggested is the winner on my hardware!
Unit: milliseconds
expr min lq mean median uq max neval
loop 14.808278 16.098818 17.937283 16.811716 20.438360 24.38021 20
pmap_function 9.486146 10.157526 10.978879 10.628205 11.112485 15.39436 20
nice_tidy 8.313868 8.633266 9.597485 8.986735 9.870532 14.32946 20
awkward_tidy 1.535919 1.639706 1.772211 1.712177 1.783465 2.87615 20
data.table 5.611538 5.652635 8.323122 5.784507 6.359332 51.63031 20
来源:https://stackoverflow.com/questions/53071578/using-pmap-with-a-to-apply-different-regular-expressions-to-different-variables