dplyr mutate specific columns by evaluating lookup cell value

问题

I have explored various options using quosures, symbols, and evaluation, but I can't seem to get the right syntax. Here is an example dataframe.

data.frame("A" = letters[1:4], "B" = letters[26:23], "C" = letters[c(1,3,5,7)], "D" = letters[c(2,4,6,8)], "pastecols" = c("B, C","B, D", "B, C, D", NA))
  A B C D pastecols
1 a z a b      B, C
2 b y c d      B, D
3 c x e f   B, C, D
4 d w g h      <NA>

Now suppose I want to paste values from different columns based on the lookup string in pastecols, and I always want to include column A. This is my desired result:

  A B C D pastecols  result
1 a z a b      B, C   a z a
2 b y c d      B, D   b y d
3 c x e f   B, C, D c x e f
4 d w g h      <NA>       d

Ideally this could be done in dplyr. This is the closest I have gotten:

x %>% mutate(result = lapply(lapply(str_split(pastecols, ", "), c, "A"), na.omit))
  A B C D pastecols     result
1 a z a b      B, C    B, C, A
2 b y c d      B, D    B, D, A
3 c x e f   B, C, D B, C, D, A
4 d w g h      <NA>          A

回答1:

Here's one way using pmap to do a similar thing. pmap can be used to effectively work on dataframes by row by capturing each row as a named vector; you can then get the desired column names for indexing as cols by selecting them with ["pastecols"].

Most of the anonymous function syntax is not tidyverse stuff, but just basic R stuff. To walk through it:

Pass the dataframe as the list to the .l argument of pmap_chr. Remember that dataframes are lists of columns!
Capture all the ... arguments with c(...). Basically we are calling each row of the dataframe as arguments to the function; now row is a named vector containing the row. Note that if you have list-columns this will break, (but so will a lot of other things here so I assume there aren't any...)
We can get the values of row that we want from row["pastecols"], but we need to turn (say) "B, C" into c("A", "B", "C") to do that. This next line just adds the "A", replaces missing values with "A", splits into pieces if there are any, and then indexes back down into the list. The [[ part is just how you do list[[1]]" in a pipe chain, it's the prefix form of the operator. You need this because str_split returns a list and we just want the vector.
Use this cols vector to get the desired values from row and return it, collapsed into a length 1 character vector!

library(tidyverse)
tbl <- tibble("A" = letters[1:4], "B" = letters[26:23], "C" = letters[c(1,3,5,7)], "D" = letters[c(2,4,6,8)], "pastecols" = c("B, C","B, D", "B, C, D", NA))

tbl %>%
  mutate(result = pmap_chr(
    .l = .,
    .f = function(...){
      row <-  c(...)
      cols <- row["pastecols"] %>% str_c("A, ", .) %>% replace_na("A") %>% str_split(", ") %>% `[[`(1)
      vals <- row[cols] %>% str_c(collapse = ", ")
      return(vals)
    }
  ))
#> # A tibble: 4 x 6
#>   A     B     C     D     pastecols result    
#>   <chr> <chr> <chr> <chr> <chr>     <chr>     
#> 1 a     z     a     b     B, C      a, z, a   
#> 2 b     y     c     d     B, D      b, y, d   
#> 3 c     x     e     f     B, C, D   c, x, e, f
#> 4 d     w     g     h     <NA>      d

Created on 2018-12-03 by the reprex package (v0.2.0).

回答2:

Not the most elegant solution but gets the job done with just base R. If column A never shows up in pastecols you can remove unique() from the code.

for(r in seq_len(nrow(df))) {
  df$result[r] <- paste(
                    df[r, na.omit(unique(c("A", unlist(strsplit(df$pastecols[r], ", ")))))],
                    collapse = " "
                  )
}
df

  A B C D pastecols  result
1 a z a b      B, C   a z a
2 b y c d      B, D   b y d
3 c x e f   B, C, D c x e f
4 d w g h      <NA>       d

Data -

df <- data.frame(
  "A" = letters[1:4], 
  "B" = letters[26:23], 
  "C" = letters[c(1,3,5,7)], 
  "D" = letters[c(2,4,6,8)], 
  "pastecols" = c("B, C","B, D", "B, C, D", NA), stringsAsFactors = F
)

回答3:

Here's a different way that doesn't rely on iterating functions in the apply or map families, if you would prefer to avoid them, and tries to leverage the tidyr side of the tidyverse. The approach is basically to expand the dataframe with gather and separate_rows into each combination of pastecols and actual columns, and then filter so we only keep the ones that match for each rowid. Once we have that, we can group_by and summarise to bring it back to one row per rowid. There is a bunch of housekeeping to deal with the fact that you always have column A, and note that I leave A in the output pastecols, but you can remove that if you want to.

library(tidyverse)
tbl <- tibble("A" = letters[1:4], "B" = letters[26:23], "C" = letters[c(1,3,5,7)], "D" = letters[c(2,4,6,8)], "pastecols" = c("B, C","B, D", "B, C, D", NA))

tbl %>%
  rowid_to_column() %>%
  mutate(
    pastecols = str_c("A, ", pastecols),
    pastecols = if_else(is.na(pastecols), "A", pastecols)
  ) %>%
  gather(colname, value, -pastecols, -rowid) %>%
  separate_rows(pastecols) %>%
  filter(pastecols == colname) %>%
  group_by(rowid) %>%
  summarise(
    pastecols = str_c(pastecols, collapse = ", "),
    result = str_c(value, collapse = ", ")
  )
#> # A tibble: 4 x 3
#>   rowid pastecols  result    
#>   <int> <chr>      <chr>     
#> 1     1 A, B, C    a, z, a   
#> 2     2 A, B, D    b, y, d   
#> 3     3 A, B, C, D c, x, e, f
#> 4     4 A          d

Created on 2018-12-03 by the reprex package (v0.2.0).

来源：https://stackoverflow.com/questions/53599684/dplyr-mutate-specific-columns-by-evaluating-lookup-cell-value

标签

dplyr

eval