dplyr - mutate formula based on similarities in column names

*爱你&永不变心* 提交于 2019-11-28 06:15:19

问题


I am trying to find a better way to run a mutate() on a combination of columns based on parts of the column names.

For example, a way to simplify the mutate function in the following code:

df <- data.frame(LIMITED_A = c(100,200),
                UNLIMITED_A = c(25000,50000),
                LIMITED_B = c(300,300),
                UNLIMITED_B = c(500,500),
                LIMITED_C = c(2,10),
                UNLIMITED_C = c(5,20))

df %>%
  mutate(FINAL_LIMITED = (LIMITED_A - LIMITED_B) / LIMITED_C,
         FINAL_UNLIMITED = (UNLIMITED_A - UNLIMITED_B) / UNLIMITED_C)

A formula with the form: (._A - ._B) / ._C and the result is given the name FINAL_.

Is there a way to simplify this to a single line of code in the mutate function?


回答1:


Here is a different approach:

library(dplyr)
library(rlang)
library(glue)

dynamic_mutate = function(DF,  
                          col_names = gsub("(.*)_\\w+$", "\\1", names(DF)), 
                          expression = "({x}_A - {x}_B)/{x}_C",
                          prefix = "FINAL"){

  name_list = col_names %>% 
    unique() %>%
    as.list()

  expr_list = name_list %>%
    lapply(function(x) parse_quosure(glue(expression))) %>% 
    setNames(paste(prefix, name_list, sep = "_")) 

  DF %>% mutate(!!!expr_list)

}

Result:

> df %>%
+   dynamic_mutate()
  LIMITED_A UNLIMITED_A LIMITED_B UNLIMITED_B LIMITED_C UNLIMITED_C FINAL_LIMITED
1       100       25000       300         500         2           5          -100
2       200       50000       300         500        10          20           -10
  FINAL_UNLIMITED
1            4900
2            2475

> df %>%
+   dynamic_mutate(c("LIMITED", "UNLIMITED"), prefix = "NEW")
  LIMITED_A UNLIMITED_A LIMITED_B UNLIMITED_B LIMITED_C UNLIMITED_C NEW_LIMITED
1       100       25000       300         500         2           5        -100
2       200       50000       300         500        10          20         -10
  NEW_UNLIMITED
1          4900
2          2475

> df %>%
+   dynamic_mutate(c("UNLIMITED"), prefix = "NEW")
  LIMITED_A UNLIMITED_A LIMITED_B UNLIMITED_B LIMITED_C UNLIMITED_C NEW_UNLIMITED
1       100       25000       300         500         2           5          4900
2       200       50000       300         500        10          20          2475

> df %>% 
+   dynamic_mutate(c("A", "B", "C"), "LIMITED_{x} + UNLIMITED_{x}")
  LIMITED_A UNLIMITED_A LIMITED_B UNLIMITED_B LIMITED_C UNLIMITED_C FINAL_A FINAL_B FINAL_C
1       100       25000       300         500         2           5   25100     800       7
2       200       50000       300         500        10          20   50200     800      30

Notes:

This approach uses lapply and glue to construct expressions from prefixes extracted using gsub (or you can supply your own prefixes/suffixes). parse_quosure from rlang is then used to parse the expression into a quosure. As a result, expr_list is a named list of quosure's which I can then use !!! to unquote and splice the arguments into separate expressions in mutate.

You can change the formula by adjusting the expression argument as shown in the last example.

The advantage of this method is that it is quite fast because I am mainly manipulating column names and creating strings (expressions). The disadvantage is that it uses multiple packages.




回答2:


One idea is to convert the data frame to long format and conduct the calculation.

library(dplyr)
library(tidyr)

df2 <- df %>%
  mutate(ID = 1:n()) %>%
  gather(Type, Value, -ID) %>%
  separate(Type, into = c("Type", "Group")) %>%
  spread(Group, Value) %>%
  mutate(Final = (A - B)/C)
df2
  ID      Type     A   B  C Final
1  1   LIMITED   100 300  2  -100
2  1 UNLIMITED 25000 500  5  4900
3  2   LIMITED   200 300 10   -10
4  2 UNLIMITED 50000 500 20  2475

And you can always convert the data frame back to the wide format.

df3 <- df2 %>%
  gather(Group, Value, A:Final) %>%
  unite(Col, Type, Group) %>%
  spread(Col, Value) %>%
  select(colnames(df), 
         FINAL_LIMITED = LIMITED_Final, 
         FINAL_UNLIMITED = UNLIMITED_Final)
  LIMITED_A UNLIMITED_A LIMITED_B UNLIMITED_B LIMITED_C UNLIMITED_C FINAL_LIMITED FINAL_UNLIMITED
1       100       25000       300         500         2           5          -100            4900
2       200       50000       300         500        10          20           -10            2475


来源:https://stackoverflow.com/questions/46832436/dplyr-mutate-formula-based-on-similarities-in-column-names

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!