How to detect more than one regex in a case_when statement

﹥>﹥吖頭↗ submitted on 2021-02-17 02:28:06

Question


I have recently switched from ifelse to case_when from dplyr.

Aim

I would like to be able to detect more than one regex from a statement in a dataframe using case_when as follows:

Input

statement<-data.frame(statement = c("I have performed APC and RFA",
 "An EMR was done","I didn't do anything"),stringsAsFactors=FALSE)

Desired output

statement                            out

I have performed APC and RFA        APC,RFA
An EMR was done                     EMR
I didn't do anything                No Event

Attempt

library(dplyr)
library(stringr)

statement <-
  statement %>%
  mutate(
    EVENT = case_when(
      str_detect(statement, "EMR") ~ "EMR",
      str_detect(statement, "HALO|RFA") ~ "RFA",
      str_detect(statement, "APC") ~ "APC",
      TRUE ~ "No Event"
    )
  )

The problem

This gives only one output per statement rather than multiple outputs when multiple strings are present. Is there a way to detect multiple strings?


Answer 1:


1) gsubfn::strapply. strapply can do the extraction and translation all at once. For each component of stmt, strapply matches the pattern pat against it, translates every match using L, and returns the results. The empty argument defines what is returned for components of stmt that have no matches. This gives a list of matches, one list component per row, to which toString is applied to convert each into a comma-separated character string. This is the shortest of the 3 alternatives presented here.

library(gsubfn)

L <- list(APC = "APC", EMR = "EMR", HALO = "RFA", RFA = "RFA")
pat <- paste(names(L), collapse = "|")
transform(statement, 
  out = sapply(strapply(stmt, pat, L, empty = "No Event"), toString),
  stringsAsFactors = FALSE)

giving:

                          stmt      out
1 I have performed APC and RFA APC, RFA
2              An EMR was done      EMR
3         I didn't do anything No Event

2) Base R. Using L and pat from above, create a function process that takes a character vector of words x and extracts the words matched by pat into g. If g has non-zero length, translate its elements using L and collapse them into a single string using toString; otherwise, return No Event.

Now split each element of stmt into words using strsplit and apply process to each such character vector.

process <- function(x) {
  g <- grep(pat, x, value = TRUE)
  if (length(g)) toString(L[g]) else "No Event"
}
transform(statement, out = sapply(strsplit(stmt, "\\s+"), process),
  stringsAsFactors = FALSE)

3) dplyr/tidyr. Using L from (1), group by row number n and by out (a copy of stmt), and separate the words of out into separate rows. Keep only the words that appear in names(L), then collapse the rows of each group into one, translating through L and using toString to generate a comma-separated string. Drop the n column. At this point we have the desired result except that the No Event rows are still missing, so right join what we have with statement and replace the NAs with No Event.

library(dplyr)
library(tidyr)

statement %>%
  group_by(n = 1:n(), out = stmt) %>%
  separate_rows(out) %>%
  filter(out %in% names(L)) %>%
  summarize(stmt = stmt[1], out = toString(L[out])) %>%
  ungroup %>%
  select(-n) %>%
  right_join(statement, by = "stmt") %>%
  mutate(out = if_else(is.na(out), "No Event", out))

giving:

# A tibble: 3 x 2
  stmt                         out     
  <chr>                        <chr>   
1 I have performed APC and RFA APC, RFA
2 An EMR was done              EMR     
3 I didn't do anything         No Event

Note

We used this as the input:

statement <- structure(list(stmt = c("I have performed APC and RFA", 
  "An EMR was done", "I didn't do anything")), 
  class = "data.frame", row.names = c(NA, -3L))

Updates

Have revised a number of times after re-reading the question. Also added more alternatives.




Answer 2:


For each row, case_when returns the value attached to the first condition that is TRUE and ignores every later condition, so a single case_when call can never combine two outputs for one row. For that reason, conditions should be ordered from most specific to most general (which is why TRUE comes last).

If you want to stick with case_when, you can add an extra condition that checks for both strings together and place it before the conditions for the individual strings.

library(dplyr)
library(stringr)  # provides str_detect

statement %>%
  mutate(
    EVENT = case_when(
      # most specific condition first: both strings present
      str_detect(statement, "APC") & str_detect(statement, "RFA") ~ "APC,RFA",
      str_detect(statement, "EMR") ~ "EMR",
      str_detect(statement, "HALO|RFA") ~ "RFA",
      str_detect(statement, "APC") ~ "APC",
      TRUE ~ "No Event"
    )
  )



#                     statement    EVENT
#1 I have performed APC and RFA  APC,RFA
#2              An EMR was done      EMR
#3         I didn't do anything No Event
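
Listing every combination by hand grows quickly as more keywords are added. As a rough sketch of one alternative in base R (the names patterns and hits are illustrative assumptions, not part of the answer above), each event can be tested with its own regex and the labels that match pasted together:

```r
statement <- data.frame(
  statement = c("I have performed APC and RFA",
                "An EMR was done",
                "I didn't do anything"),
  stringsAsFactors = FALSE
)

# Hypothetical mapping: output label -> regex that signals it
patterns <- c(APC = "APC", EMR = "EMR", RFA = "HALO|RFA")

# One logical column per label, one row per statement
hits <- do.call(cbind, lapply(patterns, grepl, x = statement$statement))

# Paste the labels whose pattern matched; fall back to "No Event"
statement$EVENT <- apply(hits, 1, function(h) {
  if (any(h)) paste(names(patterns)[h], collapse = ",") else "No Event"
})

print(statement)
```

With this layout, adding a new event only means adding one entry to patterns; the order of labels in the output follows the order of patterns rather than the order in which the words appear in the text.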



Answer 3:


An idea via base R is to extract the words written entirely in upper-case letters and paste together those that are longer than one character (which excludes the pronoun "I"), i.e.,

sapply(
  regmatches(statement$statement,
             gregexpr('\\b[A-Z]+\\b', statement$statement)),
  function(i) {
    v1 <- i[nchar(i) > 1]  # drop single-letter matches such as "I"
    toString(v1)
  })


#[1] "APC, RFA" "EMR"      ""
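
The result above leaves an empty string where nothing matched and would report HALO as HALO rather than RFA. A possible extension, sketched here under the assumption that only a known set of keywords matters (the lookup vector and the out column name are illustrative, not part of this answer), extracts just those keywords, translates them, and fills in No Event:

```r
statement <- data.frame(
  statement = c("I have performed APC and RFA",
                "An EMR was done",
                "I didn't do anything"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: keyword found in the text -> label to report
lookup <- c(APC = "APC", EMR = "EMR", HALO = "RFA", RFA = "RFA")

# Whole-word match on any of the keywords
pat <- paste0("\\b(", paste(names(lookup), collapse = "|"), ")\\b")
found <- regmatches(statement$statement,
                    gregexpr(pat, statement$statement, perl = TRUE))

# Translate matches, drop duplicates, and handle the no-match case
statement$out <- vapply(found, function(m) {
  if (length(m) == 0) return("No Event")
  toString(unique(unname(lookup[m])))
}, character(1))

print(statement)
```

The call to unique keeps a statement that mentions both HALO and RFA from reporting RFA twice.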



Answer 4:


I don't think case_when is the best way to go. I think it depends a little bit on how many mappings like "HALO|RFA" you have. If there are many, it might be worth the time to write a proper function. However, if it is just this one, it might be faster to put together a pipe using dplyr verbs.

I would suggest using str_extract_all and unnest to get a tidy data frame with the relevant verbs and then use str_replace_all to resolve the mapping. In the end, I would use unique to make sure that we have no duplicate rows from the replacements.

Notice that the first row, which contains both APC and RFA, will be split into two rows. I realize that this is not what you asked for, but it will make subsequent processing in the tidyverse much easier. See this link for more on tidy data conventions: https://tidyr.tidyverse.org/articles/tidy-data.html

In its current implementation, unnest will drop the last row, where no pattern was matched. If you would like an NA instead, you can perform a full join with the original data. See also https://github.com/tidyverse/tidyr/issues/358

library(tidyverse)

statement <- data.frame(
  statement = c("I have performed APC and RFA",
                "An EMR was done",
                "I didn't do anything"),
  stringsAsFactors = FALSE)

statement %>%
  mutate(out = statement %>%
           str_extract_all("((EMR)|(HALO)|(RFA)|(APC))")) %>%
  unnest(.drop = FALSE) %>%
  mutate(out = out %>% str_replace_all("HALO", "RFA")) %>%
  unique() %>%
  full_join(statement)

The output will be

                 statement    out
I have performed APC and RFA  APC
I have performed APC and RFA  RFA
An EMR was done               EMR
I didn't do anything          <NA>


Source: https://stackoverflow.com/questions/53851627/how-to-detect-more-than-one-regex-in-a-case-when-statement
