Remove a list of whole words that may contain special chars from a character vector without matching parts of words

Submitted by 烈酒焚心 on 2019-12-02 14:01:31

Question


I have a list of words in R as shown below:

 myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")

And I want to remove the words which are found in the above list from the text as below:

 myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."

After removing the unwanted myList words, the myText should look like:

  This is Sample Text, which is better and cleaned, where is not equal to. This is messy text.

I was using:

  stringr::str_replace_all(myText,"[^a-zA-Z\\s]", " ")

But this does not help: it replaces every character that is not a letter or whitespace instead of removing the listed words. What should I do?


Answer 1:


You may use a PCRE regex with the base R gsub function (the same pattern also works with the ICU regex engine used by str_replace_all):

\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)


Details

  • \s* - zero or more whitespace characters
  • (?<!\w) - a negative lookbehind that ensures there is no word character immediately before the current location
  • (?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group listing the escaped items of the character vector, i.e. the words you need to remove
  • (?!\w) - a negative lookahead that ensures there is no word character immediately after the current location.
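
To see how the two lookarounds protect longer words, here is a minimal sketch using a shortened pattern with only two terms. The sample sentence is made up for illustration and is not from the question.

```r
# Same structure as the pattern above, shortened to two terms.
pat <- "\\s*(?<!\\w)(?:at|ax)(?!\\w)"

# Made-up sample text: "that", "cat", "sat", "max" and "taxi" contain
# "at" or "ax" only as parts of longer words.
sample_text <- "that cat sat at home with an ax, a max and a taxi"

# perl = TRUE is needed because lookarounds are PCRE syntax.
result <- gsub(pat, "", sample_text, perl = TRUE)
print(result)
```

Only the standalone " at" and " ax" are removed; occurrences inside longer words are left alone because a word character sits directly before or after them.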

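NOTE: A \b word boundary cannot be used here because items in the myList character vector may start or end with non-word characters, and the meaning of \b depends on context. For example, \b before -1.00 requires a word character immediately before the hyphen, so -1.00 preceded by a space would not match.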
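
The difference can be checked with a small sketch. The sample string is an assumption made up for this illustration; only the term -1.00 comes from the question.

```r
# Made-up sample: one standalone "-1.00" and one glued to a word character.
txt <- "not equal to -1.00. But x-1.00 is kept"

# With \b, the standalone term is missed (no boundary between " " and "-"),
# while the one glued to "x" is removed.
with_b <- gsub("\\b-1\\.00\\b", "", txt, perl = TRUE)

# With lookarounds, the standalone term is removed and "x-1.00" is left alone.
with_look <- gsub("(?<!\\w)-1\\.00(?!\\w)", "", txt, perl = TRUE)

print(with_b)
print(with_look)
```

So \b gives exactly the opposite of the intended behaviour for a term that starts with a non-word character.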

A complete R example:

myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
# Escape regex metacharacters so that each term is matched literally
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) }
# Join the escaped terms with | and wrap them in the lookarounds
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
# Print the resulting pattern
cat(pat, "\n")
# perl=TRUE is required for the lookbehind/lookahead syntax
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."

Details

  • escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) } - escapes all special characters that need escaping in a PCRE pattern
  • paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - creates a |-separated list of alternatives from the search term vector.
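
The two steps can also be wrapped in a reusable function. The sketch below is one possible arrangement, not part of the original answer: the name remove_whole_words, the empty-vector guard and the sample inputs are assumptions added for illustration. It relies on gsub being vectorised, so the escaping function can be applied to the whole vector without sapply.

```r
# Escaping function taken from the answer above.
escape_for_pcre <- function(s) gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)

# gsub() is vectorised, so a whole vector can be escaped in one call.
escaped <- escape_for_pcre(c("-1.00", "C++", "a|b"))
writeLines(escaped)

# Hypothetical helper (name and guard are assumptions, not from the answer).
remove_whole_words <- function(text, words) {
  # With no terms the alternation would be empty, so return the text unchanged.
  if (length(words) == 0) return(text)
  pat <- paste0("\\s*(?<!\\w)(?:",
                paste(escape_for_pcre(words), collapse = "|"),
                ")(?!\\w)")
  gsub(pat, "", text, perl = TRUE)
}

# Made-up inputs: the text argument may itself be a vector.
cleaned <- remove_whole_words(c("C++ is at home", "price is -1.00."),
                              c("at", "C++", "-1.00"))
print(cleaned)
```

Because the pattern only consumes whitespace before a removed term, a term at the very start of a string leaves the following space behind; trimws() could be applied afterwards if that matters.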



Answer 2:


gsub(paste0(myList, collapse = "|"), "", myText)

gives:

[1] "This is  Sample  Text, which  is  better and cleaned , where  is not equal to . This is messy text ."
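
Note that this pattern neither escapes the terms nor restricts matches to whole words, so it leaves the surrounding spaces in place and can also remove parts of longer words. A small sketch of that behaviour follows; the word list and sentence are made up for illustration.

```r
# Made-up inputs: "at" occurs inside "that" and "cat",
# and "-1x00" is not the literal term "-1.00".
words <- c("at", "-1.00")
txt <- "that cat costs -1x00 at most"

# Plain alternation: "at" matches inside words, and the unescaped "."
# matches any character, so "-1x00" is removed as well.
naive <- gsub(paste0(words, collapse = "|"), "", txt)
print(naive)
```

On the question's own sample text this happens to do no harm, because none of the other words contain a listed term.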


Source: https://stackoverflow.com/questions/51174108/remove-a-list-of-whole-words-that-may-contain-special-chars-from-a-character-vec
