Question
I have a list of words in R as shown below:
myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
I want to remove every word that appears in this list from the following text:
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
After removing the unwanted myList words, the myText should look like:
This is Sample Text, which is better and cleaned, where is not equal to. This is messy text.
I was using:
stringr::str_replace_all(myText,"[^a-zA-Z\\s]", " ")
But this does not help. What should I do?
Answer 1:
You may use a PCRE regex with the base R gsub function (it will also work with the ICU regex engine in str_replace_all):
\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)
Details
\s* - 0 or more whitespaces
(?<!\w) - a negative lookbehind that ensures there is no word char immediately before the current location
(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group containing the escaped items inside the character vector with the words you need to remove
(?!\w) - a negative lookahead that ensures there is no word char immediately after the current location.
NOTE: We cannot use the \b word boundary here because the items in the myList character vector may start or end with non-word characters, and the meaning of \b is context-dependent.
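To make the note above concrete, here is a small sketch comparing \b with the lookarounds for the term -1.00, which starts with a non-word character. The two sample strings are made up for illustration and are not part of the original question.

```r
# Two illustrative inputs: the term after a space, and the term glued to a letter.
after_space  <- "equal to -1.00."
after_letter <- "a-1.00"

# \b matches only between a word char and a non-word char, so in front of "-"
# it effectively requires a word character on the left.
wb_space  <- grepl("\\b-1\\.00\\b", after_space,  perl = TRUE)
wb_letter <- grepl("\\b-1\\.00\\b", after_letter, perl = TRUE)

# The lookarounds only forbid word characters next to the term.
la_space  <- grepl("(?<!\\w)-1\\.00(?!\\w)", after_space,  perl = TRUE)
la_letter <- grepl("(?<!\\w)-1\\.00(?!\\w)", after_letter, perl = TRUE)

print(c(wb_space = wb_space, wb_letter = wb_letter,
        la_space = la_space, la_letter = la_letter))
```

With \b the standalone " -1.00" is missed while "a-1.00" is matched, which is the opposite of what is wanted; the lookaround version behaves the other way round.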
See an R demo online:
myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) }
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
cat(pat, "\n")
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."
Details
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) } - escapes all special chars that need escaping in a PCRE pattern
paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - creates a |-separated alternative list from the search term vector.
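Since the question used stringr, the same pattern can probably be passed to str_replace_all unchanged. The following is a minimal sketch, assuming the stringr package is installed; it reuses the escape_for_pcre helper from the answer as-is, and the variable name cleaned is only illustrative.

```r
library(stringr)  # assumption: stringr is installed

myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."

# Same escaping helper as in the answer above.
escape_for_pcre <- function(s) gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)

# Build the whole-word pattern; gsub is vectorised, so sapply is not needed here.
pat <- paste0("\\s*(?<!\\w)(?:", paste(escape_for_pcre(myList), collapse = "|"), ")(?!\\w)")

# stringr uses the ICU engine, which also supports these lookarounds.
cleaned <- str_replace_all(myText, pat, "")
print(cleaned)
```

The result should match the gsub output shown above, because the pattern relies only on constructs that both PCRE and ICU support.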
Answer 2:
gsub(paste0(myList, collapse = "|"), "", myText)
gives:
[1] "This is Sample Text, which is better and cleaned , where is not equal to . This is messy text ."
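One caveat with this shorter approach, sketched below under the assumption that the input may contain longer words: the plain alternation is neither anchored to whole words nor escaped. The sample strings "that cat" and "-1x00" are made up for illustration.

```r
myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
sample_text <- "that cat"   # illustrative input, not from the question

# Plain alternation: "at" is also removed from inside longer words.
naive <- gsub(paste0(myList, collapse = "|"), "", sample_text)

# Whole-word pattern with lookarounds and an escaped dot leaves them alone.
pat <- "\\s*(?<!\\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\\.00)(?!\\w)"
whole <- gsub(pat, "", sample_text, perl = TRUE)

# The unescaped "." in "-1.00" matches any character, not just a dot.
dot_hit <- grepl("-1.00", "-1x00")

print(naive)
print(whole)
print(dot_hit)
```

Here naive becomes "th c" while whole stays "that cat", so the shorter version is only safe when the listed terms cannot occur inside other words.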
Source: https://stackoverflow.com/questions/51174108/remove-a-list-of-whole-words-that-may-contain-special-chars-from-a-character-vec