问题
I want to remove multiple patterns from multiple character vectors. Currently I am going:
a.vector <- gsub("@\\w+", "", a.vector)
a.vector <- gsub("http\\w+", "", a.vector)
a.vector <- gsub("[[:punct:]], "", a.vector)
etc etc.
This is painful. I was looking at this question & answer: R: gsub, pattern = vector and replacement = vector but it's not solving the problem.
Neither the mapply
nor the mgsub
are working. I made these vectors
remove <- c("@\\w+", "http\\w+", "[[:punct:]]")
substitute <- c("")
Neither mapply(gsub, remove, substitute, a.vector)
nor mgsub(remove, substitute, a.vector) worked.
a.vector
looks like this:
[4951] "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
[4952] "@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
I want:
[4951] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
[4952] "you are phenomenal #mental #Writing" `
回答1:
I know this answer is late on the scene but it stems from my dislike of having to manually list the removal patterns inside the grep
functions (see other solutions here). My idea is to set the patterns beforehand, retain them as a character vector, then paste them (i.e. when "needed") using the regex
seperator "|"
:
library(stringr)
remove <- c("@\\w+", "http\\w+", "[[:punct:]]")
a.vector <- str_remove_all(a.vector, paste(remove, collapse = "|"))
Yes, this does effectively do the same as some of the other answers here, but I think my solution allows you to retain the original "character removal vector" remove
.
回答2:
Try combining your subpatterns using |
. For example
>s<-"@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
> gsub("@\\w+|http\\w+|[[:punct:]]", "", s)
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
But this could become problematic if you have a large number of patterns, or if the result of applying one pattern creates matches to others.
Consider creating your remove
vector as you suggested, then applying it in a loop
> s1 <- s
> remove<-c("@\\w+","http\\w+","[[:punct:]]")
> for (p in remove) s1 <- gsub(p, "", s1)
> s1
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
This approach will need to be expanded to apply it to the entire table or vector, of course. But if you put it into a function which returns the final string, you should be able to pass that to one of the apply
variants
回答3:
In case the multiple patterns that you are looking for are fixed and don't change from case-to-case, you can consider creating a concatenated regex that combines all of the patterns into one uber regex pattern.
For the example you provided, you can try:
removePat <- "(@\\w+)|(http\\w+)|([[:punct:]])"
a.vector <- gsub(removePat, "", a.vector)
回答4:
I had a vector with statement "my final score" and I wanted to keep on the word final and remove the rest. This what worked for me based on Marian suggestion:
str_remove_all("my final score", "my |score")
note: "my final score" is just an example. I was dealing with a vector.
来源:https://stackoverflow.com/questions/29036960/remove-multiple-patterns-from-text-vector-r