remove multiple patterns from text vector r

若如初见. 提交于 2019-12-08 07:04:33

问题


I want to remove multiple patterns from multiple character vectors. Currently I am going:

a.vector <- gsub("@\\w+", "", a.vector)
a.vector <- gsub("http\\w+", "", a.vector)
a.vector <- gsub("[[:punct:]], "", a.vector)

etc etc.

This is painful. I was looking at this question & answer: R: gsub, pattern = vector and replacement = vector but it's not solving the problem.

Neither the mapply nor the mgsub are working. I made these vectors

remove <- c("@\\w+", "http\\w+", "[[:punct:]]")
substitute <- c("")

Neither mapply(gsub, remove, substitute, a.vector) nor mgsub(remove, substitute, a.vector) worked.

a.vector looks like this:

[4951] "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
[4952] "@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"   

I want:

[4951] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
[4952] "you are phenomenal #mental #Writing"   `

回答1:


I know this answer is late on the scene but it stems from my dislike of having to manually list the removal patterns inside the grep functions (see other solutions here). My idea is to set the patterns beforehand, retain them as a character vector, then paste them (i.e. when "needed") using the regex seperator "|":

library(stringr)

remove <- c("@\\w+", "http\\w+", "[[:punct:]]")

a.vector <- str_remove_all(a.vector, paste(remove, collapse = "|"))

Yes, this does effectively do the same as some of the other answers here, but I think my solution allows you to retain the original "character removal vector" remove.




回答2:


Try combining your subpatterns using |. For example

>s<-"@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
> gsub("@\\w+|http\\w+|[[:punct:]]", "", s)
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"

But this could become problematic if you have a large number of patterns, or if the result of applying one pattern creates matches to others.

Consider creating your remove vector as you suggested, then applying it in a loop

> s1 <- s
> remove<-c("@\\w+","http\\w+","[[:punct:]]")
> for (p in remove) s1 <- gsub(p, "", s1)
> s1
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"

This approach will need to be expanded to apply it to the entire table or vector, of course. But if you put it into a function which returns the final string, you should be able to pass that to one of the apply variants




回答3:


In case the multiple patterns that you are looking for are fixed and don't change from case-to-case, you can consider creating a concatenated regex that combines all of the patterns into one uber regex pattern.

For the example you provided, you can try:

removePat <- "(@\\w+)|(http\\w+)|([[:punct:]])"

a.vector <- gsub(removePat, "", a.vector)



回答4:


I had a vector with statement "my final score" and I wanted to keep on the word final and remove the rest. This what worked for me based on Marian suggestion:

str_remove_all("my final score", "my |score")

note: "my final score" is just an example. I was dealing with a vector.



来源:https://stackoverflow.com/questions/29036960/remove-multiple-patterns-from-text-vector-r

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!