How to gsub on the text between two words in R?

Submitted by 左心房为你撑大大i on 2019-12-31 03:32:09

Question


EDIT:

I would like to place a \n before a specific unknown word in my text. I know that the first occurrence of the unknown word in my text will be between "Tree" and "Lake".

Example of text:

text
[1]  "TreeRULakeSunWater" 
[2]  "A B C D"

EDIT:

"Tree" and "Lake" will never change, but the word between them is always changing, so I cannot hard-code "RU" in my regex.

What I am currently doing:

if (grepl(".*Tree\\s*|Lake.*",  text)) { text <- gsub(".*Tree\\s*|Lake.*", "\n\\1", text)}

The problem with this approach is that gsub replaces everything in the string except the unknown word, leaving just \nRU.

text
[1] "\nRU"

I have also tried:

if (grepl(".*Tree *(.*?) *Lake.*",  text)) { text <- gsub(".*Tree *(.*?) *Lake.*", "\n\\1", text)}

What I would like text to look like after gsub:

text
[1] "Tree \nRU LakeSunWater"
[2] "A B C D"

EDIT:

Following Wiktor Stribizew's comment, I am able to do a successful gsub:

gsub("Tree(\\w+)Lake", "Tree \n\\1 Lake", text)

But this only does a gsub where "RU" is between "Tree" and "Lake", which is the first occurrence of the unknown word. The unknown word (in this case "RU") will show up many times in the text, and I would like to place \n in front of every occurrence of "RU" when "RU" is a whole word.

New example of text:

text
[1] "TreeRULakeSunWater"
[2] "A B C RU D"

New example of what I would like:

text
[1] "Tree \nRU LakeSunWater"
[2] "A B C \nRU D"

Any help will be appreciated. Please let me know if further information is needed.


Answer 1:


You need to find the unknown word between "Tree" and "Lake" first. You can use

unknown_word <- gsub(".*Tree(\\w+)Lake.*", "\\1", text)

The pattern matches any characters up to the last Tree in the string, then captures the unknown word (\w+ matches one or more word characters) up to Lake, and then matches the rest of the string. gsub processes every string in the vector, and strings that do not match the pattern are returned unchanged. You can access the result for the first string with the [[1]] index.
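As a rough illustration of the extraction step, the sketch below runs the pattern on the example vector from the question. The names `has_match` and `first_word` are hypothetical helpers introduced here, not part of the original answer; they show one possible way to avoid picking up an unmatched string by accident.

```r
text <- c("TreeRULakeSunWater", "A B C RU D")

# gsub() leaves elements that do not match the pattern untouched,
# so only the first element is reduced to the captured word.
unknown_word <- gsub(".*Tree(\\w+)Lake.*", "\\1", text)
print(unknown_word)

# Alternative sketch (assumption, not from the answer): restrict the
# extraction to elements that really contain Tree...Lake, then take
# the first extracted word.
has_match <- grepl("Tree\\w+Lake", text)
first_word <- sub(".*Tree(\\w+)Lake.*", "\\1", text[has_match])[1]
print(first_word)
```

Because the second element has no Tree...Lake pair, it comes back unchanged from gsub, which is why the answer indexes the first element explicitly.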

Then, when you know the word, replace it with

gsub(paste0("[[:space:]]*(", unknown_word[[1]], ")[[:space:]]*"), " \n\\1 ", text)

See IDEONE demo.

Here, the pattern is built as [[:space:]]*( + unknown_word[[1]] + )[[:space:]]*. It matches the unknown word itself (captured into Group 1) together with zero or more whitespace characters on either side. In the replacement, the surrounding whitespace is collapsed into a single space (or a space is added if there was none), the newline is inserted, and \\1 restores the unknown word. You may replace [[:space:]] with \\s.
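A small sketch of how this replacement behaves on the example vector may help. The third element, "TRUCK", is a made-up input added here to show that this pattern also fires when the word sits inside a longer word; `unknown_word` is assumed to have been extracted already.

```r
text <- c("TreeRULakeSunWater", "A B C RU D", "TRUCK")
unknown_word <- "RU"  # assumed to be the word extracted in the previous step

# Build the pattern around the extracted word; the word is captured as group 1.
pattern <- paste0("[[:space:]]*(", unknown_word, ")[[:space:]]*")

# Surrounding whitespace is replaced by single spaces, with "\n" before the word.
result <- gsub(pattern, " \n\\1 ", text)

# print() shows the escaped form; writeLines(result) would show real line breaks.
print(result)
```

Note that "TRUCK" is split as well, since nothing in this pattern requires the match to be a whole word.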

UPDATE

If you only need to add a newline before occurrences of RU that are whole words, use the \b word boundary:

> gsub(paste0("[[:space:]]*\\b(", unknown_word[[1]], ")\\b[[:space:]]*"), " \n\\1 ", text)
[1] "TreeRULakeSunWater" "A B C \nRU D"   
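With the word boundary, the occurrence glued between "Tree" and "Lake" is no longer touched. One possible way to get both results the question asks for is to run the whole-word replacement first and then apply the Tree...Lake substitution from the question's edit. This is only a sketch under that assumption; the variable names `whole_words` and `combined`, and the extra "TRUCK" input, are hypothetical.

```r
text <- c("TreeRULakeSunWater", "A B C RU D", "TRUCK")

# Extract the unknown word from the first element, as in the answer.
unknown_word <- gsub(".*Tree(\\w+)Lake.*", "\\1", text)

# Step 1: newline before whole-word occurrences only (\\b = word boundary).
whole_words <- gsub(
  paste0("[[:space:]]*\\b(", unknown_word[[1]], ")\\b[[:space:]]*"),
  " \n\\1 ",
  text
)

# Step 2: handle the occurrence sitting directly between Tree and Lake,
# using the substitution from the question's edit.
combined <- gsub("Tree(\\w+)Lake", "Tree \n\\1 Lake", whole_words)

print(combined)
```

Running the whole-word step first keeps the two substitutions from interfering, and a word such as "TRUCK" that merely contains the letters is left alone.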


Source: https://stackoverflow.com/questions/35505036/how-to-gsub-on-the-text-between-two-words-in-r
