Extracting a string between two other strings in R

Submitted by 坚强是说给别人听的谎言 on 2019-11-26 04:27:02

Question


I am trying to find a simple way to extract an unknown substring (it could be anything) that appears between two known substrings. For example, I have a string:

a <- " anything goes here, STR1 GET_ME STR2, anything goes here"

I need to extract the string GET_ME, which lies between STR1 and STR2 (without the surrounding whitespace).

I am trying str_extract(a, "STR1 (.+) STR2"), but I am getting the entire match:

[1] "STR1 GET_ME STR2"

I could of course strip the known strings to isolate the substring I need, but I think there should be a cleaner way to do it with the right regular expression.


Answer 1:


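You may use str_match with STR1 (.*?) STR2. Note that the spaces in the pattern are meaningful: it requires a single space on each side of the captured text. If you just want to match anything between STR1 and STR2, use STR1(.*?)STR2, which keeps any surrounding spaces in the capture. If you have multiple occurrences, use str_match_all.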

library(stringr)
a <- " anything goes here, STR1 GET_ME STR2, anything goes here"
# str_match returns a matrix: column 1 is the full match, column 2 the first capture group
res <- str_match(a, "STR1 (.*?) STR2")
res[,2]
[1] "GET_ME"
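
For the multiple-occurrence case mentioned above, a minimal sketch with str_match_all might look like the following. It assumes the stringr package is installed; the input string b and the variable names are made up for illustration.

```r
library(stringr)

# Illustrative input with two STR1 ... STR2 pairs (not from the question)
b <- "x STR1 GET_ME STR2, y STR1 GET_ME2 STR2"

# str_match_all returns a list with one matrix per input string;
# column 2 of each matrix holds the first capture group of every match
all_res <- str_match_all(b, "STR1 (.*?) STR2")
captures <- all_res[[1]][, 2]
captures
# [1] "GET_ME"  "GET_ME2"
```

The [[1]] is needed because the result is a list, even when only one input string is passed.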

Another way using base R regexec (to get the first match):

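test <- " anything goes here, STR1 GET_ME STR2, anything goes here STR1 GET_ME2 STR2"
pattern <- "STR1 (.*?) STR2"
# regexec reports the first match only; element 1 is the full match, element 2 the capture group
result <- regmatches(test, regexec(pattern, test))
result[[1]][2]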
[1] "GET_ME"
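
Since regexec stops at the first match, one base R option for collecting every occurrence is gregexpr with PCRE lookarounds (perl = TRUE), so that only the text between the markers is matched. This is a sketch, not part of the original answer, and the variable names are illustrative.

```r
test <- " anything goes here, STR1 GET_ME STR2, anything goes here STR1 GET_ME2 STR2"

# (?<=STR1 ) asserts that "STR1 " precedes the match and (?= STR2) that
# " STR2" follows it; neither marker is included in the matched text
lookaround <- "(?<=STR1 ).*?(?= STR2)"

# gregexpr finds all matches; regmatches extracts them as a list (one element per input)
all_matches <- regmatches(test, gregexpr(lookaround, test, perl = TRUE))[[1]]
all_matches
# [1] "GET_ME"  "GET_ME2"
```

Because the lookarounds consume no characters, no capture group is needed and the matches come back as a plain character vector.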



Answer 2:


Here's another way, using base R:

a<-" anything goes here, STR1 GET_ME STR2, anything goes here"

gsub(".*STR1 (.+) STR2.*", "\\1", a)

Output:

[1] "GET_ME"
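
One caveat with the substitution approach: when the pattern does not match, sub and gsub return the input unchanged rather than signalling a failure. The sketch below shows that behaviour and one possible guard. The helper extract_or_na and the string b are hypothetical names introduced here, not part of the original answer.

```r
a <- " anything goes here, STR1 GET_ME STR2, anything goes here"
b <- "no markers in this string"   # illustrative input without STR1/STR2
pattern <- ".*STR1 (.+) STR2.*"

# With no match, sub() leaves the string untouched
unchanged <- sub(pattern, "\\1", b)
unchanged
# [1] "no markers in this string"

# Hypothetical helper: return the capture where the pattern matches, NA otherwise
extract_or_na <- function(x, pattern) {
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

res <- extract_or_na(c(a, b), pattern)
res
# [1] "GET_ME" NA
```

Checking with grepl first makes a missing marker visible as NA instead of silently passing the whole input through.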



Answer 3:


Another option is to use qdapRegex::ex_between to extract strings between left and right boundaries:

qdapRegex::ex_between(a, "STR1", "STR2")[[1]]
#[1] "GET_ME"

It also works with multiple occurrences:

a <- "anything STR1 GET_ME STR2, anything goes here, STR1 again get me STR2"

qdapRegex::ex_between(a, "STR1", "STR2")[[1]]
#[1] "GET_ME"       "again get me"

Or with multiple left and right boundaries:

a <- "anything STR1 GET_ME STR2, anything goes here, STR4 again get me STR5"
qdapRegex::ex_between(a, c("STR1", "STR4"), c("STR2", "STR5"))[[1]]
#[1] "GET_ME"       "again get me"

The first capture is between "STR1" and "STR2", whereas the second is between "STR4" and "STR5".



Source: https://stackoverflow.com/questions/39086400/extracting-a-string-between-other-two-strings-in-r
