If there is only one <i>...</i> as in the example, then match everything up to <i> and everything from </i> forward, and replace them both with the empty string:
x <- "<i>the text I need to extract</i></b></a></div>"
gsub(".*<i>|</i>.*", "", x)
giving:
[1] "the text I need to extract"
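To see why the single-occurrence assumption matters, here is a small sketch on a made-up string (x2 is a hypothetical input, not from the question). Because .* is greedy, the first alternative swallows everything up to the last <i>, so only the final span survives:

```r
# x2 is a hypothetical input containing two <i>...</i> spans
x2 <- "abc <i>another text</i> def <i>and another text</i> ghi"

# Same pattern as above: the greedy ".*<i>" consumes the first span entirely
res <- gsub(".*<i>|</i>.*", "", x2)
res
# [1] "and another text"
```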
If there could be multiple occurrences in the same string then try:
library(gsubfn)
strapplyc(x, "<i>(.*?)</i>", simplify = c)
giving the same in this example.
This approach uses qdapRegex, a package I maintain. It isn't a hand-written regex, but it may be of use to you or future searchers. The function rm_between allows the user to extract text between a left and right bound and optionally include the bounds. This approach is easy in that you don't have to think of a specific regex, just the exact left and right boundaries:
library(qdapRegex)
x <- "<i>the text I need to extract</i></b></a></div>"
rm_between(x, "<i>", "</i>", extract=TRUE)
## [[1]]
## [1] "the text I need to extract"
I would point out that it may be more reliable to use an HTML parser for this job.
<i>((?:(?!<\/i>).)*)<\/i>
This should do it for you. In R it needs perl = TRUE, because the pattern relies on a negative lookahead.
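As a rough sketch of how that pattern might be used from base R (the variable names and the second test string are assumptions for illustration): the `\/` escapes are dropped because a plain / needs no escaping in an R string, and perl = TRUE is passed since the default regex engine does not support lookaheads.

```r
# Tempered pattern: "." is consumed only while "</i>" does not start here
pattern <- "<i>((?:(?!</i>).)*)</i>"

# Single occurrence: keep only the captured group
x <- "<i>the text I need to extract</i></b></a></div>"
one <- sub(paste0(".*?", pattern, ".*"), "\\1", x, perl = TRUE)

# Several occurrences (hypothetical input): take the full matches,
# then strip the surrounding tags
y <- "abc <i>another text</i> def <i>and another text</i> ghi"
full <- regmatches(y, gregexpr(pattern, y, perl = TRUE))[[1]]
many <- gsub("^<i>|</i>$", "", full)
```

Here one should hold the single extracted string and many a character vector with one element per <i>...</i> span.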
You can use the following approach with gregexpr and regmatches if you don't know the number of matches in a string.
vec <- c("<i>the text I need to extract</i></b></a></div>",
"abc <i>another text</i> def <i>and another text</i> ghi")
regmatches(vec, gregexpr("(?<=<i>).*?(?=</i>)", vec, perl = TRUE))
# [[1]]
# [1] "the text I need to extract"
#
# [[2]]
# [1] "another text" "and another text"
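If the same lookaround idea is needed for other tags, it could be wrapped in a small helper. This is only a sketch: extract_tag_text is a hypothetical name, and it assumes a plain alphanumeric tag name with no attributes and no line breaks inside the tag body.

```r
# Hypothetical helper: text between <tag> and </tag> for each input string
extract_tag_text <- function(strings, tag = "i") {
  # Lookbehind and lookahead keep the tags themselves out of the match
  pattern <- sprintf("(?<=<%s>).*?(?=</%s>)", tag, tag)
  regmatches(strings, gregexpr(pattern, strings, perl = TRUE))
}

vec <- c("<i>the text I need to extract</i></b></a></div>",
         "abc <i>another text</i> def <i>and another text</i> ghi")
out <- extract_tag_text(vec, "i")
```

The result is a list with one character vector per input string, mirroring what regmatches returns above.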
If this is HTML (which it looks like it is), you should probably use an HTML parser. The XML package can do this:
library(XML)
x <- "<i>the text I need to extract</i></b></a></div>"
xmlValue(getNodeSet(htmlParse(x), "//i")[[1]])
# [1] "the text I need to extract"
On an entire HTML document, you can use:
doc <- htmlParse(x)
sapply(getNodeSet(doc, "//i"), xmlValue)