Extract text between certain symbols using Regular Expression in R

I have a series of expressions such as:

"<i>the text I need to extract</i></b></a></div>"

I need to extract the text between <i> and </i>.

5 Answers

独厮守ぢ

2020-12-05 21:10
If there is only one <i>...</i> as in the example, then match everything up to <i> and everything from </i> forward, and replace both with the empty string:
```
x <- "<i>the text I need to extract</i></b></a></div>"
gsub(".*<i>|</i>.*", "", x)
```
giving:
```
[1] "the text I need to extract"
```
If there could be multiple occurrences in the same string then try:
```
library(gsubfn)
strapplyc(x, "<i>(.*?)</i>", simplify = c)
```
giving the same in this example.
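To illustrate why multiple occurrences need different handling, here is a minimal base R sketch. The string x2 is a made-up example, not taken from the question, and perl = TRUE is my choice for the regex engine rather than something the answer prescribes. It shows that the gsub approach keeps only the last element, that a greedy .* merges the elements into one match, and that the lazy .*? returns one match per element.

```r
# Hypothetical input with two <i>...</i> elements (not from the question)
x2 <- "abc <i>another text</i> def <i>and another text</i> ghi"

# The single-occurrence gsub approach keeps only the last element here,
# because the greedy .* before <i> runs up to the last <i> in the string.
last_only <- gsub(".*<i>|</i>.*", "", x2, perl = TRUE)

# Greedy: .* runs to the last </i>, so both elements end up in one match.
greedy <- regmatches(x2, gregexpr("<i>.*</i>", x2, perl = TRUE))[[1]]

# Lazy: .*? stops at the first </i> it can, giving one match per element.
lazy <- regmatches(x2, gregexpr("<i>.*?</i>", x2, perl = TRUE))[[1]]

# Strip the tags from each lazy match to keep only the inner text.
inner <- gsub("</?i>", "", lazy)
print(inner)
```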
轮回少年

2020-12-05 21:11
This approach uses qdapRegex, a package I maintain. It is not a regex solution as such, but it may be of use to you or to future searchers. The function rm_between lets you extract the text between a left and a right boundary, optionally including the boundaries themselves. It is easy to use because you do not have to come up with a specific regex, only the exact left and right boundaries:
```
library(qdapRegex)

x <- "<i>the text I need to extract</i></b></a></div>"

rm_between(x, "<i>", "</i>", extract=TRUE)

## [[1]]
## [1] "the text I need to extract"
```
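For readers who cannot install a package, the same "exact left and right boundaries" idea can be roughly approximated in base R. This is only a sketch: extract_between is a hypothetical helper name of my own, not part of qdapRegex, and it is not how rm_between is implemented. It assumes the boundaries are plain literal strings that do not themselves contain \E, since it quotes them with PCRE's \Q...\E.

```r
# Hypothetical helper (not part of qdapRegex): return the text between
# literal left/right boundaries, for every occurrence in each string.
extract_between <- function(x, left, right) {
  # \Q...\E makes PCRE treat the boundaries literally; (?s) lets . match newlines.
  pattern <- paste0("(?s)\\Q", left, "\\E.*?\\Q", right, "\\E")
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  # Each hit still includes the boundaries, so trim them off by length.
  lapply(hits, function(h) substring(h, nchar(left) + 1, nchar(h) - nchar(right)))
}

vec <- c("<i>the text I need to extract</i></b></a></div>",
         "abc <i>another text</i> def <i>and another text</i> ghi")
print(extract_between(vec, "<i>", "</i>"))
```

Because the boundaries are quoted, characters such as [ or ( can be passed as they are, without escaping.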
I would point out that it may be more reliable to use an HTML parser for this job.
不知归路

2020-12-05 21:29
```
<i>((?:(?!<\/i>).)*)<\/i>
```
This should do it for you.
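A minimal sketch of how a pattern of this shape, with the opening <i> in front, could be applied in R. Two things here are my assumptions rather than part of the answer: the use of regexec with perl = TRUE (the negative lookahead needs the PCRE engine), and writing the slash unescaped, because "\/" is not a valid escape in an R string literal and / has no special meaning in the regex.

```r
x <- "<i>the text I need to extract</i></b></a></div>"

# Same idea as the pattern above, written as an R string with plain slashes.
# (?:(?!</i>).)* consumes one character at a time, as long as </i> does not start there.
pattern <- "<i>((?:(?!</i>).)*)</i>"

# regexec returns the full match followed by each capture group.
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Element 1 is the whole match, element 2 is the text inside the tags.
print(m[[1]][2])
```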

醉酒成梦

2020-12-05 21:35

You can use the following approach with gregexpr and regmatches if you don't know the number of matches in a string.

```
vec <- c("<i>the text I need to extract</i></b></a></div>",
         "abc <i>another text</i> def <i>and another text</i> ghi")

regmatches(vec, gregexpr("(?<=<i>).*?(?=</i>)", vec, perl = TRUE))
# [[1]]
# [1] "the text I need to extract"
#
# [[2]]
# [1] "another text"     "and another text"
```

不思量自难忘°

2020-12-05 21:35
If this is HTML (which it looks like it is), you should probably use an HTML parser. The XML package can do this:
```
library(XML)
x <- "<i>the text I need to extract</i></b></a></div>"
xmlValue(getNodeSet(htmlParse(x), "//i")[[1]])
# [1] "the text I need to extract"
```
On an entire HTML document, you can use:
```
doc <- htmlParse(x)
sapply(getNodeSet(doc, "//i"), xmlValue)
```