Extract text between certain symbols using Regular Expression in R

刺人心 2020-12-05 20:35

I have a series of expressions such as:

"<i>the text I need to extract</i></b></a></div>"

I need to extrac

5 Answers
  • 2020-12-05 21:10

    If there is only one <i>...</i> pair, as in the example, match everything up to and including <i> and everything from </i> onward, and replace both with the empty string:

    x <- "<i>the text I need to extract</i></b></a></div>"
    gsub(".*<i>|</i>.*", "", x)
    

    giving:

    [1] "the text I need to extract"
    

    If there could be multiple occurrences in the same string then try:

    library(gsubfn)
    strapplyc(x, "<i>(.*?)</i>", simplify = c)
    

    giving the same in this example.
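
    To illustrate why the single-replacement pattern only suits strings with one <i>...</i> pair, here is a small base R sketch. The input x2 is a made-up example, not taken from the question. Because ".*<i>" is greedy, it runs to the last <i>, so any earlier tagged text is dropped:

```r
# Hypothetical input with two <i>...</i> pairs (not from the question)
x2 <- "abc <i>another text</i> def <i>and another text</i> ghi"

# Greedy ".*<i>" consumes everything up to the LAST <i>,
# so only the final tagged text survives the replacement
single <- gsub(".*<i>|</i>.*", "", x2)
print(single)
```

    For inputs like this, a non-greedy pattern that returns every match is the safer choice.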

  • 2020-12-05 21:11

    This approach uses a package I maintain, qdapRegex. It isn't a regex answer as such, but it may be of use to you or future searchers. The function rm_between lets the user extract text between a left and a right boundary and optionally include the boundaries. The approach is easy in that you don't have to think of a specific regex, just the exact left and right boundaries:

    library(qdapRegex)
    
    x <- "<i>the text I need to extract</i></b></a></div>"
    
    rm_between(x, "<i>", "</i>", extract=TRUE)
    
    ## [[1]]
    ## [1] "the text I need to extract"
    

    I would point out that it may be more reliable to use an HTML parser for this job.

  • 2020-12-05 21:29
    <i>((?:(?!<\/i>).)*)<\/i>
    

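    This should do it for you. The capture group collects every character after <i> up to the first </i>; because the pattern relies on a lookahead, it has to be used with perl = TRUE in R.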
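
    As a hedged sketch of how this pattern might be applied in R: the "\/" is written as a plain "/", since "\/" is not a valid escape inside an R string literal and the slash needs no escaping in the regex anyway. The choice of regexec with regmatches is an assumption of this sketch, not part of the answer, and the perl argument of regexec assumes R 3.3.0 or later:

```r
x <- "<i>the text I need to extract</i></b></a></div>"

# Same pattern as above, with "\/" written as "/"
pattern <- "<i>((?:(?!</i>).)*)</i>"

# regexec() returns the full match followed by each capture group;
# perl = TRUE is required because the pattern uses a lookahead
m <- regmatches(x, regexec(pattern, x, perl = TRUE))
captured <- m[[1]][2]
print(captured)
```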

  • 2020-12-05 21:35

    You can use the following approach with gregexpr and regmatches if you don't know the number of matches in a string.

    vec <- c("<i>the text I need to extract</i></b></a></div>",
             "abc <i>another text</i> def <i>and another text</i> ghi")
    
    regmatches(vec, gregexpr("(?<=<i>).*?(?=</i>)", vec, perl = TRUE))
    # [[1]]
    # [1] "the text I need to extract"
    # 
    # [[2]]
    # [1] "another text"     "and another text"
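
    The same lookaround idea could be wrapped in a small helper for arbitrary boundaries. This is only a sketch: extract_between is a hypothetical name, and it assumes the boundaries do not themselves contain "\E", because they are quoted with PCRE's \Q...\E so that characters such as parentheses are treated literally:

```r
# Hypothetical helper: return all text found between `left` and `right`
extract_between <- function(x, left, right) {
  # \Q...\E makes PCRE treat the boundaries as literal text
  pattern <- paste0("(?<=\\Q", left, "\\E).*?(?=\\Q", right, "\\E)")
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

vec <- c("<i>the text I need to extract</i></b></a></div>",
         "abc <i>another text</i> def <i>and another text</i> ghi")
res <- extract_between(vec, "<i>", "</i>")
print(res)
```

    The result is a list with one character vector per input string, matching the shape returned by the regmatches call above.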
    
  • 2020-12-05 21:35

    If this is HTML (which it looks like it is), you should probably use an HTML parser. The XML package can do this:

    library(XML)
    x <- "<i>the text I need to extract</i></b></a></div>"
    xmlValue(getNodeSet(htmlParse(x), "//i")[[1]])
    # [1] "the text I need to extract"
    

    On an entire HTML document, you can use:

    doc <- htmlParse(x)
    sapply(getNodeSet(doc, "//i"), xmlValue)
    