R grep: Match one string against multiple patterns

长发绾君心 2020-12-04 18:01

In R, grep usually matches a vector of multiple strings against one regexp.

Q: Is there a possibility to match a single string against multiple regexps? (wi

3 Answers
  • 2020-12-04 18:42

    What about applying the regexpr function over a vector of keywords?

    keywords <- c("dog", "cat", "bird")
    
    strings <- c("Do you have a dog?", "My cat ate my bird.", "Let's get icecream!")
    
    sapply(keywords, regexpr, strings, ignore.case=TRUE)
    
    #      dog cat bird
    # [1,]  15  -1   -1
    # [2,]  -1   4   15
    # [3,]  -1  -1   -1
    
    sapply(keywords, regexpr, strings[1], ignore.case=TRUE)
    
    #  dog  cat bird 
    #   15   -1   -1 
    

    Values returned are the position of the first character in the match, with -1 meaning no match.
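
    For the single-string case from the question, the positions can be turned into the names of the keywords that matched. A minimal sketch, assuming the same keywords as above (the variable names `string`, `pos` and `matched` are only illustrative):

    ```r
    keywords <- c("dog", "cat", "bird")
    string <- "Do you have a dog?"

    # regexpr() returns -1 for "no match", so positive positions are hits
    pos <- sapply(keywords, regexpr, string, ignore.case = TRUE)

    # sapply() names the result after the keywords, so names() gives the hits
    matched <- names(pos)[pos > 0]
    matched
    ```

    This relies on sapply() naming its result after the character vector it iterates over, which is its default behaviour.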

    If the position of the match is irrelevant, use grepl instead:

    sapply(keywords, grepl, strings, ignore.case=TRUE)
    
    #        dog   cat  bird
    # [1,]  TRUE FALSE FALSE
    # [2,] FALSE  TRUE  TRUE
    # [3,] FALSE FALSE FALSE
    

    Update: This runs relatively quickly on my system, even with a large number of keywords:

    # Available on most *nix systems
    words <- scan("/usr/share/dict/words", what="")
    length(words)
    # [1] 234936
    
    system.time(matches <- sapply(words, grepl, strings, ignore.case=TRUE))
    
    #    user  system elapsed 
    #   7.495   0.155   7.596 
    
    dim(matches)
    # [1]      3 234936
    
  • 2020-12-04 18:57

    The re2r package can match multiple patterns (in parallel). A minimal example, reusing keywords and strings from the previous answer:

    # compile patterns
    re <- re2r::re2(keywords)
    # match strings
    re2r::re2_detect(strings, re, parallel = TRUE)
    
  • 2020-12-04 19:02

    To expand on the other answer: to reduce the sapply() output to a useful logical vector, you need a further apply() step.

    keywords <- c("dog", "cat", "bird")
    strings <- c("Do you have a dog?", "My cat ate my bird.", "Let's get icecream!")
    (matches <- sapply(keywords, grepl, strings, ignore.case=TRUE))
    #        dog   cat  bird
    # [1,]  TRUE FALSE FALSE
    # [2,] FALSE  TRUE  TRUE
    # [3,] FALSE FALSE FALSE
    

    To know which strings contain any of the keywords (patterns):

    apply(matches, 1, any)
    # [1]  TRUE  TRUE FALSE
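
    If you also want to know which keywords each string matched, the rows of the matrix can be used to index the keyword vector. A minimal sketch under the same setup (the name `per_string` is only illustrative); lapply() is used here instead of apply() so that the result is always a list, even when every string matches the same number of keywords:

    ```r
    keywords <- c("dog", "cat", "bird")
    strings <- c("Do you have a dog?", "My cat ate my bird.", "Let's get icecream!")
    matches <- sapply(keywords, grepl, strings, ignore.case = TRUE)

    # One list element per string, holding the keywords found in it
    per_string <- lapply(seq_len(nrow(matches)), function(i) keywords[matches[i, ]])
    per_string
    ```

    Strings without any match get an empty character vector.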
    

    To know which keywords (patterns) were matched in the supplied strings:

    apply(matches, 2, any)
    #  dog  cat bird 
    # TRUE TRUE TRUE
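
    If only the per-string "any keyword" result is needed, a common alternative is to collapse the keywords into one alternation pattern and call grepl() once. A minimal sketch, assuming the keywords contain no regex metacharacters (otherwise they would need escaping first); `pattern` and `any_match` are illustrative names:

    ```r
    keywords <- c("dog", "cat", "bird")
    strings <- c("Do you have a dog?", "My cat ate my bird.", "Let's get icecream!")

    # Join the keywords with "|" so a single regex matches any of them
    pattern <- paste(keywords, collapse = "|")
    any_match <- grepl(pattern, strings, ignore.case = TRUE)
    any_match
    ```

    This gives the same information as apply(matches, 1, any), but it no longer tells you which keyword was responsible for a match.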
    