Fast partial string matching in R

粉色の甜心 2020-12-14 09:16

Given a vector of strings texts and a vector of patterns patterns, I want to find any matching pattern for each text.

For small datasets, t

2 answers
  • 2020-12-14 09:51

    Have you accurately characterized your problem and the performance you're seeing? Here are the Complete Works of William Shakespeare and a query against them:

    text = readLines("~/Downloads/pg100.txt")
    pattern <- 
        strsplit("all the world's a stage and all the people players", " ")[[1]]
    

    which seems to be much more performant than you imply?

    > length(text)
    [1] 124787
    > system.time(xx <- lapply(pattern, grepl, text, fixed=TRUE))
       user  system elapsed 
      0.444   0.001   0.444 
    ## avoid retaining memory; 500 x 500 case; no blank lines
    > text = text[nzchar(text)]
    > system.time({ for (p in rep(pattern, 50)) grepl(p, text[1:500], fixed=TRUE) })
       user  system elapsed 
      0.096   0.000   0.095 
    

    We expect the run time to scale linearly with both the number of patterns and the number of elements in text. It seems I misremember my Shakespeare:

    > idx = Reduce("+", lapply(pattern, grepl, text, fixed=TRUE))
    > range(idx)
    [1] 0 7
    > sum(idx == 7)
    [1] 8
    > text[idx == 7]
    [1] "    And all the men and women merely players;"                       
    [2] "    cicatrices to show the people when he shall stand for his place."
    [3] "    Scandal'd the suppliants for the people, call'd them"            
    [4] "    all power from the people, and to pluck from them their tribunes"
    [5] "    the fashion, and so berattle the common stages (so they call"    
    [6] "    Which God shall guard; and put the world's whole strength"       
    [7] "    Of all his people and freeze up their zeal,"                     
    [8] "    the world's end after my name-call them all Pandars; let all"    
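
    Since the question asks for a matching pattern per text, here is a minimal sketch of how the per-pattern grepl results could be combined. The toy vectors texts and patterns below are made-up stand-ins for the real data, and the names hits, first_hit, matched and any_hit are hypothetical; this is one possible approach, not necessarily what the asker needs.

    ```r
    # Toy data standing in for the real vectors (hypothetical values)
    texts    <- c("all the world's a stage", "merely players", "no hit here")
    patterns <- c("world", "players", "stage")

    # Logical matrix: one row per text, one column per pattern.
    # fixed = TRUE treats each pattern as a literal string, not a regex.
    # (Assumes more than one text; with a single text vapply returns a vector.)
    hits <- vapply(patterns, grepl, logical(length(texts)), texts, fixed = TRUE)

    # Index of the first matching pattern for each text, NA when none matches
    first_hit <- apply(hits, 1, function(row) which(row)[1])
    matched   <- patterns[first_hit]

    # TRUE for every text that matches at least one pattern
    any_hit <- rowSums(hits) > 0

    print(data.frame(text = texts, matched = matched, any_hit = any_hit))
    ```

    The matrix needs length(texts) x length(patterns) logicals, so for large inputs it may be better to loop over patterns and update a result vector instead of keeping everything in memory.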
    
  • 2020-12-14 09:54

    Use the stringi package; in the benchmark below it is about twice as fast as grepl. I used the text from @Martin-Morgan's post.

    require(stringi)
    require(microbenchmark)
    
    text = readLines("~/Desktop/pg100.txt")
    pattern <-  strsplit("all the world's a stage and all the people players", " ")[[1]]
    
    grepl_fun <- function(){
        lapply(pattern, grepl, text, fixed=TRUE)
    }
    
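    stri_fixed_fun <- function(){
        # the third positional argument (NA) is dropped: in current stringi
        # releases that slot is `negate`, which must be TRUE or FALSE
        lapply(pattern, function(x) stri_detect_fixed(text, x))
    }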
    
    #        microbenchmark(grepl_fun(), stri_fixed_fun())
    #    Unit: milliseconds
    #                 expr      min       lq   median       uq      max neval
    #          grepl_fun() 432.9336 435.9666 446.2303 453.9374 517.1509   100
    #     stri_fixed_fun() 213.2911 218.1606 227.6688 232.9325 285.9913   100
    
    # if you don't believe me that the results are equal, you can check :)
    xx <- grepl_fun()
    stri <- stri_fixed_fun()
    
    for(i in seq_along(xx)){
        print(all(xx[[i]] == stri[[i]]))
    }
    