Fast partial string matching in R

前端未结

关注

 2  718

Given a vector of strings texts and a vector of patterns patterns, I want to find any matching pattern for each text.

For small datasets, t

相关标签:

2条回答

一整个雨季

2020-12-14 09:51

Have you accurately characterized your problem and the performance you're seeing? Here are the Complete Works of William Shakespeare and a query against them

text = readLines("~/Downloads/pg100.txt")
pattern <- 
    strsplit("all the world's a stage and all the people players", " ")[[1]]

which seems to be much more performant than you imply?

> length(text)
[1] 124787
> system.time(xx <- lapply(pattern, grepl, text, fixed=TRUE))
   user  system elapsed 
  0.444   0.001   0.444 
## avoid retaining memory; 500 x 500 case; no blank lines
> text = text[nzchar(text)]
> system.time({ for (p in rep(pattern, 50)) grepl(p, text[1:500], fixed=TRUE) })
   user  system elapsed 
  0.096   0.000   0.095

We're expecting linear scaling with both the length (number of elements) of pattern and text. It seems I mis-remember my Shakespeare

> idx = Reduce("+", lapply(pattern, grepl, text, fixed=TRUE))
> range(idx)
[1] 0 7
> sum(idx == 7)
[1] 8
> text[idx == 7]
[1] "    And all the men and women merely players;"                       
[2] "    cicatrices to show the people when he shall stand for his place."
[3] "    Scandal'd the suppliants for the people, call'd them"            
[4] "    all power from the people, and to pluck from them their tribunes"
[5] "    the fashion, and so berattle the common stages (so they call"    
[6] "    Which God shall guard; and put the world's whole strength"       
[7] "    Of all his people and freeze up their zeal,"                     
[8] "    the world's end after my name-call them all Pandars; let all"

0 讨论(0)

难免孤独

2020-12-14 09:54

Use stringi package - it's even faster than grepl. Check the benchmarks! I used text from @Martin-Morgan post

require(stringi)
require(microbenchmark)

text = readLines("~/Desktop/pg100.txt")
pattern <-  strsplit("all the world's a stage and all the people players", " ")[[1]]

grepl_fun <- function(){
    lapply(pattern, grepl, text, fixed=TRUE)
}

stri_fixed_fun <- function(){
    lapply(pattern, function(x) stri_detect_fixed(text,x,NA))
}

#        microbenchmark(grepl_fun(), stri_fixed_fun())
#    Unit: milliseconds
#                 expr      min       lq   median       uq      max neval
#          grepl_fun() 432.9336 435.9666 446.2303 453.9374 517.1509   100
#     stri_fixed_fun() 213.2911 218.1606 227.6688 232.9325 285.9913   100

# if you don't believe me that the results are equal, you can check :)
xx <- grepl_fun()
stri <- stri_fixed_fun()

for(i in seq_along(xx)){
    print(all(xx[[i]] == stri[[i]]))
}

0 讨论(0)