Fast partial string matching in R

粉色の甜心 2020-12-14 09:16

Given a vector of strings, texts, and a vector of patterns, patterns, I want to find any matching pattern for each text.

For small datasets, t

2 answers
  •  一整个雨季
    2020-12-14 09:51

    Have you accurately characterized your problem and the performance you're seeing? Here are the Complete Works of William Shakespeare and a query against them:

    text = readLines("~/Downloads/pg100.txt")
    pattern <- 
        strsplit("all the world's a stage and all the people players", " ")[[1]]
    

    which seems to be much more performant than you imply?

    > length(text)
    [1] 124787
    > system.time(xx <- lapply(pattern, grepl, text, fixed=TRUE))
       user  system elapsed 
      0.444   0.001   0.444 
    ## avoid retaining memory; 500 x 500 case; no blank lines
    > text = text[nzchar(text)]
    > system.time({ for (p in rep(pattern, 50)) grepl(p, text[1:500], fixed=TRUE) })
       user  system elapsed 
      0.096   0.000   0.095 
    

    We'd expect run time to scale linearly with the number of elements in both pattern and text. It seems I misremember my Shakespeare:

    > idx = Reduce("+", lapply(pattern, grepl, text, fixed=TRUE))
    > range(idx)
    [1] 0 7
    > sum(idx == 7)
    [1] 8
    > text[idx == 7]
    [1] "    And all the men and women merely players;"                       
    [2] "    cicatrices to show the people when he shall stand for his place."
    [3] "    Scandal'd the suppliants for the people, call'd them"            
    [4] "    all power from the people, and to pluck from them their tribunes"
    [5] "    the fashion, and so berattle the common stages (so they call"    
    [6] "    Which God shall guard; and put the world's whole strength"       
    [7] "    Of all his people and freeze up their zeal,"                     
    [8] "    the world's end after my name-call them all Pandars; let all"    
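
The per-pattern grepl() results above can also be combined to answer the original question, i.e. to pick one matching pattern for each text. The following is only a minimal sketch under stated assumptions: the toy texts and patterns vectors are hypothetical stand-ins for the real data, and "any matching pattern" is taken to mean the first pattern, in the order given, that occurs as a fixed substring.

```r
# Hypothetical toy data standing in for the real `texts` and `patterns`.
texts <- c("all the world's a stage", "merely players", "exits and entrances")
patterns <- c("stage", "players", "world")

# One logical column per pattern: hits[i, j] is TRUE when patterns[j]
# occurs as a fixed substring of texts[i].
hits <- do.call(cbind, lapply(patterns, grepl, texts, fixed = TRUE))

# Column index of the first matching pattern in each row.
# Rows with no match would otherwise report column 1, so set them to NA.
first_hit <- max.col(hits + 0L, ties.method = "first")
first_hit[rowSums(hits) == 0] <- NA_integer_

# One matching pattern per text, NA where nothing matched.
any_match <- patterns[first_hit]
print(any_match)
```

This still runs one grepl() pass per pattern, so the cost should grow with length(patterns) times length(texts), in line with the timings above; the logical matrix also needs that much memory, which may matter for very large inputs.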
    
