Fast partial string matching in R

粉色の甜心 2020-12-14 09:16

Given a vector of strings, texts, and a vector of patterns, patterns, I want to find any matching pattern for each text.

For small datasets, t

2 Answers
  •  难免孤独
    2020-12-14 09:54

    Use the stringi package: stri_detect_fixed is even faster than grepl with fixed=TRUE, roughly twice as fast in the benchmark below. I used the text from @Martin-Morgan's post.

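    library(stringi)
    library(microbenchmark)
    
    # Corpus to search: one element per line of the file
    text <- readLines("~/Desktop/pg100.txt")
    # Split the sentence into single words and use each word as a fixed pattern
    pattern <- strsplit("all the world's a stage and all the people players", " ")[[1]]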
    
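    # Base R: one logical vector per pattern, literal (non-regex) matching
    grepl_fun <- function() {
        lapply(pattern, grepl, text, fixed = TRUE)
    }
    
    # stringi: the same literal matching via stri_detect_fixed
    stri_fixed_fun <- function() {
        lapply(pattern, function(x) stri_detect_fixed(text, x))
    }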
    
    #        microbenchmark(grepl_fun(), stri_fixed_fun())
    #    Unit: milliseconds
    #                 expr      min       lq   median       uq      max neval
    #          grepl_fun() 432.9336 435.9666 446.2303 453.9374 517.1509   100
    #     stri_fixed_fun() 213.2911 218.1606 227.6688 232.9325 285.9913   100
    
    # if you don't believe me that the results are equal, you can check :)
    xx <- grepl_fun()
    stri <- stri_fixed_fun()
    
    for(i in seq_along(xx)){
        print(all(xx[[i]] == stri[[i]]))
    }
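
The functions above return one logical vector per pattern, while the question asks for a matching pattern per text. A minimal sketch of that last step follows, assuming it is enough to report the first pattern that matches. The sample data and the names hits, first_hit, and matched are made up for illustration and are not from the original post.

```r
library(stringi)

# Hypothetical sample data; the names texts and patterns follow the question
texts    <- c("all the world's a stage", "and all the men and women", "merely players")
patterns <- c("stage", "players", "xyz")

# hits[i, j] is TRUE when texts[i] contains patterns[j] as a literal substring
hits <- do.call(cbind, lapply(patterns, function(p) stri_detect_fixed(texts, p)))

# Index of the first matching pattern for each text, NA when nothing matches
first_hit <- apply(hits, 1, function(row) if (any(row)) which(row)[1] else NA_integer_)

# Map the indices back to the pattern strings
matched <- patterns[first_hit]
print(matched)
```

Note that hits holds length(texts) * length(patterns) logical values, so for large inputs it may be better to loop over patterns and fill in matched incrementally instead of building the full matrix.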
    
