Microbenchmarking base R and three packages on string pattern substitution

Submitted by 妖精的绣舞 on 2021-02-19 06:25:44

Question


My question is whether my method and conclusion are correct.

As part of learning regular expressions, I wanted to decide in which order to learn the various alternatives (base R and packages), and I thought it would help to know the relative speeds of the functions. So I created a string vector and called what I hope are equivalent expressions.

library(stringr)  # str_replace_all
library(stringi)  # stri_replace_all_regex
library(qdap)     # genX

sites <- c("http://grand.test.com/", "https://example.com/",
           "http://.big.time.bhfs.com/", "http://test.blogs.mvalaw.com/")
vec <- rep(x = sites, times = 1000) # creating a longish vector

base <- gsub("http:", "", vec, perl = TRUE)
stringr <- str_replace_all(vec, "http:", replacement = "")
stringi <- stri_replace_all_regex(str = vec, pattern = "http:", replacement = "")
qdap <- genX(text.var = vec, "http:", "")
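
Whether the calls really are equivalent can be checked directly before timing anything. The following is only a minimal sketch, assuming stringr and stringi are installed; the `*_out` names are made up for illustration. qdap is left out because, as far as I understand its documentation, genX removes text between a left and a right marker and so may not be a drop-in equivalent of the other three.

```r
library(stringr)
library(stringi)

sites <- c("http://grand.test.com/", "https://example.com/",
           "http://.big.time.bhfs.com/", "http://test.blogs.mvalaw.com/")

# Run each substitution on the short vector; the *_out names are illustrative
base_out    <- gsub("http:", "", sites, perl = TRUE)
stringr_out <- str_replace_all(sites, "http:", replacement = "")
stringi_out <- stri_replace_all_regex(str = sites, pattern = "http:", replacement = "")

# identical() is TRUE only if the two character vectors match exactly
identical(base_out, stringr_out)
identical(base_out, stringi_out)
```

If either comparison is FALSE, the timings are not comparing like with like.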

Then I benchmarked the four methods using the microbenchmark package.

library(microbenchmark)

test <- microbenchmark(base <- gsub("http:", "", vec, perl = TRUE),
                      stringr <- str_replace_all(vec, "http:", replacement = ""),
                      stringi <- stri_replace_all_regex(str = vec, pattern = "http:", replacement = ""),
                      qdap <- genX(text.var = vec, "http:", ""),
                      times = 100)

Am I correct that base R's gsub is by far the fastest? (I shortened the expr names in the output below.)

 expr           min         lq     median         uq        max neval
 base      1.697001   1.739393   1.765051   1.833770   2.976780   100
 stringr   3.814348   3.928360   3.979453   4.123138   7.032091   100
 stringi   5.888857   6.172212   6.276407   6.500412   7.634943   100
 qdap    120.670037 124.624946 127.493293 130.923663 173.155253   100

The median times differ substantially: relative to base, stringr takes roughly 2.3 times as long, stringi roughly 3.6 times, and qdap roughly 72 times.
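
Ratios like these can also be read straight from the benchmark object rather than worked out by hand. This is a rough sketch, assuming microbenchmark, stringr and stringi are installed; it passes the expressions as named arguments so the expr column gets short labels, and it omits qdap only to keep the example light.

```r
library(microbenchmark)
library(stringr)
library(stringi)

sites <- c("http://grand.test.com/", "https://example.com/",
           "http://.big.time.bhfs.com/", "http://test.blogs.mvalaw.com/")
vec <- rep(x = sites, times = 1000)

# Argument names (base, stringr, stringi) become the labels in the expr column
test <- microbenchmark(
  base    = gsub("http:", "", vec, perl = TRUE),
  stringr = str_replace_all(vec, "http:", replacement = ""),
  stringi = stri_replace_all_regex(str = vec, pattern = "http:", replacement = ""),
  times = 100
)

# unit = "relative" reports each statistic as a multiple of the fastest expression
summary(test, unit = "relative")
```

The absolute numbers will vary by machine and package version, so the relative view is usually the more useful one to compare.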

Source: https://stackoverflow.com/questions/24846611/microbenchmarking-base-r-and-three-packages-on-string-pattern-substitution
