R grepl: quickly match multiple strings against multiple substrings, returning all matches

Submitted by 被刻印的时光 ゝ on 2019-12-22 09:29:56

Question


I have a fairly large set of strings in R:

set.seed(42)
strings <- sapply(1:250000, function(x) sample(2:20, 1, prob=c(
  0.001, 0.006, 0.021, 0.043, 0.075, 0.101, 0.127, 
  0.138, 0.132, 0.111, 0.087, 0.064, 0.042, 0.025, 0.014, 0.008, 
  0.004, 0.002, 0.001)))
strings <- lapply(strings, function(x) sample(letters, x, replace=TRUE))
strings <- sapply(strings, paste, collapse='')

I would like to build a list that records, for each substring in a list of substrings, which of these strings contain it. My starting point, of course, is some code from Stack Overflow:

#0.1 seconds
substrings <- sample(strings, 10)
system.time(matches <- lapply(substrings, grepl, strings, fixed=TRUE)) 

However, this approach is somewhat naive for larger sets of substrings, as it stores a full logical vector for each substring (every match and every non-match, one value per string):

#13 seconds
substrings <- sample(strings, 1000)
system.time(matches <- lapply(substrings, grepl, strings, fixed=TRUE)) 

We can reduce the size of the output object by only storing the matches:

#13 seconds
substrings <- sample(strings, 1000)
system.time(matches <- lapply(substrings, function(x) which(grepl(x, strings, fixed=TRUE))))
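
As an aside, base R's `grep()` returns the matching indices directly, so it should be interchangeable with `which(grepl(...))` here. A minimal sketch on made-up toy data (`toy_strings` and `toy_pattern` are illustrative names, not part of the code above):

```r
# Toy data, purely illustrative (not the 250,000-string set above)
toy_strings <- c("apple", "grape", "pineapple", "banana", "maple")
toy_pattern <- "apple"

# Logical form: one TRUE/FALSE per string, matches and non-matches alike
as_logical <- grepl(toy_pattern, toy_strings, fixed = TRUE)

# Index form: only the positions of the matching strings
via_which <- which(as_logical)
via_grep  <- grep(toy_pattern, toy_strings, fixed = TRUE)

identical(via_which, via_grep)  # both are c(1L, 3L)
```

The logical form always has one entry per string, while the index form only grows with the number of matches. This changes what is stored, not how much searching is done.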

But this is still slow for large numbers of substrings:

#316 seconds
substrings <- sample(strings, 25000)
system.time(matches <- lapply(substrings, function(x) which(grepl(x, strings, fixed=TRUE))))

It's nice that the time grows roughly linearly with the number of substrings, but I feel like there has to be a much faster way to accomplish this task, perhaps by avoiding the lapply loop.

How can I speed up this many-to-many string matching function?
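
One direction that might reduce the work, sketched here without any benchmark: a string shorter than the substring can never contain it, so such strings can be skipped. The helper name `match_longer_only` and the toy data are assumptions made for illustration. It still uses `lapply`, so this is a pre-filter rather than a fundamentally different algorithm.

```r
# Hypothetical helper, not from the original post: skip strings that are
# shorter than the substring, since they cannot contain it.
match_longer_only <- function(substrings, strings) {
  ord <- order(nchar(strings), decreasing = TRUE)  # longest strings first
  sorted <- strings[ord]
  sorted_len <- nchar(sorted)
  lapply(substrings, function(x) {
    n <- sum(sorted_len >= nchar(x))               # candidates are the first n
    hits <- grep(x, sorted[seq_len(n)], fixed = TRUE)
    sort(ord[hits])                                # map back to original positions
  })
}

# Small check against the straightforward version (toy sizes, not a benchmark)
set.seed(1)
toy_strings <- vapply(
  sample(2:20, 2000, replace = TRUE),
  function(n) paste(sample(letters, n, replace = TRUE), collapse = ""),
  character(1)
)
toy_substrings <- sample(toy_strings, 50)

baseline <- lapply(toy_substrings,
                   function(x) which(grepl(x, toy_strings, fixed = TRUE)))
filtered <- match_longer_only(toy_substrings, toy_strings)
identical(baseline, filtered)
```

Whether this helps in practice depends on the length distribution of the substrings. With many short substrings, almost every string remains a candidate and little is saved.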

/edit: One easy speedup is parallelization:

#Takes about 99 seconds
library(doParallel)  # also attaches foreach and parallel
cl <- makeForkCluster(nnodes=8)  # fork clusters are not available on Windows
registerDoParallel(cl)
system.time(matches <- foreach(i=seq_along(substrings)) %dopar% {
  which(grepl(substrings[i], strings, fixed=TRUE))
})
stopCluster(cl)

However, I think most solutions to this problem will be easy to parallelize, once a fast serial algorithm has been found.
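
To illustrate that point, here is a minimal sketch of swapping `lapply` for `parallel::mclapply` around a per-substring function. The toy data and the choice of two cores are assumptions for illustration, and nothing here has been timed.

```r
library(parallel)

# Toy data (illustrative only, much smaller than the set above)
set.seed(1)
toy_strings <- vapply(
  sample(2:20, 2000, replace = TRUE),
  function(n) paste(sample(letters, n, replace = TRUE), collapse = ""),
  character(1)
)
toy_substrings <- sample(toy_strings, 50)

# mclapply() relies on forking, so use more than one core only on Unix-alikes
n_cores <- if (.Platform$OS.type == "unix") 2L else 1L

# Any serial per-substring function can be used in place of this one
find_matches <- function(x) grep(x, toy_strings, fixed = TRUE)

par_matches    <- mclapply(toy_substrings, find_matches, mc.cores = n_cores)
serial_matches <- lapply(toy_substrings, find_matches)
identical(par_matches, serial_matches)
```

Because each substring is processed independently, the parallel and serial results should agree element by element.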

Source: https://stackoverflow.com/questions/22489996/r-grepl-quickly-match-multiple-strings-against-multiple-substrings-returning-a
