Question
I'm running a simple operation in R over a large dataset, and it is slow and uses too much memory. Here's an example using two rows, although my real dataset has 154 million rows:
library(data.table)
Dt <- data.table(title1 = c("The coolest song ever",
                            "The greatest music in the world"),
                 title2 = c("coolest song", "greatest music"))
Dt$Match <- sapply(seq_len(nrow(Dt)), function(x) grepl(Dt$title2[x], Dt$title1[x]))
The result in Dt$Match should be TRUE, TRUE. Before running this script I have about 12 GB of free RAM, but as this slow code runs, that memory gets eaten up.
Is there a more efficient way to get the same result, perhaps by leveraging the data.table package?
Answer 1:
Use the stringi library; it's more performant.

library(stringi)
stri_detect_fixed(Dt$title1, Dt$title2)

should be what you're looking for.

(Thanks to Frank, who actually found the exact data.table answer: Dt[, stri_detect_fixed(title1, title2)].)

The functions with the _fixed suffix are faster than the _regex ones.
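For completeness, here's a minimal end-to-end sketch on the two-row sample from the question. Storing the result back into a Match column via data.table's := (rather than just printing it, as in the one-liner above) is my own addition, chosen to mirror the Dt$Match column the question builds:

library(data.table)
library(stringi)

Dt <- data.table(title1 = c("The coolest song ever",
                            "The greatest music in the world"),
                 title2 = c("coolest song", "greatest music"))

# stri_detect_fixed() is vectorized over both arguments, so it tests each
# title2 against its corresponding title1 in one pass, with no per-row
# R function calls and no regex compilation.
Dt[, Match := stri_detect_fixed(title1, title2)]

Dt$Match
# [1] TRUE TRUE

This avoids the two costs of the original sapply()/grepl() approach: the R-level loop over every row, and the regex engine, which treats each title2 as a pattern even though only a literal substring match is needed.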
Source: https://stackoverflow.com/questions/33400250/memory-and-performance-using-grepl-on-large-data-table