Memory and Performance using grepl on large data.table [duplicate]

Submitted on 2020-01-04 05:27:12

Question


I'm running a simple operation in R over a large dataset, and it's slow and uses too much memory. Here's an example using two rows, although my real dataset has 154 million rows:

library(data.table)
Dt <- data.table(title1 = c("The coolest song ever",
                            "The greatest music in the world"),
                 title2 = c("coolest song", "greatest music"))

Dt$Match <- sapply(seq_len(nrow(Dt)),
                   function(x) grepl(Dt$title2[x], Dt$title1[x]))

The expected result for Dt$Match is TRUE, TRUE. Before running this script I have about 12 GB of RAM free, but as this slow code runs, that memory gets eaten up.

Is there a more efficient way to get the same result, perhaps by leveraging the data.table package?


Answer 1:


Use the stringi library; it's much faster.

stri_detect_fixed(Dt$title1, Dt$title2) should be what you're looking for.

Thanks to Frank, who found the exact data.table version:

Dt[, stri_detect_fixed(title1, title2)]

The stringi functions with the _fixed suffix are faster than their _regex counterparts.
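
For a table this size, how the result column is created also matters. Here's a minimal end-to-end sketch (the column name Match is taken from the question; := assigns by reference, so data.table does not copy the whole 154-million-row table the way Dt$Match <- ... can):

library(data.table)
library(stringi)

Dt <- data.table(title1 = c("The coolest song ever",
                            "The greatest music in the world"),
                 title2 = c("coolest song", "greatest music"))

# stri_detect_fixed is vectorized over both arguments, so the
# row-wise sapply loop goes away entirely; fixed-string matching
# also skips the regex engine that grepl invokes on every row.
Dt[, Match := stri_detect_fixed(title1, title2)]

Dt$Match
# [1] TRUE TRUE

(Base grepl does accept fixed = TRUE, but it is not vectorized over the pattern argument, which is why the sapply loop was needed in the first place.)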



Source: https://stackoverflow.com/questions/33400250/memory-and-performance-using-grepl-on-large-data-table
