I have been trying to implement the algorithm recently proposed in this paper. Given a large amount of text (corpus), the algorithm is supposed to return chara
The following runs in under 7 seconds on my machine, for all the bigrams:
library(dplyr)
## join the bigram table (x) to the trigrams (y) that start with the same two words
res <- inner_join(token.df[[2]], token.df[[3]], by = c('w1','w2'))
## for each bigram, keep it only if every matching trigram has a lower mi
res <- group_by(res, w1, w2)
bigrams <- filter(summarise(res, keep = all(mi.y < mi.x)), keep)
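For concreteness, here's a hypothetical shape for token.df that the code above assumes (the column names w1, w2 and mi come from the code; the values below are made up):
## made-up bigram and trigram tables with mi scores, just to show the expected shape
token.df <- list()
token.df[[2]] <- data.frame(w1 = c("new", "new"), w2 = c("york", "deal"),
                            mi = c(7.1, 4.2))
token.df[[3]] <- data.frame(w1 = c("new", "new", "new"),
                            w2 = c("york", "york", "deal"),
                            w3 = c("city", "times", "breaker"),
                            mi = c(6.0, 6.5, 5.0))
With these values, ("new", "york") is kept because both matching trigrams have lower mi, while ("new", "deal") is dropped because 5.0 >= 4.2.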
There's nothing special about dplyr here; an equally fast (or faster) solution could be written with data.table or directly in SQL. The key is to switch to joins (as in SQL) rather than iterating through everything yourself. In fact, I wouldn't be surprised if plain merge followed by aggregate in base R were already orders of magnitude faster than what you're doing now (a rough sketch follows), though you really should be doing this with data.table, dplyr, or a SQL database.
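For reference, that base-R sketch might look like this (untested; it assumes the same w1, w2 and mi columns as above):
## merge bigrams and trigrams on the first two words; merge() suffixes the
## clashing mi columns as mi.x (bigram) and mi.y (trigram)
m <- merge(token.df[[2]], token.df[[3]], by = c("w1", "w2"))
m$keep <- m$mi.y < m$mi.x
## a bigram survives only if every matching trigram has lower mi
agg <- aggregate(keep ~ w1 + w2, data = m, FUN = all)
bigrams_base <- agg[agg$keep, c("w1", "w2")]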
Indeed, this data.table version:
library(data.table)
## key both tables on the first two words, then join trigrams to bigrams
dt2 <- setkey(data.table(token.df[[2]]), w1, w2)
dt3 <- setkey(data.table(token.df[[3]]), w1, w2)
## i.mi is dt2's mi column in the join result (older data.table versions exposed it as mi.1)
dt_tmp <- dt3[dt2, allow.cartesian = TRUE][, list(k = all(mi < i.mi)), by = c('w1','w2')][(k)]
is faster still (roughly 2x). I'm not sure I've squeezed all the speed I could out of either package, to be honest.
(Edit from Rick. I attempted this as a comment, but the syntax was getting messed up.)
If using data.table, this should be even faster, since the grouping can happen inside the join itself (the old "by-without-by" behavior; current data.table versions ask for it explicitly with by = .EACHI, see ?data.table for details):
dt_tmp <- dt3[dt2, list(k = all(mi < i.mi)), by = .EACHI, allow.cartesian = TRUE][(k)]
Note that when joining data.tables you can prefix a column name with i. to refer specifically to the column from the data.table passed in the i argument (here dt2).
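A tiny made-up illustration of that prefix (x and y below are hypothetical tables, not part of the question):
library(data.table)
x <- data.table(w1 = c("a", "a"), mi = c(1, 2), key = "w1")
y <- data.table(w1 = "a", mi = 10, key = "w1")
x[y, list(mi, i.mi)]   # mi is x's column; i.mi is y's, because y sits in the i= position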