I wrote an algorithm that extracts n-grams (bigrams, trigrams, ..., up to 5-grams) from a list of 50,000 street addresses. My goal is to have, for each address, a boolean vector indicating whether or not each n-gram is present in the address. Each address is then characterized by a vector of attributes, on which I can run a clustering.

The algorithm works as follows: I start with the bigrams and generate all combinations of the characters a-z, 0-9, "/" and the tab/separator character, for example: aa, ab, ac, ..., a8, a9, a/, "a ", ba, bb, ... Then I loop over the addresses and record, for every bigram, a 0 or a 1 (bigram absent or present). Next, I build the trigrams only from the bigrams that occur most often, and so on.

My problem is the time the algorithm takes to run. Another problem: R reaches its maximum capacity once there are more than 10,000 n-grams, which is not surprising, since a 50,000 x 10,000 matrix is huge.

I need your ideas to optimize this algorithm or to replace it with a better one. Thank you.
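For reference, a minimal sketch of the bigram step described above (variable names and the two example addresses are illustrative only; a space stands in for the tab/separator character, and grepl() is just one possible presence test):

alphabet <- c(letters, 0:9, "/", " ")            # a-z, 0-9, "/" and a separator

# every candidate bigram: "aa", "ab", ..., 38^2 = 1444 combinations
bigrams <- as.vector(outer(alphabet, alphabet, paste0))

addresses <- c("1780 wemmel", "2015 schlemmel")  # stand-in for the 50,000 addresses

# 0/1 matrix: one row per address, one column per candidate bigram
presence <- sapply(bigrams,
                   function(bg) as.integer(grepl(bg, addresses, fixed = TRUE)))
rownames(presence) <- addresses
dim(presence)                                    # 2 x 1444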
Try the quanteda package, using this method. If you just want tokenized texts, replace the dfm() call with tokenize().
I'd be very interested to know how it works on your 50,000 street addresses. We've put a lot of effort into making dfm() very fast and robust.
myDfm <- dfm(c("1780 wemmel", "2015 schlemmel"), what = "character",
ngram = 1:5, concatenator = "",
removePunct = FALSE, removeNumbers = FALSE,
removeSeparators = FALSE, verbose = FALSE)
t(myDfm) # for easier viewing
# (feature names are quoted here so that n-grams containing the space character are visible)
#           docs
# features      text1 text2
#   " "          1     1
#   " s"         0     1
#   " sc"        0     1
#   " sch"       0     1
#   " schl"      0     1
#   " w"         1     0
#   " we"        1     0
#   " wem"       1     0
#   " wemm"      1     0
#   "0"          1     1
#   "0 "         1     0
#   "0 w"        1     0
#   "0 we"       1     0
#   "0 wem"      1     0
#   "01"         0     1
#   "015"        0     1
#   "015 "       0     1
#   "015 s"      0     1
#   "1"          1     1
#   "15"         0     1
#   "15 "        0     1
#   "15 s"       0     1
#   "15 sc"      0     1
#   "17"         1     0
#   "178"        1     0
#   "1780"       1     0
#   "1780 "      1     0
#   "2"          0     1
#   "20"         0     1
#   "201"        0     1
#   "2015"       0     1
#   "2015 "      0     1
#   "5"          0     1
#   "5 "         0     1
#   "5 s"        0     1
#   "5 sc"       0     1
#   "5 sch"      0     1
#   "7"          1     0
#   "78"         1     0
#   "780"        1     0
#   "780 "       1     0
#   "780 w"      1     0
#   "8"          1     0
#   "80"         1     0
#   "80 "        1     0
#   "80 w"       1     0
#   "80 we"      1     0
#   "c"          0     1
#   "ch"         0     1
#   "chl"        0     1
#   "chle"       0     1
#   "chlem"      0     1
#   "e"          2     2
#   "el"         1     1
#   "em"         1     1
#   "emm"        1     1
#   "emme"       1     1
#   "emmel"      1     1
#   "h"          0     1
#   "hl"         0     1
#   "hle"        0     1
#   "hlem"       0     1
#   "hlemm"      0     1
#   "l"          1     2
#   "le"         0     1
#   "lem"        0     1
#   "lemm"       0     1
#   "lemme"      0     1
#   "m"          2     2
#   "me"         1     1
#   "mel"        1     1
#   "mm"         1     1
#   "mme"        1     1
#   "mmel"       1     1
#   "s"          0     1
#   "sc"         0     1
#   "sch"        0     1
#   "schl"       0     1
#   "schle"      0     1
#   "w"          1     0
#   "we"         1     0
#   "wem"        1     0
#   "wemm"       1     0
#   "wemme"      1     0
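If the end goal is the boolean presence/absence vectors described in the question, one possible follow-up is sketched below. This is not part of the original answer: the clustering step is only an example, and densifying the matrix with as.matrix() may be too memory-hungry for 50,000 addresses (the dfm itself is stored sparsely).

# convert the n-gram counts to presence/absence (TRUE/FALSE)
bool_mat <- as.matrix(myDfm) > 0

# example clustering on binary (Jaccard-style) distances
d  <- dist(bool_mat, method = "binary")
hc <- hclust(d)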
Some of these problems are, to an extent, already solved by the tm library and RWeka (for n-gram tokenization). Have a look at those; they might make your task easier.
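As a rough sketch of what that could look like (not from the answer; note that RWeka's NGramTokenizer produces word-level n-grams rather than the character n-grams in the question, it requires Java, and the two example addresses are stand-ins):

library(tm)
library(RWeka)

addresses <- c("1780 wemmel", "2015 schlemmel")  # stand-in for the 50,000 addresses
corp <- VCorpus(VectorSource(addresses))

# word-level 1- to 5-grams via RWeka
ngram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 5))

dtm <- DocumentTermMatrix(corp,
                          control = list(tokenize    = ngram_tokenizer,
                                         weighting   = weightBin,     # 0/1 presence
                                         wordLengths = c(1, Inf)))    # keep short terms
inspect(dtm)  # the DTM is stored as a sparse matrix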
For running out of memory, I see two options:

1. tm uses sparse matrices, which are an efficient way of storing matrices with many zero elements (a sketch of the sparse-storage idea follows this list).
2. You could also look at the bigmemory package, although I've never used it: http://cran.r-project.org/web/packages/bigmemory/index.html
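To illustrate the sparse-storage idea from point 1, here is a minimal sketch. It is not from the answer and uses the Matrix package rather than tm's internal format; the address and n-gram vectors are illustrative stand-ins.

library(Matrix)

addresses <- c("1780 wemmel", "2015 schlemmel")  # stand-in for the 50,000 addresses
ngrams    <- c("17", "80", "sch", "mmel")        # stand-in for the n-gram vocabulary

# indices of the addresses that contain each n-gram
hits <- lapply(ngrams, function(g) grep(g, addresses, fixed = TRUE))

# sparse 0/1 matrix: only the non-zero entries are stored
m <- sparseMatrix(i = unlist(hits),
                  j = rep(seq_along(ngrams), lengths(hits)),
                  x = 1,
                  dims = c(length(addresses), length(ngrams)),
                  dimnames = list(addresses, ngrams))
m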
There are lots of ways of speeding up R code. Here's a guide to some of them: http://www.r-bloggers.com/faster-higher-stonger-a-guide-to-speeding-up-r-code-for-busy-people/
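As one example of the kind of speed-up such guides suggest for this task (a sketch with illustrative stand-in data, not taken from the linked guide): let grepl() scan all addresses at once for each n-gram, with fixed = TRUE to skip regex interpretation, instead of looping over every address/n-gram pair in R.

addresses <- c("1780 wemmel", "2015 schlemmel")  # stand-in for the 50,000 addresses
ngrams    <- c("17", "80", "sch", "mmel")        # stand-in for the n-gram vocabulary

# one vectorised grepl() call per n-gram instead of a double loop
presence <- vapply(ngrams, grepl, logical(length(addresses)),
                   x = addresses, fixed = TRUE)
presence  # logical matrix: rows = addresses, columns = n-grams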
Source: https://stackoverflow.com/questions/31424687/cpu-and-memory-efficient-ngram-extraction-with-r