CPU-and-memory efficient NGram extraction with R

Submitted by 雨燕双飞 on 2019-12-01 11:20:26

Try the quanteda package, using the method below. If you just want the tokenized texts, replace the call to dfm() with tokenize().

I'd be very interested to know how it works on your 50,000 street addresses. We've put a lot of effort into making dfm() very fast and robust.

library(quanteda)  # note: this example uses an older quanteda interface

myDfm <- dfm(c("1780 wemmel", "2015 schlemmel"), what = "character", 
             ngram = 1:5, concatenator = "", 
             removePunct = FALSE, removeNumbers = FALSE, 
             removeSeparators = FALSE, verbose = FALSE)
t(myDfm) # for easier viewing
#         docs
# features text1 text2
#           1     1
# s         0     1
# sc        0     1
# sch       0     1
# schl      0     1
# w         1     0
# we        1     0
# wem       1     0
# wemm      1     0
# 0         1     1
# 0         1     0
# 0 w       1     0
# 0 we      1     0
# 0 wem     1     0
# 01        0     1
# 015       0     1
# 015       0     1
# 015 s     0     1
# 1         1     1
# 15        0     1
# 15        0     1
# 15 s      0     1
# 15 sc     0     1
# 17        1     0
# 178       1     0
# 1780      1     0
# 1780      1     0
# 2         0     1
# 20        0     1
# 201       0     1
# 2015      0     1
# 2015      0     1
# 5         0     1
# 5         0     1
# 5 s       0     1
# 5 sc      0     1
# 5 sch     0     1
# 7         1     0
# 78        1     0
# 780       1     0
# 780       1     0
# 780 w     1     0
# 8         1     0
# 80        1     0
# 80        1     0
# 80 w      1     0
# 80 we     1     0
# c         0     1
# ch        0     1
# chl       0     1
# chle      0     1
# chlem     0     1
# e         2     2
# el        1     1
# em        1     1
# emm       1     1
# emme      1     1
# emmel     1     1
# h         0     1
# hl        0     1
# hle       0     1
# hlem      0     1
# hlemm     0     1
# l         1     2
# le        0     1
# lem       0     1
# lemm      0     1
# lemme     0     1
# m         2     2
# me        1     1
# mel       1     1
# mm        1     1
# mme       1     1
# mmel      1     1
# s         0     1
# sc        0     1
# sch       0     1
# schl      0     1
# schle     0     1
# w         1     0
# we        1     0
# wem       1     0
# wemm      1     0
# wemme     1     0
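
If you only need the tokens rather than the document-feature matrix, the same call with tokenize() in place of dfm() should work. Here is a minimal sketch that simply reuses the arguments from the dfm() example above; note that more recent quanteda releases have renamed tokenize() to tokens():

myToks <- tokenize(c("1780 wemmel", "2015 schlemmel"), what = "character",
                   ngram = 1:5, concatenator = "",
                   removePunct = FALSE, removeNumbers = FALSE,
                   removeSeparators = FALSE)
myToks[[1]]  # character n-grams for the first address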

Some of these problems are, to an extent, already solved by the tm library and RWeka (for n-gram tokenization). Have a look at those; they might make your task easier.
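
As a rough sketch of how the two fit together (word-level n-grams here; the corpus and the min/max settings are only illustrative):

library(tm)
library(RWeka)

# Build a small corpus from the example addresses
docs <- VCorpus(VectorSource(c("1780 wemmel", "2015 schlemmel")))

# RWeka's NGramTokenizer produces word-level n-grams (here 1- and 2-grams)
NgramTok <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

# tm stores the counts as a sparse term-document matrix
tdm <- TermDocumentMatrix(docs, control = list(tokenize = NgramTok))
inspect(tdm)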

For running out of memory I see two options:

  1. tm uses sparse matrices, which are an efficient way of storing matrices with many zero elements (see the sketch after this list).

  2. You could also look at the bigmemory package, although I have never used it myself: http://cran.r-project.org/web/packages/bigmemory/index.html
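
To make the first point concrete, here is a small sketch of why sparse storage matters. tm itself uses the slam package's triplet format internally; the Matrix package is used here only to illustrate the memory savings, and the dimensions are made up for illustration:

library(Matrix)

# A 50,000-document x 100,000-feature matrix with about 10 non-zero entries
# per document stays small when stored in sparse form
i <- rep(1:50000, each = 10)                   # document indices
j <- sample(1:100000, 500000, replace = TRUE)  # feature indices
m <- sparseMatrix(i = i, j = j, x = 1, dims = c(50000, 100000))
object.size(m)  # a few megabytes, versus roughly 40 GB for the dense equivalent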

There are lots of ways of speeding up R code. Here's a guide to some of them: http://www.r-bloggers.com/faster-higher-stonger-a-guide-to-speeding-up-r-code-for-busy-people/
