Big Text Corpus breaks tm_map

MHN

I found a solution that works.

Background/Debugging Steps

I tried several things that did not work (roughly sketched after this list):

  • Adding "content_transformer" to some tm_map, to all, to one(totower)
  • Adding "lazy = T" to tm_map
  • Tried some parallel computing packages
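
None of these resolved the error. For reference, the attempts looked roughly like the following; this is only a sketch, with MyCorpus standing in as a placeholder for the corpus object built from the loaded .rda file:

library(tm)

# Attempt 1: wrap base R functions in content_transformer()
# (needed since tm 0.6 for plain functions such as tolower)
MyCorpus <- tm_map(MyCorpus, content_transformer(tolower))

# Attempt 2: lazy evaluation of the mapping
MyCorpus <- tm_map(MyCorpus, content_transformer(tolower), lazy = TRUE)

# Attempt 3: parallel back ends (snow/snowfall) instead of the default -- no effect either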

The error occurs in two of my scripts, while a third script runs every time. Yet the code of all three scripts is identical; only the size of the .rda file I load differs. The data structure is also identical for all three.

  • Dataset 1: Size - 493.3KB = error
  • Dataset 2: Size - 630.6KB = error
  • Dataset 3: Size - 300.2KB = works!

Just weird.

My sessionInfo() output:

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] snowfall_1.84-6    snow_0.3-13        Snowball_0.0-11    RWekajars_3.7.11-1 rJava_0.9-6              RWeka_0.4-23      
[7] slam_0.1-32        SnowballC_0.5.1    tm_0.6             NLP_0.1-5          twitteR_1.1.8      devtools_1.6      

loaded via a namespace (and not attached):
[1] bit_1.1-12     bit64_0.9-4    grid_3.1.2     httr_0.5       parallel_3.1.2 RCurl_1.95-4.3    rjson_0.2.14   stringr_0.6.2 
[9] tools_3.1.2

Solution

I just added this line after loading the data and everything works now:

MyCorpus <- tm_map(MyCorpus,
                   content_transformer(function(x) iconv(x, to = 'UTF-8-MAC', sub = 'byte')),
                   mc.cores = 1)

Found the hint here: http://davetang.org/muse/2013/04/06/using-the-r_twitter-package/ (the author updated his code on November 26, 2014 because of this error).
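
For context, here is a minimal sketch of how the fix fits into a typical tm cleaning pipeline. The file name and corpus object are placeholders rather than my exact script; the key point is that the re-encoding step runs before every other transformation, so invalid byte sequences never reach them:

library(tm)

load("dataset1.rda")   # hypothetical file name; assume it provides MyCorpus, a VCorpus

# Re-encode first, replacing invalid byte sequences instead of failing on them
MyCorpus <- tm_map(MyCorpus,
                   content_transformer(function(x) iconv(x, to = 'UTF-8-MAC', sub = 'byte')),
                   mc.cores = 1)

# The usual cleaning steps now run without the error
MyCorpus <- tm_map(MyCorpus, content_transformer(tolower))
MyCorpus <- tm_map(MyCorpus, removePunctuation)
MyCorpus <- tm_map(MyCorpus, removeWords, stopwords("english"))
MyCorpus <- tm_map(MyCorpus, stripWhitespace)

dtm <- DocumentTermMatrix(MyCorpus)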
