R tm package invalid input in 'utf8towcs'

逝去的感伤 2020-11-29 01:47

I'm trying to use the tm package in R to perform some text analysis. I tried the following:

require(tm)
dataSet <- Corpus(DirSource('tmp/'))
dataSet <         


        
14 answers
  •  [愿得一人]
    2020-11-29 01:59

    I have just run afoul of this problem. Are you by any chance using a machine running OS X? I am, and I seem to have traced the problem to the definition of the character set that R is compiled against on this operating system (see https://stat.ethz.ch/pipermail/r-sig-mac/2012-July/009374.html).

    What I was seeing is that using the solution from the FAQ

    tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))
    

    was giving me this warning:

    Warning message:
    it is not known that wchar_t is Unicode on this platform 
    

    I traced this to the enc2utf8 function. The bad news is that this is a problem with my underlying OS and not with R.

    So here is what I did as a workaround:

    tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))
    

    This forces iconv to use the Mac variant of the UTF-8 encoding (UTF-8-MAC) and works fine without the need to recompile.
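
For anyone who wants to try the idea outside of tm first, here is a minimal base R sketch of the same workaround. The helper name `to_utf8`, the sample strings, and the explicit `from = "UTF-8"` are my own assumptions for illustration, not part of the original answer; whether 'UTF-8-MAC' is accepted also depends on the iconv library your R build uses.

```r
# Hypothetical helper: re-encode text to UTF-8, replacing bytes that cannot
# be converted with their hex code in angle brackets (sub = "byte").
to_utf8 <- function(x) {
  # Assumption: use 'UTF-8-MAC' on macOS (as in the answer above), 'UTF-8' elsewhere.
  target <- if (Sys.info()[["sysname"]] == "Darwin") "UTF-8-MAC" else "UTF-8"
  # Assumption: the input is meant to be UTF-8; adjust 'from' if it is not.
  iconv(x, from = "UTF-8", to = target, sub = "byte")
}

# Made-up sample: the first string contains a byte (0xE9) that is not valid UTF-8.
x <- c("caf\xe9", "plain ascii")
clean <- to_utf8(x)

# Every element should now be valid UTF-8.
print(validUTF8(clean))

# With tm, the same helper could probably be applied like this. Recent tm
# versions expect custom functions to be wrapped in content_transformer():
# yourCorpus <- tm_map(yourCorpus, content_transformer(to_utf8))
```

Cleaning the text this way before calling functions such as tolower is what should avoid the invalid input error, since the offending bytes are turned into plain ASCII placeholders.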
