I'm trying to use the tm package in R to perform some text analysis. I tried the following:
require(tm)
dataSet <- Corpus(DirSource('tmp/'))
dataSet <
I have just run afoul of this problem. Are you by any chance using a machine running OS X? I am, and I seem to have traced the problem to the definition of the character set that R is compiled against on this operating system (see https://stat.ethz.ch/pipermail/r-sig-mac/2012-July/009374.html).
What I was seeing is that using the solution from the FAQ
tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))
was giving me this warning:
Warning message:
it is not known that wchar_t is Unicode on this platform
This I traced to the enc2utf8 function. The bad news is that this is a problem with my underlying OS and not R.
So here is what I did as a workaround:
tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))
This forces iconv to use the Mac variant of the UTF-8 encoding (UTF-8-MAC) and works fine without the need to recompile.
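If the same script also has to run on machines that are not Macs, hard-coding 'UTF-8-MAC' may fail there. Below is a rough sketch of one way to choose the target encoding at run time, using base R only. The names target_encoding and to_utf8 are my own hypothetical helpers, not part of tm, and the sketch assumes 'UTF-8-MAC' is only wanted when the OS reports itself as Darwin and iconv lists that encoding.

```r
# Hypothetical helper (not part of tm): pick the iconv target for this platform.
# Assumption: "UTF-8-MAC" is only wanted on macOS, and only if iconv knows it.
target_encoding <- function() {
  on_mac <- identical(unname(Sys.info()["sysname"]), "Darwin")
  if (on_mac && "UTF-8-MAC" %in% iconvlist()) "UTF-8-MAC" else "UTF-8"
}

# Convert a character vector from the native encoding, replacing any
# unconvertible bytes with <xx> escapes instead of returning NA.
to_utf8 <- function(x) iconv(x, to = target_encoding(), sub = "byte")

# Example input: a string containing a stray 0xff byte,
# which is not valid in a UTF-8 locale.
bad <- rawToChar(as.raw(c(0x61, 0xff, 0x62)))
print(to_utf8(bad))
```

With that in place, the call above would become tm_map(yourCorpus, to_utf8). On newer versions of tm the function may need to be wrapped, as in tm_map(yourCorpus, content_transformer(to_utf8)); I have not checked which versions require that.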