R tm package invalid input in 'utf8towcs'

逝去的感伤 2020-11-29 01:47

I'm trying to use the tm package in R to perform some text analysis. I tried the following:

require(tm)
dataSet <- Corpus(DirSource('tmp/'))
dataSet <         


        
14 answers
  •  温柔的废话
    2020-11-29 01:59

    This is a common issue with the tm package (1, 2, 3).

    One non-R way to fix it is to use a text editor to find and replace all the fancy characters (i.e. those with diacritics) in your text before loading it into R (or to use gsub in R). For example, you'd search for and replace every instance of the O-umlaut in Öl-Teppich. Others have had success with this (I have too), but it is not practical if you have thousands of individual text files.
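The gsub route mentioned above could look roughly like this. It is only a minimal sketch: it assumes the text is already in a character vector and is valid UTF-8, and the helper name `replace_umlauts` and its replacement table are hypothetical, not something tm provides.

```r
# Hypothetical helper (not part of tm): transliterate a few German
# characters and the en dash to plain ASCII using base R's gsub().
replace_umlauts <- function(x) {
  from <- c("\u00D6", "\u00F6", "\u00C4", "\u00E4",
            "\u00DC", "\u00FC", "\u00DF", "\u2013")
  to   <- c("Oe", "oe", "Ae", "ae", "Ue", "ue", "ss", "-")
  for (i in seq_along(from)) {
    # fixed = TRUE treats the pattern as a literal string, not a regex
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

txt <- "Erneut riesiger (Alt-)\u00D6l\u2013teppich im Golf von Mexiko"
clean <- replace_umlauts(txt)
print(clean)
```

This loses the original spelling, so it only makes sense when the analysis does not depend on the exact characters.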

    For an R solution, I found that using VectorSource instead of DirSource seems to solve the problem:

    # I put your example text in a file and tested it with both ANSI and 
    # UTF-8 encodings, both enabled me to reproduce your problem
    #
    tmp <- Corpus(DirSource('C:\\...\\tmp/'))
    tmp <- tm_map(tmp, tolower)
    Error in FUN(X[[1L]], ...) : 
      invalid input 'RT @noXforU Erneut riesiger (Alt-)Öl–teppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp' in 'utf8towcs'
    # quite similar error to what you got, both from ANSI and UTF-8 encodings
    #
    # Now try VectorSource instead of DirSource
    tmp <- readLines('C:\\...\\tmp.txt') 
    tmp
    [1] "RT @noXforU Erneut riesiger (Alt-)Öl–teppich im Golf von Mexiko (#pics vom Freitag) http://bit.ly/bw1hvU http://bit.ly/9R7JCf #oilspill #bp"
    # looks ok so far
    tmp <- Corpus(VectorSource(tmp))
    tmp <- tm_map(tmp, tolower)
    tmp[[1]]
    rt @noxforu erneut riesiger (alt-)öl–teppich im golf von mexiko (#pics vom freitag) http://bit.ly/bw1hvu http://bit.ly/9r7jcf #oilspill #bp
    # seems to have worked just fine. It worked best with ANSI encoding.
    # There was no error with UTF-8 encoding, but the Ö was returned
    # as ã–, which is not good
    

    But this seems like a bit of a lucky coincidence. There must be a more direct way to go about it. Do let us know what works for you!
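One candidate for a more direct route is to check and repair the encoding before the text reaches the corpus. This is a different technique from the VectorSource workaround above, not something the answer tested. The sketch uses only base R (`validUTF8`, which needs R 3.3.0 or later, and `iconv`) and assumes the problem files are actually Latin-1 (what Windows editors often call ANSI); that assumption may not hold for your data.

```r
# Build a string from Latin-1 bytes: 0xD6 is "Ö" in Latin-1, but a lone
# 0xD6 byte is not valid UTF-8. This stands in for a line read from a file.
raw_txt <- rawToChar(as.raw(c(0xD6, 0x6C)))

# Detect strings that are not valid UTF-8 before handing them to tm
is_ok <- validUTF8(raw_txt)

# Assumption: the source encoding is Latin-1. Re-encode to UTF-8.
fixed <- iconv(raw_txt, from = "latin1", to = "UTF-8")
is_ok_after <- validUTF8(fixed)

# Alternative when the source encoding is unknown: drop the bytes that
# cannot be interpreted as UTF-8 (this discards characters).
stripped <- iconv(raw_txt, from = "UTF-8", to = "UTF-8", sub = "")
```

The re-encoded vector can then be passed to `Corpus(VectorSource(...))` as in the example above. Whether this removes the 'utf8towcs' error in every setup is not something this sketch verifies.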
