R Corpus Is Messing Up My UTF-8 Encoded Text

青春惊慌失措 2020-12-17 05:26

I am simply trying to create a corpus from Russian, UTF-8 encoded text. The problem is, the Corpus method from the tm package is not encoding the strings corre

3 Answers

春和景丽 (original poster)

2020-12-17 06:16
Well, there seems to be good news and bad news.

The good news is that the data appears to be fine, even though it doesn't display correctly with inspect(). Try looking at the content directly:
```
content(corp[[2]])
# [1] "Складское помещение, 345 м²"
```
The reason it looks funny in inspect() is that the authors changed the way the print.PlainTextDocument function works. It used to cat() the value to the screen. Now, however, they feed the data through writeLines(). This function uses the locale of the system to format the characters/bytes in the document. (The locale can be viewed with Sys.getlocale().) It turns out that Linux and OS X have a proper "UTF-8" encoding, but Windows uses language-specific code pages. So if the characters aren't in the code page, they get escaped or translated to funny characters. This means the output should look fine on a Mac, but not on a PC.
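
To check which of the two situations applies to a given session, a minimal base R sketch along these lines may help. The sample string and the variable names (`x`, `is_utf8_locale`) are illustrative assumptions and do not come from the question; the string is written with `\u` escapes so that it does not depend on how the script file itself is encoded.

```r
# Locale category that governs how characters are rendered on output.
Sys.getlocale("LC_CTYPE")

# l10n_info() reports, among other things, whether the session locale is UTF-8.
is_utf8_locale <- isTRUE(l10n_info()[["UTF-8"]])

# Illustrative sample: a five-letter Cyrillic word written with \u escapes,
# which R marks as UTF-8 regardless of the locale.
x <- "\u0421\u043a\u043b\u0430\u0434"

Encoding(x)               # the encoding mark carried by the string
nchar(x, type = "chars")  # number of characters
nchar(x, type = "bytes")  # number of bytes (two per Cyrillic letter in UTF-8)

# enc2native() converts to the session's native encoding; characters that the
# native encoding cannot represent are where the display problems come from.
enc2native(x)
```

If `is_utf8_locale` is FALSE and the last line shows escapes or substituted characters instead of Cyrillic letters, the data is probably intact and only the display is affected, which matches the behaviour described above.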

Try going a step further and building a DocumentTermMatrix:
```
# Build the matrix from the existing corpus, then list its terms
dtm <- DocumentTermMatrix(corp)
Terms(dtm)
```
Hopefully you will see (as I do) the words correctly displayed.

If you like, this article about writing UTF-8 files on Windows has some more information about this OS-specific issue. I see no easy way to get writeLines() to output UTF-8 to stdout() on Windows. I'm not sure why the package maintainers changed the print method, but one might ask them or submit a feature request to change it back.
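
For writing to a file rather than to stdout(), one commonly used base R approach is to pass `useBytes = TRUE` so that writeLines() emits the UTF-8 bytes unchanged instead of re-encoding them for the locale. The sketch below is only an illustration of that idea and is not necessarily what the article recommends; the sample string, the temporary file, and the variable names are assumptions.

```r
# Illustrative sample: a Cyrillic word written with \u escapes (marked UTF-8).
x <- "\u0421\u043a\u043b\u0430\u0434"

# Write the raw UTF-8 bytes to a temporary file. Opening the connection in
# binary mode avoids any newline translation, and useBytes = TRUE stops
# writeLines() from converting the text to the native encoding.
path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
writeLines(enc2utf8(x), con, useBytes = TRUE)
close(con)

# Read the file back, declaring the content as UTF-8.
back <- readLines(path, encoding = "UTF-8")

# TRUE when the text survived the round trip unchanged.
identical(back, x)
```

This only sidesteps the locale for files; it does not change what inspect() prints to the console.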