R on Windows: character encoding hell

旧巷少年郎 2020-11-29 23:32

I am trying to import a CSV file encoded as OEM-866 (a Cyrillic charset) into R on Windows. I also have a copy that has been converted to UTF-8 without a BOM. Both of these files are r

5 answers
  •  萌比男神i
    2020-11-30 00:02

    There are two options for reading data from files that contain characters unsupported by your current locale: change your locale, as suggested by @user23676, or convert the data to UTF-8. The readr package provides replacements for the read.table-derived functions that perform this conversion for you. You can read the CP866 file with

    library(readr)
    oem.csv <- read_csv2('~/csv1.csv', locale = locale(encoding = 'CP866'))
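
To make the "convert to UTF-8" option concrete, here is a minimal base-R sketch of the conversion step itself. It is not the readr implementation; it only assumes that the platform's iconv knows the name "CP866" (true for the usual Windows, glibc and macOS builds, but worth checking with iconvlist()). The sample text and the temporary file are made up for illustration.

```r
# Hypothetical sample: "Ivan;Moscow" in Russian, written with \u escapes so the
# script itself stays ASCII-only and does not depend on the editor's encoding.
utf8_line <- "\u0418\u0432\u0430\u043d;\u041c\u043e\u0441\u043a\u0432\u0430"

# Simulate a CP866 file: encode the text to raw CP866 bytes and write them out.
cp866_bytes <- iconv(utf8_line, from = "UTF-8", to = "CP866", toRaw = TRUE)[[1]]
path <- tempfile(fileext = ".csv")
writeBin(cp866_bytes, path)

# Read the bytes back untouched, then convert CP866 -> UTF-8.
# iconv() accepts a list of raw vectors, so the current locale is never involved.
file_bytes <- readBin(path, what = "raw", n = file.size(path))
decoded <- iconv(list(file_bytes), from = "CP866", to = "UTF-8")

# The result is a character string marked as UTF-8 that matches the original.
print(Encoding(decoded))
print(identical(decoded, utf8_line))
```

Working on raw bytes avoids any implicit translation through the native encoding, which is the step that tends to go wrong when the locale cannot represent Cyrillic.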
    

    There is one small problem: a bug in print.data.frame causes columns with UTF-8 encoding to be displayed incorrectly on Windows. You can work around the bug with print.listof(oem.csv) or print(as.matrix(oem.csv)).
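
The matrix workaround can be tried without any input file. The sketch below builds a small data frame by hand (the column names and values are invented for illustration) and prints it through as.matrix(); whether plain print() misrenders the same data depends on the R version and the Windows locale, so no output is shown here.

```r
# Hypothetical data: two UTF-8 Cyrillic values, written with \u escapes.
df <- data.frame(
  name = "\u0418\u0432\u0430\u043d",              # "Ivan"
  city = "\u041c\u043e\u0441\u043a\u0432\u0430",  # "Moscow"
  stringsAsFactors = FALSE
)

# as.matrix() turns the data frame into a character matrix, so printing goes
# through the matrix print method instead of print.data.frame.
m <- as.matrix(df)
print(m)

# The strings keep their UTF-8 marking after the conversion.
print(Encoding(m[1, "name"]))
```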

    I've discussed this in more detail in a blog post at http://people.fas.harvard.edu/~izahn/posts/reading-data-with-non-native-encoding-in-r/
