Question
I am struggling with some encoding issues. I have many text files that contain rows in the following format:
https://dl.dropboxusercontent.com/u/94114397/example.txt
According to Notepad++, these are all encoded in UTF-8, and most non-ASCII characters are displayed correctly, as you can see in lines 1 and 2. However, some characters seem to be misinterpreted. In my example file, this is the case in line 3 in the word "Lakuic", where there should be an "š" between the "u" and the "i". There actually is a character between those two letters, which can be seen by copy-pasting the word into the Google Chrome address bar.
Now when I read the data into R, it displays "Laku<U+009A>ic". How can I resolve this?
Answer 1:
Try converting from UTF-8 to latin1:
df <- read.table("http://dl.dropboxusercontent.com/u/94114397/example.txt", sep = "\t", row.names = 1, stringsAsFactors = FALSE, encoding="UTF-8")
iconv(df[, 1], from = "UTF-8", to = "latin1")
# [1] "Trichocentrum<->longifolium<-><->(Lindl.) R.Jiménez, Acta Bot. Mex. 97: 54 (2011)."
# [2] "Salvia<->× hegelmaieri<->nothosubsp. accidentalis<->(Sánchez-Gómez & R.Morales)."
# [3] "Edraianthus<->tarae<-><->Lakušic, Bilten Drustva Ekologa BiH, Ser. A 4: 108 (1987)."
My sessionInfo():
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 7 x64 (build 7601) Service Pack 1
#
# locale:
# [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252 LC_NUMERIC=C LC_TIME=German_Germany.1252
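A note on why this probably works, and why it may depend on the locale shown above: U+009A is a C1 control character, and converting it to latin1 yields the single byte 0x9A. In Windows-1252 that byte is "š", so a console running in a 1252 locale prints the expected letter. On a machine with a different locale, a second conversion that reinterprets the bytes as Windows-1252 may be needed. Below is a minimal sketch of that two-step idea; the sample string and the variable names are made up for illustration, and it assumes the local iconv accepts the name "CP1252" (check `iconvlist()` if it does not).

```r
# Hypothetical one-element example; "\u009a" is the C1 control character
# from the question, sitting where "š" should be.
x <- "Laku\u009aic"

# Step 1: UTF-8 -> latin1 turns U+009A into the single byte 0x9A.
as_latin1 <- iconv(x, from = "UTF-8", to = "latin1")

# Step 2: reinterpret the bytes as Windows-1252 (where 0x9A is "š")
# and convert back to UTF-8. Assumes iconv knows the name "CP1252".
fixed <- iconv(as_latin1, from = "CP1252", to = "UTF-8")

# Inspect the code points rather than trusting how the console prints.
utf8ToInt(fixed)
```

This only makes sense when every character in the input fits into latin1; anything outside that range comes back as NA from the first iconv() call, so it is worth checking for NA values before overwriting the original column.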
Answer 2:
This works for me:
file1 <- "https://dl.dropboxusercontent.com/u/94114397/example.txt"
result <- read.table(file1, header = FALSE, sep = "\t", quote = "\"", encoding = "windows-1252")
Source: https://stackoverflow.com/questions/30595862/r-encoding-utf-8-u0080-u009f