strange characters: interaction of R and Windows locale?

[愿得一人] 2020-12-16 15:23

WinXP-x32, R-2.13.0

Dear list,

I have a problem that (I think) relates to the interaction between Windows and R.

I am trying to scrape a table with d

2 Answers
  • 2020-12-16 16:07

    I was unable to replicate the error, but the examples in the help file for Sys.setlocale are useful:

    Sys.setlocale("LC_TIME", "de")     # Solaris: details are OS-dependent
    Sys.setlocale("LC_TIME", "de_DE.utf8")   # Modern Linux etc.
    Sys.setlocale("LC_TIME", "de_DE.UTF-8")  # ditto
    Sys.setlocale("LC_TIME", "de_DE")  # OS X, in UTF-8
    Sys.setlocale("LC_TIME", "German") # Windows
    

    On Windows, use locale names such as "English" or "Dutch_Netherlands.1252" to change these settings.

    I tried to replicate your setup:

    > Sys.setlocale("LC_ALL","Dutch_Netherlands.1252")
    [1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"
    > Sys.getlocale()
    [1] "LC_COLLATE=Dutch_Netherlands.1252;LC_CTYPE=Dutch_Netherlands.1252;LC_MONETARY=Dutch_Netherlands.1252;LC_NUMERIC=C;LC_TIME=Dutch_Netherlands.1252"
    
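    library(XML)
    # Read every HTML table on the page into a list of data frames
    u <- "http://en.wikipedia.org/wiki/Hawaii"
    tables <- readHTMLTable(u)
    # The islands table was the fifth table on the page
    Islands <- tables[[5]]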
    

    However, I do not get the funny characters in the console. In my own locale the ʻ was shown with a substitute marker instead, but everything still worked.

    > Islands[1,1]
    [1] Hawaiʻi[27]
    8 Levels: Hawaiʻi[27] Kahoʻolawe[34] Kauaʻi[30] Lānaʻi[32] Maui[28] ... Oʻahu[29]
    

    These funny characters can also be read easily and matched in the table:

    > Encoding(as.character("Hawaiʻi"))
    [1] "UTF-8"
    > Encoding(as.character(Islands[1,1]))
    [1] "UTF-8"
    > grep("Hawaiʻi", as.character(Islands[1,1]))
    [1] 1
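
    To confirm which character is involved, its code point can be inspected directly. This is a minimal sketch using only base R; typing the string with a \u escape is my own assumption, chosen so that the example does not depend on the console encoding.

    ```r
    # The okina in "Hawaiʻi", written as an escape so the source stays ASCII
    x <- "Hawai\u02bbi"

    enc <- Encoding(x)                   # declared encoding of the string
    cp <- utf8ToInt("\u02bb")            # code point as an integer: 699
    label <- sprintf("U+%04X", cp)       # same code point in hex: "U+02BB"

    # One character, but two bytes in UTF-8
    n_chars <- nchar(x, type = "chars")  # 7
    n_bytes <- nchar(x, type = "bytes")  # 8
    ```

    If the character and byte counts differ like this, the string holds a multibyte character, which is what a single-byte Windows code page cannot display.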
    

    If you still have problems, the cause probably lies elsewhere. Note, however, that to change the locale under Windows you have to use different names than on Linux or OS X (see your own locale info for an example). On Windows, "Dutch" is probably enough.
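
    Since the accepted names differ per platform, one way to find out which ones your system takes is to try them and restore the old setting afterwards. This is only a sketch: try_locale is a hypothetical helper, not part of R, and "nl_NL.UTF-8" is my guess at the Linux-style equivalent of the Windows names above.

    ```r
    # Hypothetical helper: TRUE if the OS accepts `name` for `category`.
    # The previous locale is restored when the function exits.
    try_locale <- function(name, category = "LC_CTYPE") {
      old <- Sys.getlocale(category)
      on.exit(Sys.setlocale(category, old), add = TRUE)
      # Sys.setlocale() returns "" (with a warning) when the request fails
      res <- suppressWarnings(Sys.setlocale(category, name))
      isTRUE(nzchar(res))
    }

    # Windows-style names, an assumed Linux-style name, and the portable "C"
    candidates <- c("Dutch_Netherlands.1252", "Dutch", "nl_NL.UTF-8", "C")
    accepted <- vapply(candidates, try_locale, logical(1))
    print(accepted)
    ```

    Which entries come back TRUE depends on the operating system and the locales installed on it; only "C" should be accepted everywhere.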

  • 2020-12-16 16:11

    Not quite an answer:

    If you look at the Wikipedia page and change the encoding in your browser (in IE, View -> Encoding; in Firefox, View -> Character Encoding) to Western (ISO-8859-1) or Western (Windows-1252), then you see the silly characters. That ought to mean that you can use iconv to change the encoding and fix your problems.

    #Convert factors to character
    Islands <- as.data.frame(lapply(Islands, as.character), stringsAsFactors = FALSE)
    
    iconv(Islands$Island, "windows-1252", "UTF-8")
    

    Unfortunately, it doesn't work. It may be possible to get the correct text by using a different conversion (iconvlist() shows all the possibilities).
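
    One candidate for a different conversion is the reverse direction. The sketch below assumes the garbling comes from UTF-8 bytes being read as windows-1252; that is my assumption, not something confirmed for the data in the question. It reproduces the garbling on a test string and then undoes it.

    ```r
    x <- "Hawai\u02bbi"

    # Read the UTF-8 bytes as if they were windows-1252: this garbles the
    # okina into two Latin characters
    garbled <- iconv(x, from = "windows-1252", to = "UTF-8")

    # Reverse it: write the characters back out as windows-1252 bytes,
    # then declare that those bytes are really UTF-8
    repaired <- iconv(garbled, from = "UTF-8", to = "windows-1252")
    Encoding(repaired) <- "UTF-8"

    identical(repaired, x)
    ```

    If the text in your session was damaged in some other way, this round trip will not restore it.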

    It is possible to simply strip out the offending characters, though this isn't ideal.

    iconv(Islands$Island, "windows-1252", "ASCII", "")
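
    To see what stripping costs, here is a small sketch on a few of the island names from the table. The names are typed with \u escapes, and the input is treated as UTF-8 rather than windows-1252, which is my assumption based on the Encoding() results in the other answer.

    ```r
    islands <- c("Hawai\u02bbi", "Kaho\u02bbolawe", "L\u0101na\u02bbi", "Maui")

    # sub = "" drops every byte that has no ASCII equivalent
    stripped <- iconv(islands, from = "UTF-8", to = "ASCII", sub = "")
    print(stripped)
    ```

    The okina disappears cleanly, but the ā in Lānaʻi is dropped as well, leaving "Lnai", so the stripped names are not always usable as they are.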
    