Force character vector encoding from “unknown” to “UTF-8” in R

挽巷 2020-11-29 18:31

I have a problem with inconsistent encoding of a character vector in R.

The text file which I read a table from is encoded (via Notepad++

2 Answers
  •  [愿得一人]
    2020-11-29 19:19

    I could not find a solution to a similar problem myself. I was unable to convert characters of unknown encoding, read from a txt file, into something more manageable in R.

    As a result, the same character appeared more than once in the same dataset because it was encoded differently ("X" in a Latin setting and "X" in a Greek setting). Saving to txt preserved that encoding difference, as it should.
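
    Assuming the duplicates are look-alike letters from different scripts (for instance Latin "X" and Greek capital chi), a quick check is to compare code points with base R's utf8ToInt(). The sample strings below are illustrative, not taken from the original data:

    ```r
    latin_x <- "X"          # Latin capital X, U+0058
    greek_chi <- "\u03a7"   # Greek capital chi, U+03A7, looks the same in many fonts

    # The code points differ even though the glyphs look alike
    cp_latin <- utf8ToInt(latin_x)
    cp_greek <- utf8ToInt(greek_chi)
    print(c(cp_latin, cp_greek))

    # So R treats them as distinct values, e.g. in unique() or table()
    same <- latin_x == greek_chi
    n_distinct <- length(unique(c(latin_x, greek_chi)))
    ```

    If the code points differ, no encoding conversion will merge the two values; they would need to be mapped to one another explicitly.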

    I tried some of the methods above, but none of them worked. The problem is well described as: “cannot distinguish ASCII from UTF-8 and the bit will not stick even if you set it”.
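
    The behaviour behind that quote can be seen with base R's Encoding(). This is a minimal sketch with made-up sample strings:

    ```r
    ascii <- "abc"
    Encoding(ascii) <- "UTF-8"
    enc_ascii <- Encoding(ascii)   # still "unknown": pure ASCII strings are never marked

    x <- "fa\xE7ile"               # byte E7 is c-cedilla in Latin-1
    Encoding(x) <- "latin1"        # declares what the bytes are, does not convert them
    x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")  # converts the bytes
    enc_pair <- Encoding(c(x, x_utf8))                 # "latin1" "UTF-8"
    ```

    An "unknown" mark on an ASCII-only string is therefore harmless; only strings with non-ASCII bytes need a declared or converted encoding.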

    A good workaround is to "export your data.frame to a CSV temporary file and reimport with data.table::fread(), specifying Latin-1 as source encoding."

    Here is the example from that source:

    library(data.table)
    df <- your_data_frame_with_mixed_utf8_or_latin1_and_unknown_str_fields
    # Write to a temporary CSV, then re-read it declaring Latin-1 as the source encoding
    fwrite(df, "temp.csv")
    your_clean_data_table <- fread("temp.csv", encoding = "Latin-1")
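
    If a round trip through a file is undesirable, one alternative sketch (not the method from the quoted source) is to convert in memory with base R. The helper name fix_encoding and the Latin-1 fallback are assumptions: the heuristic keeps any string whose bytes are already valid UTF-8 and treats everything else as Latin-1, which may be wrong for other source encodings.

    ```r
    # Hypothetical helper: keep valid UTF-8 as is, convert everything else from `fallback`
    fix_encoding <- function(x, fallback = "latin1") {
      bad <- !is.na(x) & !validUTF8(x)     # strings whose bytes are not valid UTF-8
      x[bad] <- iconv(x[bad], from = fallback, to = "UTF-8")
      Encoding(x) <- "UTF-8"               # mark the result; pure ASCII stays "unknown"
      x
    }

    mixed <- c("abc", "fa\xE7ile", "fa\u00e7ile")   # ASCII, Latin-1 bytes, UTF-8
    clean <- fix_encoding(mixed)
    enc_clean <- Encoding(clean)

    # Apply the helper to every character column of a data frame
    df <- data.frame(id = 1:3, txt = mixed, stringsAsFactors = FALSE)
    is_chr <- vapply(df, is.character, logical(1))
    df[is_chr] <- lapply(df[is_chr], fix_encoding)
    ```

    After the conversion the second and third elements compare equal, so the duplicate entries collapse into one value.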
    

    I hope this helps someone.
