Force character vector encoding from “unknown” to “UTF-8” in R

挽巷 2020-11-29 18:31

I have a problem with inconsistent encoding of a character vector in R.

The text file which I read a table from is encoded (via Notepad++

2 Answers

[愿得一人] (original poster)

2020-11-29 19:19
I ran into a similar problem and could not find a solution on my own: I could not convert the characters with unknown encoding, read from a txt file, into something more manageable in R.

As a result, the same character appeared more than once in the same dataset because it was encoded differently ("X" typed with a Latin setting and "X" typed with a Greek setting). Saving to txt preserved that encoding difference, as it should.
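A minimal sketch of how such look-alike characters can be told apart. It assumes the two "X" values are the Latin capital X (U+0058) and the Greek capital chi (U+03A7); the actual characters in the dataset are not stated in this post, so treat the values here as hypothetical.

```r
# Assumption: the duplicates are visually identical letters from two scripts.
latin_x <- "X"        # Latin capital letter X, U+0058
greek_x <- "\u03a7"   # Greek capital letter chi, U+03A7

# The strings look the same when printed but do not compare equal.
same <- latin_x == greek_x

# utf8ToInt() exposes the underlying code points, which differ.
latin_cp <- utf8ToInt(latin_x)
greek_cp <- utf8ToInt(greek_x)

print(same)
print(c(latin_cp, greek_cp))
```

Comparing code points this way shows whether two values that print identically are really the same character.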

I tried some of the methods above, and none of them worked. The problem is well described as “cannot distinguish ASCII from UTF-8 and the bit will not stick even if you set it”.
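The quoted behavior can be checked in base R. The sketch below uses a made-up two-element vector (not data from this thread): a pure-ASCII string keeps the "unknown" mark even after an explicit assignment, while a string containing a non-ASCII character is marked as UTF-8.

```r
# Hypothetical sample: one pure-ASCII string, one with a non-ASCII character.
x <- c("abc", "caf\u00e9")

# Strings built with \u escapes are marked UTF-8; ASCII strings are never marked.
before <- Encoding(x)

# Try to force the UTF-8 mark on every element.
Encoding(x) <- "UTF-8"
after <- Encoding(x)

print(before)
print(after)
```

The ASCII element reports "unknown" both before and after the assignment, which is why a vector can look "mixed" even though its bytes are valid UTF-8.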

A good workaround is: "export your data.frame to a CSV temporary file and reimport with data.table::fread(), specifying Latin-1 as source encoding."

Reproducing the example given in the above source:
```
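library(data.table)  # provides fwrite() and fread()
df <- your_data_frame_with_mixed_utf8_or_latin1_and_unknown_str_fields
# Write the data frame to a temporary CSV file
fwrite(df, "temp.csv")
# Re-import it, declaring Latin-1 as the source encoding
your_clean_data_table <- fread("temp.csv", encoding = "Latin-1")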
```
I hope this helps someone.