Trouble with strings with Unicode characters

日久生厌 2020-12-20 12:35

I have a very large dataset (70k rows, 2600 columns, CSV format) that I have created by web scraping. Unfortunately, while doing the pre-processing, processing, etc., at some point

2 Answers
  • 2020-12-20 13:13

    Not sure it will work for you, but for the same symptoms I converted the strings to ASCII:

        x <- iconv(x, from = "", to = "ASCII", sub = "byte")
    

    Non-ASCII characters are replaced with "<xx>", where xx is the hex code of the byte.

    You can then gsub the hex codes to whatever values suit you.
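
    For example, assuming iconv has already produced codes like "<e9>" and "<fc>" (these particular bytes and replacements are just illustrative, not from the original question), you could map them back to plain ASCII:

        x <- gsub("<e9>", "e", x, fixed = TRUE)  # 0xE9 is é in Latin-1; map it to plain "e"
        x <- gsub("<fc>", "u", x, fixed = TRUE)  # 0xFC is ü in Latin-1; map it to plain "u"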

  • 2020-12-20 13:23

    I've had a bit of a horrible time with this pernicious little problem, but I think/hope I've finally got somewhere.

    After messing around with the read_csv option locale = locale(encoding = "xyz") and trying various combinations of other solutions (the gsub solution didn't work), I tried the stringi solution...

    That didn't work either, but stringi has a function, stri_enc_detect, which I ran on the problem values: stri_enc_detect(x). It suggested an encoding I hadn't tried, in this case windows-1252, which I promptly set in the read_csv options: locale = locale(encoding = "windows-1252")

    Hey presto, it's displaying correctly now.
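
    Putting the pieces together, a rough sketch of that workflow (the file name "data.csv" and the column name text are placeholders, not from the original question):

        library(readr)
        library(stringi)

        raw <- read_csv("data.csv")            # first attempt with the default encoding
        stri_enc_detect(raw$text[1])           # ask stringi which encodings fit a problem value

        # If the detector suggests windows-1252, re-read the file with that locale:
        clean <- read_csv("data.csv", locale = locale(encoding = "windows-1252"))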
