Unicode character conversion in R

既然无缘 2020-12-03 19:48

I have this MTST column which, when printed, yields:

 [1] "<U+0391>G<U+03A1><U+0399><U+039D><U+0399><U+039F>                                 "


        
3 Answers
  •  天涯浪人
    2020-12-03 20:12

    What you have there looks like plain 7-bit ASCII characters, with some attempt at encoding Unicode code points by wrapping each of them thus: <U+XXXX>, where XXXX is the hexadecimal code point.

    This is not a recognised encoding for Unicode, as far as I can tell, partly because how would you put a real < in your text? I suppose every < could be <U+jklm>, where jklm is the code for an angle bracket... But ick.

    So, first, try to get a UTF-8 encoded string from whatever generated this ASCII-encoded mess!

    However... after some serious hair pulling...

    stringi to the rescue! Where 'MTST' is your vector of stuff, first convert the angle bracket notation to backslash-u and then use stri_unescape_unicode:

    > require(stringi)
    > greek2=gsub(">","", gsub("<U\\+","\\\\u",MTST))
    > stri_unescape_unicode(greek2)
    [1] "ΑGΡΙΝΙΟ                                 "
    [2] "ΑGΧΙΑΛΟS                                "
    [3] "ΑΙGΙΝΑ                                  "
    [4] "ΑΙGΙΟ                                   "
    [5] "ΑΙΔΗΨΟS                                 "
    [6] "ΑΚΤΙΟ(ΠΡΕΒΕΖΑ)                          "
    

    all the way up to

    [123] "FΥΧΤΙΑ                                  "
    [124] "ΧΑΛΚΙΔΑ                                 "
    [125] "ΧΑΝΙΑ                                   "
    [126] "ΧΙΟS                                    "
    [127] "ΧΡΥSΟΥΠΟΛΗ_ΚΑΒΑΛΑ                       "
    [128] "OΡΕΟΙ                                   "
    

    once I fixed the bizarrely missing comma and quote mark in your "dput" data (edited your question for you).
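
For anyone who wants to try the idea without the original data, here is a minimal, self-contained sketch of the same conversion. The `mtst_sample` vector is made up for illustration (it is not the asker's real column), and the pattern assumes every token is exactly four hex digits inside `<U+...>`. Unlike the two nested `gsub` calls above, this variant rewrites only complete `<U+XXXX>` tokens, so a stray `>` elsewhere in the text is left alone. It also trims the padding spaces, which the original answer did not do.

```r
library(stringi)

# Hypothetical sample input in the <U+XXXX> notation; not the asker's real data.
mtst_sample <- c("<U+0391>G<U+03A1><U+0399><U+039D><U+0399><U+039F>   ",
                 "<U+03A7><U+0399><U+039F>S   ")

# Rewrite each complete <U+XXXX> token as a \uXXXX escape in a single pass.
# In the replacement, "\\\\" yields one literal backslash and "\\1" is the
# captured group of four hex digits.
escaped <- gsub("<U\\+([0-9A-Fa-f]{4})>", "\\\\u\\1", mtst_sample)

# Turn the \uXXXX escapes into real characters, then drop the padding spaces.
greek <- trimws(stri_unescape_unicode(escaped))

print(greek)
```

One caveat: `stri_unescape_unicode` interprets every backslash escape in its input, so text that already contains literal backslashes would need those protected first.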
