Emoji in R [UTF-8 encoding]


I'm trying to do an emoji analysis in R. I have stored some tweets that contain emojis.

Here is one of the tweets that I want

2 answers
  •  青春惊慌失措
    2021-01-21 09:57

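    I use iconv(tweet, 'UTF-8', 'latin1', 'byte') to preserve accented characters. The fourth argument, sub = 'byte', replaces every byte that cannot be converted with its hex code in angle brackets: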

    > tweetn2
    [1] "Prógrämmè dü week-eñd: \xed��\xed�\u0083\xed��\xed��\xed��\xed��\xed��\xed��\xed��\xed�� "
    > iconv(tweetn2, 'UTF-8', 'latin1', 'byte')
    [1] "Prógrämmè dü week-eñd: <83> "
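
    A minimal sketch of what the sub argument does, assuming a hypothetical sample string x (not one of the original tweets) and an iconv build that knows 'latin1'. Characters that exist in latin1 are converted, while the bytes of anything else, such as an emoji, are written out as <xx> hex codes:

    ```r
    # Hypothetical sample: an accented letter plus one emoji (U+1F600)
    x <- "week-e\u00f1d \U0001F600"

    # sub = "byte" replaces each non-convertible byte with "<xx>"
    y <- iconv(x, from = "UTF-8", to = "latin1", sub = "byte")

    # The n-tilde survives as a single latin1 byte; the four UTF-8 bytes
    # of the emoji (f0 9f 98 80) become the literal text "<f0><9f><98><80>".
    print(y)
    ```

    The emoji itself is lost in this conversion, so it only helps when the goal is to keep the accented text readable.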
    

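    As for decoding the emoji, I would suggest using a function that implements nj_'s answer, or directly using an emoji dictionary like the one I proposed. The two helpers below convert between a Unicode code point and its UTF-16 surrogate pair (hi, lo):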

    # Code point above U+FFFF -> UTF-16 surrogate pair, as hex strings c(hi, lo)
    unicode2hilo <- function(unicode){
       hi <- floor((unicode - 0x10000)/0x400) + 0xd800
       lo <- (unicode - 0x10000) + 0xdc00 - (hi-0xd800)*0x400
       hilo <- paste('0x', as.hexmode(c(hi,lo)), sep = '')
       return(hilo)
    }
    
    # UTF-16 surrogate pair (hi, lo) -> code point, as a hex string
    hilo2unicode <- function(hi,lo){
       unicode <- (hi - 0xD800) * 0x400 + lo - 0xDC00 + 0x10000
       unicode <- paste('0x', as.hexmode(unicode), sep = '')
       return(unicode)
    }
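
    As a rough worked example of the same arithmetic, the sketch below round-trips U+1F600 through its surrogate pair and then builds the actual character with base R's intToUtf8. The names to_surrogates and from_surrogates are hypothetical, numeric-returning variants of the helpers above, introduced only for illustration:

    ```r
    # Hypothetical numeric variant of unicode2hilo: returns c(hi, lo) as numbers
    to_surrogates <- function(cp) {
      offset <- cp - 0x10000
      c(hi = 0xD800 + offset %/% 0x400,  # top 10 bits of the offset
        lo = 0xDC00 + offset %% 0x400)   # bottom 10 bits of the offset
    }

    # Hypothetical numeric variant of hilo2unicode: returns the code point
    from_surrogates <- function(hi, lo) {
      (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000
    }

    pair <- to_surrogates(0x1F600)          # hi = 0xD83D, lo = 0xDE00
    cp   <- from_surrogates(0xD83D, 0xDE00) # 0x1F600 again

    # Turn the code point into a one-character UTF-8 string
    emoji <- intToUtf8(cp)
    ```

    Returning numbers rather than hex strings keeps the result usable in further arithmetic; format(as.hexmode(pair)) gives the hex form when needed for display.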
    
