Reading in Unicode Emoji correctly into R

蓝咒 submitted on 2020-01-15 05:01:48

Question


I have a set of comments from Facebook (pulled via a system like Sprinkr) that contain both text and emojis. I'm trying to run a variety of analyses on them in R, but I'm having difficulty ingesting the emoji characters correctly.

For example: I have a .csv (encoded in UTF-8) that will have a message line containing something like this:

"IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups💚💚💚"

I then ingest it into R in the following way:

library(tidyverse)
library(janitor)
raw.fb.comments <- read_csv("data.csv",
                            locale = locale(encoding="UTF-8"))
fb.comments <- raw.fb.comments %>%
  clean_names() %>%
  filter(senderscreenname != "Reese's") %>% 
  select(c(message,messagetype,sentiment)) %>%
  mutate(type = "Facebook")
fb.comments$message[5]
[1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups\xf0\u009f\u0092\u009a\xf0\u009f\u0092\u009a\xf0\u009f\u0092\u009a\n\n"

From what I understand from other sources, I need to convert this UTF-8 text to ASCII so that I can link it up with other emoji resources (like the wonderful emojidictionary). To make the join work, I need the emoji in R's byte-escaped form, something like this:

<e2><9d><a4><ef><b8><8f>

However, adding the normal step (using iconv) doesn't get me there:

fb.comments <- raw.fb.comments %>%
  clean_names() %>%
  filter(senderscreenname != "Reese's") %>% 
  select(c(message,messagetype,sentiment)) %>%
  mutate(type = "Facebook") %>%
  mutate(message = iconv(message, from="UTF-8", to="ascii",sub="byte"))
fb.comments$message[5]
[1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups<f0><9f><92><9a><f0><9f><92><9a><f0><9f><92><9a>\n\n"
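As a side check, the byte escapes printed above can be compared against the raw UTF-8 bytes of the emoji themselves. The following is a minimal base-R sketch, assuming a session that handles UTF-8 strings; the variable names are made up for illustration. It builds the emoji from their code points so the result does not depend on how the script file itself is encoded.

```r
# Build the emoji from Unicode code points (avoids source-file encoding issues).
green_heart <- intToUtf8(0x1F49A)            # U+1F49A GREEN HEART
red_heart   <- intToUtf8(c(0x2764, 0xFE0F))  # U+2764 HEAVY BLACK HEART + U+FE0F variation selector

# Format the raw UTF-8 bytes of each string in the <xx> style used above.
green_bytes <- paste0("<", as.character(charToRaw(green_heart)), ">", collapse = "")
red_bytes   <- paste0("<", as.character(charToRaw(red_heart)), ">", collapse = "")

print(green_bytes)
print(red_bytes)

# For a correctly encoded string, iconv with sub = "byte" is expected to give
# the same escapes; the exact behaviour can depend on the platform's iconv.
print(iconv(green_heart, from = "UTF-8", to = "ASCII", sub = "byte"))
```

U+1F49A encodes in UTF-8 as the four bytes f0 9f 92 9a, while e2 9d a4 ef b8 8f is the encoding of U+2764 followed by U+FE0F (a red heart). So the iconv output shown above appears to be the expected byte form for the green heart rather than a failed conversion; the two sequences simply belong to different emoji.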

Can anyone explain what I'm missing, or do I need to find a different emoji mapping resource? Thanks!


Answer 1:


The goal is not really clear, but I suspect that giving up on representing the emoji correctly and representing them as bytes instead is not the best approach. If, for example, you wish to convert emoji to their descriptions, you can do something like this:

library(stringi)   # stri_* string functions used below
library(magrittr)  # %>% pipe

x <- "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups💚💚💚"

## read emoji info and get rid of documentation lines
readLines("https://unicode.org/Public/emoji/5.0/emoji-test.txt",
          encoding="UTF-8") %>%
    stri_subset_regex(pattern = "^[^#]") %>%
    stri_subset_regex(pattern = ".+") -> emoji

## get the emoji characters and clean them up
emoji %>%
    stri_extract_all_regex(pattern = "# *.{1,2} *") %>%
    stri_replace_all_fixed(pattern = c("*", "#"),
                           replacement = "",
                           vectorize_all=FALSE) %>%
    stri_trim_both() -> emoji.chars

## get the emoji character descriptions
emoji %>%
    stri_extract_all_regex(pattern = "#.*$") %>%
    stri_replace_all_regex(pattern = "# *.{1,2} *",
                           replacement = "") %>%
    stri_trim_both() -> emoji.descriptions


## replace emoji characters with their descriptions.
stri_replace_all_regex(x,
                       pattern = emoji.chars,
                       replacement = emoji.descriptions,
                       vectorize_all=FALSE)

## [1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cupsgreen heartgreen heartgreen heart"
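The replacement step can also be tried offline with a tiny hand-written lookup table instead of the downloaded emoji-test.txt. The sketch below uses base R only. The two-row `lookup` table and the helper `describe_emoji` are hypothetical names introduced for illustration, and padding each description with spaces is a design choice of this sketch, not something taken from the answer above.

```r
# Hypothetical two-entry lookup table; a real one would be built from emoji-test.txt.
lookup <- data.frame(
  char        = c(intToUtf8(0x1F49A), intToUtf8(0x1F600)),
  description = c("green heart", "grinning face"),
  stringsAsFactors = FALSE
)

# Replace every emoji in `text` with its description.
# fixed = TRUE treats each emoji as a literal string, not a regular expression.
describe_emoji <- function(text, lookup) {
  for (i in seq_len(nrow(lookup))) {
    text <- gsub(lookup$char[i],
                 paste0(" ", lookup$description[i], " "),
                 text, fixed = TRUE)
  }
  # Collapse the runs of spaces introduced by the padding and trim the ends.
  trimws(gsub(" +", " ", text))
}

# Example input: a short message followed by three green hearts.
x <- paste0("Peanut Butter Cups",
            paste(rep(intToUtf8(0x1F49A), 3), collapse = ""))
result <- describe_emoji(x, lookup)
print(result)
```

Literal matching avoids the problem of emoji such as the keycap sequences containing characters (`*`, `#`) that have special meaning in a regular expression, and the space padding keeps consecutive descriptions from running together.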


Source: https://stackoverflow.com/questions/47675990/reading-in-unicode-emoji-correctly-into-r
