Reading in Unicode Emoji correctly into R

问题

I have a set of comments from Facebook (pulled via a system like Sprinkr) that contain both text and emojis, and I'm trying to run a variety of analysis on them in R, but running into difficulty into ingesting the emoji characters correctly.

For example: I have a .csv (encoded in UTF-8) that will have a message line containing something like this:

"IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups💚💚💚"

I then ingest it into R in the following way:

library(tidyverse)
library(janitor)
raw.fb.comments <- read_csv("data.csv",
                            locale = locale(encoding="UTF-8"))
fb.comments <- raw.fb.comments %>%
  clean_names() %>%
  filter(senderscreenname != "Reese's") %>% 
  select(c(message,messagetype,sentiment)) %>%
  mutate(type = "Facebook")
fb.comments$message[5]
[1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups\xf0\u009f\u0092\u009a\xf0\u009f\u0092\u009a\xf0\u009f\u0092\u009a\n\n"

Now, from what I understand from other sources, I need to transform this UTF-8 into ASCII, which I can then use to link it up with other emoji resources (like the wonderful emojidictionary). To make the join work, I need to get this into R-encoding, something like this:

<e2><9d><a4><ef><b8><8f>

However, adding the normal step (using iconv) doesn't get me there:

fb.comments <- raw.fb.comments %>%
  clean_names() %>%
  filter(senderscreenname != "Reese's") %>% 
  select(c(message,messagetype,sentiment)) %>%
  mutate(type = "Facebook") %>%
  mutate(message = iconv(message, from="UTF-8", to="ascii",sub="byte"))
fb.comments$message[5]
[1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups<f0><9f><92><9a><f0><9f><92><9a><f0><9f><92><9a>\n\n"

Can anyone out there illuminate to me what I'm missing, or do I need to find a different emoji mapping resource? Thanks!

回答1:

The goal is not really clear, but I suspect that giving up on representing emoji correcty and just representing it as bytes is not the best way. If for example you wish to convert emoji to their description you can do something like this:

x <- "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cups💚💚💚"

## read emoji info and get rid of documentation lines
readLines("https://unicode.org/Public/emoji/5.0/emoji-test.txt",
          encoding="UTF-8") %>%
    stri_subset_regex(pattern = "^[^#]") %>%
    stri_subset_regex(pattern = ".+") -> emoji

## get the emoji characters and clean them up
emoji %>%
    stri_extract_all_regex(pattern = "# *.{1,2} *") %>%
    stri_replace_all_fixed(pattern = c("*", "#"),
                           replacement = "",
                           vectorize_all=FALSE) %>%
    stri_trim_both() -> emoji.chars

## get the emoji character descriptions
emoji %>%
    stri_extract_all_regex(pattern = "#.*$") %>%
    stri_replace_all_regex(pattern = "# *.{1,2} *",
                           replacement = "") %>%
    stri_trim_both() -> emoji.descriptions


## replace emoji characters with their descriptions.
stri_replace_all_regex(x,
                       pattern = emoji.chars,
                       replacement = emoji.descriptions,
                       vectorize_all=FALSE)

## [1] "IS THIS CORRECT!?!?! Please say it isn't true!!! Our family only eats the original Reeses Peanut Butter Cupsgreen heartgreen heartgreen heart"

来源：https://stackoverflow.com/questions/47675990/reading-in-unicode-emoji-correctly-into-r

标签

text

unicode

utf-8

emoji