How can I match emoji with an R regex?

Posted by 試著忘記壹切 on 2019-12-04 03:41:16
PKumar

I mark the strings as UTF-8 so that each element can be compared with the emoji in the remoji package, which are also stored as UTF-8. I use the stringr package to find the positions of the emoji in the vector; grep or any similar function would work as well.

1st Method:

library(stringr)
xvect = c('😂', 'no', '🍹', '😀', 'no', '😛')

Encoding(xvect) <- "UTF-8"

which(str_detect(xvect, "[^[:ascii:]]"))
# [1] 1 3 4 6

Here elements 1, 3, 4 and 6 are the emoji.

Edit:

2nd Method: Install the remoji package with devtools, using the commands below. Since the elements have already been marked as UTF-8, they can be compared with the UTF-8 values of all the emoji in the remoji package. Use trimws to remove the surrounding whitespace.

install.packages("devtools")

devtools::install_github("richfitz/remoji")
library(remoji)
emj <- emoji(list_emoji(), TRUE)
xvect %in% trimws(emj)

Output:

which(xvect %in% trimws(emj))
# [1] 1 3 4 6

Neither method is foolproof. The first assumes that the vector contains no non-ASCII characters other than emoji, and the second relies on the emoji data shipped with remoji. If a particular emoji is missing from the package, the last command yields FALSE for it instead of TRUE.
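
To make the first limitation concrete, here is a minimal base R sketch. It does not use stringr; instead it checks code points directly with utf8ToInt, which amounts to the same "anything outside ASCII" test. The helper name has_non_ascii and the sample vector are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper: TRUE when a string contains any code point above 127,
# i.e. anything outside the ASCII range.
has_non_ascii <- function(x) {
  vapply(x, function(s) any(utf8ToInt(s) > 127L), logical(1), USE.NAMES = FALSE)
}

# Emoji are written as \U escapes so the example does not depend on file encoding.
# "caf\u00e9" is an ordinary accented word, not an emoji.
x <- c("\U0001F602", "no", "caf\u00e9", "\U0001F379")

flags <- has_non_ascii(x)
print(flags)
# [1]  TRUE FALSE  TRUE  TRUE
```

The accented word is flagged along with the two emoji, which is the kind of false positive the first method can produce on real text.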

Final Edit:

Following the discussion between the OP (@MichaelChirico) and @SymbolixAU (thanks to both of them), the problem appears to have been a small typo: the escape needs a capital U. The new regex is xvect[grepl('[\U{1F300}-\U{1F6FF}]', xvect)]. The character class covers the range U+1F300 to U+1F6FF. One can of course change or extend this range if an emoji lies outside it. The range may not cover every emoji, and the relevant ranges are likely to grow or change over time.
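
As a rough sketch of how the range could be extended, the same idea can be written without a regex by testing code points against a small table of ranges. This is a swapped-in technique, not the grepl call from the answer. The first range is the one from the regex above; the other two are assumptions added for illustration, and the table is not a complete emoji list. The names emoji_ranges and in_emoji_ranges are hypothetical.

```r
# Illustrative ranges only (start, end); not a complete emoji list.
emoji_ranges <- rbind(
  c(0x1F300, 0x1F6FF),  # range used in the answer's regex
  c(0x1F900, 0x1F9FF),  # assumption: Supplemental Symbols and Pictographs block
  c(0x2600,  0x27BF)    # assumption: Miscellaneous Symbols and Dingbats blocks
)

# Hypothetical helper: TRUE when any code point of a string falls in any range.
in_emoji_ranges <- function(x, ranges = emoji_ranges) {
  vapply(x, function(s) {
    cp <- utf8ToInt(s)
    any(vapply(cp, function(p) any(p >= ranges[, 1] & p <= ranges[, 2]),
               logical(1)))
  }, logical(1), USE.NAMES = FALSE)
}

# \U escapes keep the example independent of the source file encoding.
x <- c("\U0001F602", "no", "\U0001F379", "caf\u00e9", "\U0001F914")

hits <- in_emoji_ranges(x)
print(which(hits))
# [1] 1 3 5
```

Adding a row to emoji_ranges is the equivalent of adding another range inside the regex character class. Unlike the non-ASCII test, the accented word is not flagged here.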
