Search for unicode values in character string

老子叫甜甜 submitted on 2020-05-15 04:49:17

Question


I am trying to identify unique Unicode values in a data frame composed of character strings. I have tried using the grep function, but I encounter the following error:

Error: '\U' used without hex digits in character string starting ""\U"

An example data frame:

                     time sender                                                    message
1     2012-12-04 13:40:00      1                                            Hello handsome!
2     2012-12-04 13:40:08      1                                                 \U0001f618
3     2012-12-04 14:39:24      1                                                 \U0001f603
4     2012-12-04 16:04:25      2                                            <image omitted>
73    2012-12-05 06:02:17      1 Haha not white and blue... White with blue eyes \U0001f61c
40619 2015-05-08 10:00:58      1                                       \U0001f631\U0001f637

grep("\U", dat$messages)

data

dat <- 
structure(list(time = c("2012-12-04 13:40:00", "2012-12-04 13:40:08", 
"2012-12-04 14:39:24", "2012-12-04 16:04:25", "2012-12-05 06:02:17", 
"2015-05-08 10:00:58"), sender = c(1L, 1L, 1L, 2L, 1L, 1L), message = c("Hello handsome!", 
"\U0001f618", "\U0001f603", "<image omitted>", "Haha not white and blue... White with blue eyes \U0001f61c", 
"\U0001f631\U0001f637")), .Names = c("time", "sender", "message"
), class = "data.frame", row.names = c("1", "2", "3", "4", "73", 
"40619"))

Answer 1:


I'm assuming by "unicode character" you just mean non-ASCII characters. Character codes can mean different things depending on the encoding. R represents values outside of the current encoding with a special \U escape sequence. Note that neither the backslash nor the letter "U" actually appears in the real data. This is just how the characters are escaped when printed on screen and the appropriate glyph isn't available.

For example, even though the last message looks long when printed, it is actually only two characters long:

dat$message[6]
# [1] "\U0001f631\U0001f637"
nchar(dat$message[6])
# [1] 2
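As a small sketch of the same point, the code points behind the escapes can be inspected with base R's utf8ToInt(), and a fixed-string search confirms that no literal backslash-U is stored in the data. The variable names here are illustrative, and the example assumes an R session that handles UTF-8 strings.

```r
# Illustrative value copied from the last message in the example data
x <- "\U0001f631\U0001f637"

# Each emoji is one character, stored as a single Unicode code point
code_points <- utf8ToInt(x)
code_points
# [1] 128561 128567
sprintf("U+%04X", code_points)
# [1] "U+1F631" "U+1F637"

# No literal backslash followed by "U" exists in the data
has_literal_escape <- grepl("\\U", x, fixed = TRUE)
has_literal_escape
# [1] FALSE
```

This is also why grep("\U", ...) fails before it even runs: in R source code, "\U" is an incomplete escape sequence, so the parser rejects the string.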

You can find non-ASCII characters using regular expressions pretty easily. ASCII characters all have codes 0-127 (or 000 to 177 in octal). The range in the pattern starts at \001 because an R string cannot contain the nul character. You can find values outside that range with

grep("[^\001-\177]", dat$message)
# [1] 2 3 5 6
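To go from matching rows to the distinct Unicode values the question asks about, one possible approach is to convert the matching messages to code points and keep the unique ones above the ASCII range. This is a sketch rather than part of the original answer; msgs is a hypothetical copy of dat$message, and it assumes the strings are valid UTF-8.

```r
# Hypothetical copy of the message column from the example data
msgs <- c("Hello handsome!", "\U0001f618", "\U0001f603", "<image omitted>",
          "Haha not white and blue... White with blue eyes \U0001f61c",
          "\U0001f631\U0001f637")

# Rows containing at least one non-ASCII character (same pattern as above)
non_ascii_rows <- grep("[^\001-\177]", msgs)

# Code points of every character in those rows
code_points <- unlist(lapply(msgs[non_ascii_rows], utf8ToInt))

# Keep the distinct code points above the ASCII range (0-127)
unique_cp <- unique(code_points[code_points > 127L])
labels <- sprintf("U+%04X", unique_cp)
labels
# [1] "U+1F618" "U+1F603" "U+1F61C" "U+1F631" "U+1F637"
```

intToUtf8(unique_cp, multiple = TRUE) converts the code points back to one-character strings if the characters themselves are needed.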



Answer 2:


Try:

library(stringi)
stri_enc_isascii(dat$message)

Which gives:

# [1]  TRUE FALSE FALSE  TRUE FALSE FALSE
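If row indices are wanted instead of a logical vector, the result can be negated and passed to which(). This is a minimal sketch assuming the stringi package is installed; msgs is a hypothetical copy of dat$message.

```r
library(stringi)

# Hypothetical copy of the message column from the example data
msgs <- c("Hello handsome!", "\U0001f618", "\U0001f603", "<image omitted>",
          "Haha not white and blue... White with blue eyes \U0001f61c",
          "\U0001f631\U0001f637")

# TRUE where a message contains only ASCII; negate to flag non-ASCII rows
non_ascii_rows <- which(!stri_enc_isascii(msgs))
non_ascii_rows
# [1] 2 3 5 6
```

The same indices can then be used to subset the data frame, for example dat[non_ascii_rows, ].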


Source: https://stackoverflow.com/questions/30794201/search-for-unicode-values-in-character-string
