Search for unicode values in character string

Question

I am trying to identify the unique Unicode values in a data frame composed of character strings. I have tried using the grep function, but I get the following error:

Error: '\U' used without hex digits in character string starting ""\U"

An example data frame:

                     time sender                                                    message
1     2012-12-04 13:40:00      1                                            Hello handsome!
2     2012-12-04 13:40:08      1                                                 \U0001f618
3     2012-12-04 14:39:24      1                                                 \U0001f603
4     2012-12-04 16:04:25      2                                            <image omitted>
73    2012-12-05 06:02:17      1 Haha not white and blue... White with blue eyes \U0001f61c
40619 2015-05-08 10:00:58      1                                       \U0001f631\U0001f637

grep("\U", dat$messages)

Data:

dat <- 
structure(list(time = c("2012-12-04 13:40:00", "2012-12-04 13:40:08", 
"2012-12-04 14:39:24", "2012-12-04 16:04:25", "2012-12-05 06:02:17", 
"2015-05-08 10:00:58"), sender = c(1L, 1L, 1L, 2L, 1L, 1L), message = c("Hello handsome!", 
"\U0001f618", "\U0001f603", "<image omitted>", "Haha not white and blue... White with blue eyes \U0001f61c", 
"\U0001f631\U0001f637")), .Names = c("time", "sender", "message"
), class = "data.frame", row.names = c("1", "2", "3", "4", "73", 
"40619"))

Answer 1:

I'm assuming that by "unicode character" you just mean non-ASCII characters. Character codes can mean different things depending on the encoding. R prints characters that it cannot display in the current encoding as a special \U escape sequence. Note that neither the backslash nor the letter "U" actually appears in the real data. This is just how the characters are escaped when printed on screen because the appropriate glyph isn't available.

For example, even though the last message looks long, it is actually only two characters long:

dat$message[6]
# [1] "\U0001f631\U0001f637"
nchar(dat$message[6])
# [1] 2
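
To make the same point another way, here is a small sketch using base R's utf8ToInt(), which returns the integer code points behind a string. The variable names x and codes are only for illustration; the string is the same value as dat$message[6].

```r
# Same two-emoji string as dat$message[6]
x <- "\U0001f631\U0001f637"

# Integer code points of the two characters
codes <- utf8ToInt(x)

# Format them in the usual U+XXXX notation
sprintf("U+%04X", codes)

# The stored string contains no literal letter "U", so this is FALSE
has_literal_U <- grepl("U", x, fixed = TRUE)
```

This is why searching for a literal "\U" cannot work: the escape exists only in the printed representation, not in the stored characters.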

You can find non-ASCII characters using regular expressions pretty easily. ASCII characters all have codes 0-127 (or 000 to 177 in octal). The range below starts at \001 because an R string cannot contain the nul character. You can find values outside that range with

grep("[^\001-\177]", dat$message)
# [1] 2 3 5 6
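
If the goal is to list the unique non-ASCII characters themselves rather than the rows that contain them, the same pattern can be combined with gregexpr() and regmatches(). The following is only a sketch under that assumption; msgs, hits, chars and codes are illustrative names, and msgs repeats the message column from the question so the snippet runs on its own.

```r
# Messages copied from the example data in the question
msgs <- c("Hello handsome!", "\U0001f618", "\U0001f603", "<image omitted>",
          "Haha not white and blue... White with blue eyes \U0001f61c",
          "\U0001f631\U0001f637")

# Match every single non-ASCII character in each message
hits <- regmatches(msgs, gregexpr("[^\001-\177]", msgs))

# Flatten the per-message matches and keep each character once
chars <- unique(unlist(hits))

# Convert each character to its code point in U+XXXX notation
codes <- sprintf("U+%04X", vapply(chars, utf8ToInt, integer(1), USE.NAMES = FALSE))
codes
```

For the example data this should yield five distinct code points, one per emoji that appears in the messages.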

Answer 2:

Try:

library(stringi)
stri_enc_isascii(dat$message)

Which gives:

# [1]  TRUE FALSE FALSE  TRUE FALSE FALSE

Source: https://stackoverflow.com/questions/30794201/search-for-unicode-values-in-character-string

Tags

unicode

grep

gsub