What constitutes one character for regcomp? Which multibyte encoding determines this?

血红的双手。 Submitted on 2019-12-25 07:51:02

Question


regcomp (from glibc) is a POSIX function for compiling regular expressions.

     int regcomp(regex_t *restrict preg, const char *restrict pattern,
     int cflags);

Some constructs in regular expressions depend on the notion of a single character, for example the bracket expression [abc].

If a multibyte encoding is in use and the expression contains a multibyte letter, the result differs depending on whether the pattern is treated as a sequence of bytes or as a sequence of multibyte characters.

Here I illustrate the idea with grep (which need not behave the same way as the C function regcomp in this respect):

$ { echo Г; echo Д; } | egrep '[Д]'
Д
$ { echo Г; echo Д; } | LANG=C egrep '[Д]'
Г
Д
$ 
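The extra match in the LANG=C run is what a byte-wise reading of the bracket expression would produce. In UTF-8, Г (U+0413) is the bytes D0 93 and Д (U+0414) is D0 94, so they share the lead byte D0. The sketch below is only an illustration of that reasoning, not of how grep or regcomp is implemented; the helper names bytewise_bracket_match and print_bytes are made up for this example.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if any single byte of `set` occurs in `text`. This mimics how a
   byte-oriented matcher would evaluate the bracket expression [set]. */
int bytewise_bracket_match(const char *set, const char *text)
{
    /* strcspn gives the length of the prefix of text that contains no byte
       from set; if that prefix stops before the terminator, a byte matched. */
    return text[strcspn(text, set)] != '\0';
}

/* Prints the bytes of a string in hex, to make the shared lead byte visible. */
void print_bytes(const char *label, const char *s)
{
    printf("%s:", label);
    for (const unsigned char *p = (const unsigned char *)s; *p; ++p)
        printf(" %02X", *p);
    putchar('\n');
}
```

With the UTF-8 bytes written out explicitly, bytewise_bracket_match("\xD0\x94", "\xD0\x93") returns 1, because the byte D0 of Д also occurs in Г. That agrees with both lines being printed when the locale is C.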

LANG supplies the default for any specific locale variable that is not set, so the question is: which of those variables affects regcomp's idea of the encoding?

$ locale
LANG=ru_RU.utf8
LC_CTYPE="ru_RU.utf8"
LC_NUMERIC="ru_RU.utf8"
LC_TIME="ru_RU.utf8"
LC_COLLATE="ru_RU.utf8"
LC_MONETARY="ru_RU.utf8"
LC_MESSAGES=POSIX
LC_PAPER="ru_RU.utf8"
LC_NAME="ru_RU.utf8"
LC_ADDRESS="ru_RU.utf8"
LC_TELEPHONE="ru_RU.utf8"
LC_MEASUREMENT="ru_RU.utf8"
LC_IDENTIFICATION="ru_RU.utf8"
LC_ALL=
$ 
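For reference, a C program can ask the C library which encoding it associates with a given LC_CTYPE setting. This is a minimal sketch using setlocale, nl_langinfo(CODESET) and MB_CUR_MAX; the function names ctype_codeset and max_bytes_per_char are hypothetical. It only shows what the library reports for LC_CTYPE and does not by itself prove which category regcomp consults.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>

/* Selects `name` for the LC_CTYPE category only. An empty string means
   "take it from the environment", where LC_ALL overrides LC_CTYPE, which
   in turn overrides LANG. Returns the codeset name reported by the C
   library, or NULL if the requested locale is not available. */
const char *ctype_codeset(const char *name)
{
    if (setlocale(LC_CTYPE, name) == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}

/* Maximum number of bytes per character in the current LC_CTYPE locale:
   1 for single-byte encodings, larger for multibyte ones such as UTF-8. */
size_t max_bytes_per_char(void)
{
    return MB_CUR_MAX;
}
```

Calling ctype_codeset("") in the environment shown above would be expected to report a UTF-8 codeset, provided the ru_RU.utf8 locale is installed.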

Answer 1:


As for grep (which need not behave the same way as regcomp), it seems to honor LC_CTYPE for this decision:

$ { echo Г; echo Д; } | LANG=en_US.utf8 egrep '[Д]'
Д
$ { echo Г; echo Д; } | LANG=en_US.utf8 LC_COLLATE=C egrep '[Д]'
Д
$ { echo Г; echo Д; } | LANG=en_US.utf8 LC_CTYPE=C egrep '[Д]'
Г
Д
$ 
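To check whether regcomp itself behaves the same way, the grep experiment can be repeated in C by changing only LC_CTYPE before compiling the pattern. The following is a sketch under the assumption that the locale has to be set with setlocale before regcomp is called; match_with_ctype is a hypothetical helper name, not part of any library.

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if `text` matches `pattern`, 0 if it does not, and -1 if the
   locale is unavailable or the pattern fails to compile.
   `ctype_locale` is applied to LC_CTYPE only, before regcomp is called. */
int match_with_ctype(const char *ctype_locale, const char *pattern,
                     const char *text)
{
    regex_t re;
    int rc;

    if (setlocale(LC_CTYPE, ctype_locale) == NULL)
        return -1;
    /* REG_NOSUB: only a yes/no answer is needed, no match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

If regcomp follows LC_CTYPE the way grep appears to, match_with_ctype("C", "[Д]", "Г") should report a match (the bytes are compared individually), while the same call with a UTF-8 locale such as "ru_RU.utf8" should not. The second case depends on that locale being installed, and on the source file being saved as UTF-8 if the letters are typed literally.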


Source: https://stackoverflow.com/questions/40809460/what-does-constitute-one-character-for-regcomp-which-multibyte-encoding-does-de
