Flex (lexer) support for Unicode

长情又很酷 2020-12-09 03:15

Does the newest version of flex support Unicode?

If so, how can I write patterns that match Chinese characters?

More: Use regular expression to match

3 answers
  •  心在旅途
    2020-12-09 03:33

    At the moment, flex only generates 8-bit scanners, which in practice limits you to UTF-8 input. So if you have a pattern:

    肖晗   { printf ("xiaohan\n"); }
    

    it will work as expected, because the sequence of bytes in the pattern is the same as the sequence in the input. Character classes are more difficult. If you want to match either the character 肖 or 晗, you can't write:

    [肖晗]   { printf ("xiaohan/2\n"); }
    

    because a character class in an 8-bit scanner is a set of bytes, so this matches any single one of the six bytes 0xe8, 0x82, 0x96, 0xe6, 0x99 and 0x97. In practice, if you supply 肖晗 as the input, the pattern matches six times, once per byte. So in this simple case, you have to rewrite the pattern as an alternation: (肖|晗).
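
    The byte-level behaviour can be reproduced outside flex. The following is a minimal sketch that uses Python's re module on bytes as a stand-in for an 8-bit scanner; it is not flex itself, and the helper names (byte_class, alternation, count_matches) are made up for illustration.

```python
import re


def byte_class(chars: str) -> bytes:
    """Build the byte-level class an 8-bit scanner effectively sees for [chars]."""
    return b"[" + re.escape(chars.encode("utf-8")) + b"]"


def alternation(chars: str) -> bytes:
    """Build (c1|c2|...) where each alternative is one character's full UTF-8 sequence."""
    return b"(?:" + b"|".join(re.escape(c.encode("utf-8")) for c in chars) + b")"


def count_matches(pattern: bytes, text: str) -> int:
    """Count non-overlapping matches of a byte pattern in the UTF-8 bytes of text."""
    return len(re.findall(pattern, text.encode("utf-8")))


if __name__ == "__main__":
    text = "肖晗"
    # The six bytes listed in the answer.
    print([f"0x{b:02x}" for b in text.encode("utf-8")])
    # The class matches once per byte; the alternation matches once per character.
    print(count_matches(byte_class(text), text))
    print(count_matches(alternation(text), text))
    # U+8097 encodes as e8 82 97, so the byte class also "matches" this
    # unrelated character, while the alternation does not.
    print(count_matches(byte_class(text), "\u8097"))
    print(count_matches(alternation(text), "\u8097"))
```

    The last two calls show a second problem with the byte class: it also accepts the bytes of characters that were never listed, because those characters happen to share bytes with 肖 and 晗.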

    For ranges, Hans Aberg has written a tool in Haskell that transforms a range of Unicode code points into an equivalent 8-bit pattern:

    Unicode> urToRegU8 0 0xFFFF
    [\0-\x7F]|[\xC2-\xDF][\x80-\xBF]|(\xE0[\xA0-\xBF]|[\xE1-\xEF][\x80-\xBF])[\x80-\xBF]
    Unicode> urToRegU32 0x00010000 0x001FFFFF
    \0[\x01-\x1F][\0-\xFF][\0-\xFF]
    Unicode> urToRegU32L 0x00010000 0x001FFFFF
    [\x01-\x1F][\0-\xFF][\0-\xFF]\0
    

    This isn't pretty, but it should work.
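
    For readers without the Haskell tool, the same idea for the UTF-8 case can be sketched in Python. This is not Hans Aberg's code: it is an independent sketch of the usual range-splitting approach, the names utf8_sequences and utf8_range_to_regex are hypothetical, and it covers only UTF-8 (code points up to 0x10FFFF), not the UTF-32 variants shown above. Like the tool's output above, it does not exclude the surrogate range.

```python
def utf8_sequences(lo, hi):
    """Yield lists of (low, high) byte ranges that together cover the
    UTF-8 encodings of the code points lo..hi (inclusive)."""
    # Split where the encoded length changes (1, 2, 3, 4 bytes).
    for last in (0x7F, 0x7FF, 0xFFFF):
        if lo <= last < hi:
            yield from utf8_sequences(lo, last)
            yield from utf8_sequences(last + 1, hi)
            return
    if hi <= 0x7F:
        yield [(lo, hi)]
        return
    # Split until every continuation byte after the first differing
    # byte spans the full range 0x80..0xBF.
    for i in (1, 2, 3):
        mask = (1 << (6 * i)) - 1
        if (lo & ~mask) != (hi & ~mask):
            if lo & mask:
                yield from utf8_sequences(lo, lo | mask)
                yield from utf8_sequences((lo | mask) + 1, hi)
                return
            if (hi & mask) != mask:
                yield from utf8_sequences(lo, (hi & ~mask) - 1)
                yield from utf8_sequences(hi & ~mask, hi)
                return
    # "surrogatepass" lets U+D800..U+DFFF be encoded instead of raising.
    first = chr(lo).encode("utf-8", "surrogatepass")
    last = chr(hi).encode("utf-8", "surrogatepass")
    yield list(zip(first, last))


def utf8_range_to_regex(lo, hi):
    """Return a byte-level pattern, written with \\xHH escapes, that matches
    the UTF-8 encoding of any code point in lo..hi."""
    if not 0 <= lo <= hi <= 0x10FFFF:
        raise ValueError("expected 0 <= lo <= hi <= 0x10FFFF")

    def unit(a, b):
        return f"\\x{a:02X}" if a == b else f"[\\x{a:02X}-\\x{b:02X}]"

    return "|".join(
        "".join(unit(a, b) for a, b in seq) for seq in utf8_sequences(lo, hi)
    )


if __name__ == "__main__":
    # CJK Unified Ideographs block, U+4E00..U+9FFF:
    # prints \xE4[\xB8-\xBF][\x80-\xBF]|[\xE5-\xE9][\x80-\xBF][\x80-\xBF]
    print(utf8_range_to_regex(0x4E00, 0x9FFF))
```

    The result is a plain alternation, so wrap it in parentheses when using it inside a larger flex pattern. For 0 to 0xFFFF the sketch produces the same language as the urToRegU8 output above, only without the common suffix factored out.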
