Flex (lexer) support for Unicode

长情又很酷 2020-12-09 03:15

I am wondering whether the newest version of flex supports Unicode.

If so, how can I use patterns to match Chinese characters?

More: Use regular expression to match

3 answers

心在旅途 (original poster)

2020-12-09 03:33
At the moment, flex generates only 8-bit scanners, which basically limits you to UTF-8. So if you have a pattern:
```
肖晗   { printf ("xiaohan\n"); }
```
it will work as expected, because the sequence of bytes in the pattern and in the input will be the same. Character classes are more difficult. If you want to match either the character 肖 or 晗, you can't write:
```
[肖晗]   { printf ("xiaohan/2\n"); }
```
because the class matches any one of the six bytes 0xe8, 0x82, 0x96, 0xe6, 0x99 and 0x97 individually. In practice, if you supply 肖晗 as the input, the pattern will match six times, once per byte. So in this simple case, you have to rewrite the pattern as (肖|晗).
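To see where those six bytes come from, here is a minimal C sketch of UTF-8 encoding. The function name `utf8_encode` is made up for this illustration and is not part of flex; it only shows how the code points U+8096 (肖) and U+6657 (晗) each become three bytes, which is what an 8-bit scanner actually sees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes up to 4 bytes into out and returns the number of bytes written,
 * or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes: 11110xxx then 3 x 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling `utf8_encode(0x8096, buf)` fills `buf` with 0xE8 0x82 0x96, and `utf8_encode(0x6657, buf)` gives 0xE6 0x99 0x97. A bracket expression in an 8-bit scanner is a set of single bytes, so `[肖晗]` ends up as the set of those six byte values rather than a set of two characters.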

For ranges, Hans Aberg has written a tool in Haskell that transforms these into 8-bit patterns:
```
Unicode> urToRegU8 0 0xFFFF
[\0-\x7F]|[\xC2-\xDF][\x80-\xBF]|(\xE0[\xA0-\xBF]|[\xE1-\xEF][\x80-\xBF])[\x80-\xBF]
Unicode> urToRegU32 0x00010000 0x001FFFFF
\0[\x01-\x1F][\0-\xFF][\0-\xFF]
Unicode> urToRegU32L 0x00010000 0x001FFFFF
[\x01-\x1F][\0-\xFF][\0-\xFF]\0
```
This isn't pretty, but it should work.
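To make the first generated pattern easier to read, here is a hand-written C sketch of what it accepts. This is not output of the Haskell tool and not code that flex generates; `match_bmp_utf8` and `in_range` are names invented for this illustration. It mirrors the three alternatives of the `urToRegU8 0 0xFFFF` pattern byte for byte.

```c
#include <assert.h>
#include <stddef.h>

/* True if b lies in the inclusive byte range [lo, hi]. */
static int in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

/* Hypothetical matcher mirroring the pattern
 *   [\0-\x7F]
 *   | [\xC2-\xDF][\x80-\xBF]
 *   | (\xE0[\xA0-\xBF] | [\xE1-\xEF][\x80-\xBF]) [\x80-\xBF]
 * Returns the length in bytes (1 to 3) of the sequence matched at the
 * start of s, or 0 if the first n bytes do not start with a match.
 * Like the pattern itself, it does not exclude encoded surrogates. */
size_t match_bmp_utf8(const unsigned char *s, size_t n)
{
    assert(s != NULL);
    /* Alternative 1: a single ASCII byte. */
    if (n >= 1 && s[0] <= 0x7F)
        return 1;
    /* Alternative 2: two-byte sequence, lead byte 0xC2..0xDF. */
    if (n >= 2 && in_range(s[0], 0xC2, 0xDF) && in_range(s[1], 0x80, 0xBF))
        return 2;
    /* Alternative 3: three-byte sequence; 0xE0 needs a second byte of
     * at least 0xA0 so that overlong encodings are rejected. */
    if (n >= 3 && in_range(s[2], 0x80, 0xBF)) {
        if (s[0] == 0xE0 && in_range(s[1], 0xA0, 0xBF))
            return 3;
        if (in_range(s[0], 0xE1, 0xEF) && in_range(s[1], 0x80, 0xBF))
            return 3;
    }
    return 0;
}
```

For the bytes 0xE8 0x82 0x96 (肖) the function returns 3, so the whole character is consumed as one match. A four-byte sequence such as 0xF0 0x9F 0x98 0x80 returns 0, because code points above U+FFFF are outside the range the pattern was generated for.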