Flex (lexer) support for Unicode

长情又很酷 2020-12-09 03:15

I am wondering whether the newest version of Flex supports Unicode.

If so, how can I use patterns to match Chinese characters?

More specifically, I would like to match them with a regular expression.

3 answers
  •  情歌与酒
    2020-12-09 03:46

    I am wondering whether the newest version of Flex supports Unicode.

    If so, how can I use patterns to match Chinese characters?

    To match patterns with Chinese characters and other Unicode code points with a Flex-compatible lexical analyzer, you could use the RE/flex lexical analyzer for C++.

    RE/flex safely supports the full Unicode 12 standard and accepts UTF-8, UTF-16, and UTF-32 input files. It does not need the hand-written UTF-8 hacks that are used with plain Flex, which cannot handle UTF-16/32 input at all.

    Also, UTF-8 hacks with Flex don't allow you to write Unicode regular expressions such as [肖晗] that are fully supported in RE/flex.
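
    To see why a bracket expression such as [肖晗] is a problem for a byte-oriented scanner, here is a small C++ sketch. It is not Flex or RE/flex code; the helper names utf8_encode and byte_class_matches are made up for illustration. It encodes the two characters (U+8096 and U+6657) as UTF-8 and shows that a class built from those bytes matches single bytes rather than whole characters, which is my understanding of what an 8-bit scanner does with such a pattern.

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <string>

    // Encode one Unicode code point as UTF-8.
    // Sketch only: surrogates are not rejected.
    std::string utf8_encode(std::uint32_t cp) {
        assert(cp <= 0x10FFFF);
        std::string out;
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        return out;
    }

    // Hypothetical model of a byte-oriented bracket expression:
    // it tests ONE input byte against the bytes written between the brackets.
    bool byte_class_matches(const std::string& class_bytes, unsigned char b) {
        return class_bytes.find(static_cast<char>(b)) != std::string::npos;
    }

    // Usage example: [肖晗] seen as bytes is a six-member class.
    void example_byte_class() {
        const std::string cls = utf8_encode(0x8096) + utf8_encode(0x6657);
        assert(cls.size() == 6);                // two characters, six bytes
        assert(byte_class_matches(cls, 0xE8));  // lead byte of 肖 matches alone
        assert(byte_class_matches(cls, 0x97));  // so does a lone trailing byte of 晗
    }
    ```

    The usual workaround with plain Flex is to spell each character out as a byte sequence instead of using a character class, which is the kind of hack the answer refers to.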

    It works seamlessly with Bison to build lexers and parsers.

    In fact, with RE/flex we can write any Unicode pattern as a UTF-8-based regular expression in a lexer .l specification, such as:

    %option flex unicode
    %%
    [肖晗]   { printf ("xiaohan/2\n"); }
    %%
    

    This generates a lexer that scans UTF-8, UTF-16, and UTF-32 files automatically. Following the UTF conventions, UTF-16/32 input is expected to start with a UTF BOM, while a UTF-8 BOM is optional.
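
    As an illustration of the BOM rule, the following C++ sketch classifies an input buffer by its leading bytes. It is not RE/flex's actual implementation; the Encoding enum and detect_bom function are hypothetical names. It relies only on the standard BOM byte sequences and treats input without a BOM as UTF-8, matching the "optional" case above.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <string>

    enum class Encoding { Utf8NoBom, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

    // Classify a buffer by its byte order mark (hypothetical helper).
    Encoding detect_bom(const std::string& data) {
        auto has_prefix = [&](const char* bom, std::size_t n) {
            return data.size() >= n && data.compare(0, n, bom, n) == 0;
        };
        // The UTF-32 checks come first: the UTF-32 LE BOM (FF FE 00 00)
        // begins with the UTF-16 LE BOM (FF FE).
        if (has_prefix("\x00\x00\xFE\xFF", 4)) return Encoding::Utf32BE;
        if (has_prefix("\xFF\xFE\x00\x00", 4)) return Encoding::Utf32LE;
        if (has_prefix("\xFE\xFF", 2))         return Encoding::Utf16BE;
        if (has_prefix("\xFF\xFE", 2))         return Encoding::Utf16LE;
        if (has_prefix("\xEF\xBB\xBF", 3))     return Encoding::Utf8;
        return Encoding::Utf8NoBom;  // no BOM: assume UTF-8
    }

    // Usage example.
    void example_detect_bom() {
        assert(detect_bom(std::string("\xEF\xBB\xBF" "hi")) == Encoding::Utf8);
        assert(detect_bom(std::string("\xFF\xFE\x41\x00", 4)) == Encoding::Utf16LE);
        assert(detect_bom("plain text") == Encoding::Utf8NoBom);
    }
    ```

    One known ambiguity of this approach: a UTF-16 LE file whose first character is U+0000 is indistinguishable from UTF-32 LE, so the sketch reports it as UTF-32 LE.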

    We can use the global %option unicode to enable Unicode matching and %option flex to accept Flex-compatible specifications. A local modifier (?u:) can be used to restrict Unicode matching to a single pattern (so everything else is still ASCII/8-bit as in Flex):

    %option flex
    %%
    (?u:[肖晗])   { printf ("xiaohan/2\n"); }
    (?u:\p{Han})  { printf ("Han character %s\n", yytext); }
    .             { printf ("8-bit character %d\n", (unsigned char)yytext[0]); }
    %%
    

    Option flex enables Flex compatibility, so you can use yytext, yyleng, ECHO, and so on. Without the flex option RE/flex expects Lexer method calls: text() (or str() and wstr() for std::string and std::wstring), size() (or wsize() for wide char length), and echo(). RE/flex method calls are cleaner IMHO, and include wide char operations.
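
    The difference between a byte length and a wide char length can be illustrated with a short C++ sketch. This is not RE/flex code, and utf8_code_points is a hypothetical helper. It assumes that, for characters like the ones in this answer, the wide char length equals the number of code points, and it counts them by skipping UTF-8 continuation bytes.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <string>

    // Count code points in a UTF-8 string by counting every byte that is
    // NOT a continuation byte (continuation bytes look like 10xxxxxx).
    // Sketch only: the input is assumed to be valid UTF-8.
    std::size_t utf8_code_points(const std::string& s) {
        std::size_t n = 0;
        for (unsigned char b : s) {
            if ((b & 0xC0) != 0x80) ++n;
        }
        return n;
    }

    // Usage example: 肖晗 is six bytes of UTF-8 but two characters.
    void example_lengths() {
        const std::string text = "\xE8\x82\x96\xE6\x99\x97";  // 肖晗 in UTF-8
        assert(text.size() == 6);             // length in bytes
        assert(utf8_code_points(text) == 2);  // length in code points
    }
    ```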
