Searching for non-ASCII characters

难免孤独 2021-01-24 10:24

I have a file, a.out, which contains a number of lines. Each line is one character only: either the Unicode character U+2013 or a lowercase letter a-z.

3 answers

无人共我 (original poster)

2021-01-24 10:58

A comment on "How Do I grep For all non-ASCII Characters in UNIX" gives the answer:

Grep (and family) don't do Unicode processing to merge multi-byte characters into a single entity for regex matching as you seem to want.

That implies that grep does not treat the UTF-8 encoding of U+2013 (0xe2, 0x80, 0x93) as a single printable character outside the given range; it sees three separate bytes.
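To make the byte sequence concrete, here is a minimal Python sketch (Python is only used for illustration; the question itself is about grep). The variable names are my own and not part of the original answer.

```python
# U+2013 (EN DASH) is one character, but UTF-8 encodes it as three bytes.
en_dash = "\u2013"
encoded = en_dash.encode("utf-8")

# Show the bytes in hex, separated by spaces.
hex_bytes = " ".join(f"{b:02x}" for b in encoded)
print(hex_bytes)                   # e2 80 93
print(len(en_dash), len(encoded))  # 1 3

# Every byte in the sequence lies outside the 7-bit ASCII range,
# so a byte-oriented tool sees three separate non-ASCII bytes.
all_non_ascii = all(b > 0x7F for b in encoded)
print(all_non_ascii)               # True
```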

The GNU grep manual's description of -P does not mention Unicode or UTF-8. Rather, it says "Interpret the pattern as a Perl regular expression" (this does not mean that the result is identical to Perl, only that some of the backslash escapes are similar).

Perl itself can be told to use UTF-8 encoding. However, the examples using Perl in "Filtering invalid utf8" do not use that feature. Instead, the expressions (like those in the problematic grep) test only the individual bytes, not the complete character.
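The difference between testing bytes and testing complete characters can be sketched in Python, which keeps the two views separate (`bytes` versus `str`). This is an analogy under my own assumptions, not a reproduction of what grep or Perl do internally; the helper names `non_ascii_bytes` and `non_ascii_chars` are hypothetical.

```python
import re

def non_ascii_bytes(data: bytes) -> list:
    """Byte-level test: each byte above 0x7f is matched on its own."""
    return re.findall(rb"[^\x00-\x7f]", data)

def non_ascii_chars(data: bytes) -> list:
    """Character-level test: decode as UTF-8 first, then match code points."""
    return re.findall(r"[^\x00-\x7f]", data.decode("utf-8"))

# Sample input shaped like the file in the question: one character per line.
sample = "a\n\u2013\nb\n".encode("utf-8")

byte_hits = non_ascii_bytes(sample)
char_hits = non_ascii_chars(sample)

print(len(byte_hits))  # 3: the bytes e2, 80 and 93 are matched separately
print(len(char_hits))  # 1: the single character U+2013
```

Decoding before matching is what turns the three bytes into one entity; a pattern applied to the raw bytes never sees U+2013 as a whole.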
