grep UTF-16 support

Question

I used TextEdit on macOS to create two files with the same contents but different encodings (UTF-16 and UTF-8). Then I ran:

grep xxx filename_UTF-16

(prints nothing)

grep xxx filename_UTF-8

xxxxxxx xxxxxxyyyyyy

Does grep not support UTF-16?

Answer 1:

iconv -f UTF-16 -t UTF-8 yourfile | grep xxx

Answer 2:

You could always convert the file to UTF-8 first:

iconv -f utf-16 -t utf-8 filename | grep xxxxx
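To see why the conversion step matters, here is a minimal Python sketch (not part of the original answers). It assumes the pattern is plain text rather than a regular expression, the sample data only stands in for the file from the question, and the helper name `grep_decoded` is hypothetical. It shows that the bytes of the pattern do not occur in UTF-16 data, because each ASCII character is stored next to a zero byte, and that decoding first, roughly as iconv does, lets the match succeed.

```python
def grep_decoded(data: bytes, pattern: str, encoding: str = "utf-16") -> list:
    """Decode raw bytes with the given encoding, then return lines containing pattern."""
    text = data.decode(encoding)  # the "utf-16" codec uses the BOM to pick the byte order
    return [line for line in text.splitlines() if pattern in line]

# Illustrative sample standing in for the file from the question (an assumption).
sample = "xxxxxxx xxxxxxyyyyyy\nunrelated line\n"
utf16_bytes = sample.encode("utf-16")  # BOM followed by two bytes per character

# A byte-level search for the pattern finds nothing in the UTF-16 data...
raw_hit = b"xxx" in utf16_bytes
# ...but searching after decoding finds the line.
decoded_hits = grep_decoded(utf16_bytes, "xxx")

print(raw_hit)
print(decoded_hits)
```

The same function can be pointed at a real file by passing it the result of `open(path, "rb").read()`.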

Answer 3:

Use ripgrep instead of grep; it can search UTF-16 files. Install it with Homebrew: brew install ripgrep

Then run:

rg xxx filename_UTF-16

ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for automatically detecting UTF-16 is provided. Other text encodings must be specifically specified with the -E/--encoding flag.)
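The automatic detection mentioned above relies on the byte order mark at the start of the file. The following Python sketch illustrates the general idea only; it is not ripgrep's actual implementation, the function names `sniff_encoding` and `search` are hypothetical, and it assumes a plain-text pattern and a UTF-8 fallback when no BOM is found.

```python
import codecs

# BOMs are checked longest-first, because the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return a codec name based on a leading BOM, or the default if none is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default

def search(data: bytes, pattern: str) -> list:
    """Decode data using the sniffed encoding and return the lines containing pattern."""
    text = data.decode(sniff_encoding(data), errors="replace")
    return [line for line in text.splitlines() if pattern in line]

print(search("xxx yyy\nzzz\n".encode("utf-16"), "xxx"))
```

A file without a BOM falls through to the default, which is why other encodings have to be named explicitly, as with the -E/--encoding flag described above.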

Answer 4:

Define the following shell function, which uses Ruby:

grep16() { ruby -e "puts File.open('$2', mode:'rb:BOM|UTF-16LE').readlines.grep(Regexp.new '$1'.encode(Encoding::UTF_16LE))"; }

Then use it as:

grep16 xxx filename_UTF-16

See: How to use Ruby's readlines.grep for UTF-16 files?

For more suggestions, check: grepping binary files and UTF16

Answer 5:

You could also use ugrep, a drop-in replacement for grep that is backward compatible with GNU/BSD grep: it takes the same options as grep but offers many more features, such as:

ugrep searches UTF-encoded input when a UTF BOM (byte order mark) is present, and ASCII and UTF-8 when no UTF BOM is present. Option --encoding permits many other file formats to be searched, such as ISO-8859-1, EBCDIC, and code pages 437, 850, 858, 1250 to 1258.

ugrep matches Unicode patterns by default (disabled with option -U). The regular expression syntax is POSIX ERE compliant, extended with Unicode character classes, lazy quantifiers, and negative patterns to skip unwanted pattern matches to produce more precise results.

ugrep searches text files and binary files and produces hexdumps for binary matches.

Source: https://stackoverflow.com/questions/6882070/grep-unicode-16-support

Tags

Linux

unicode

utf-8

grep

utf-16