grepping binary files and UTF16

Submitted by 自作多情 on 2019-11-27 01:05:43

The easiest way is to just convert the text file to utf-8 and pipe that to grep:

iconv -f utf-16 -t utf-8 file.txt | grep query
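One drawback of this pipeline is that grep no longer sees the filename, which matters once you search more than one file. A possible workaround, sketched here assuming GNU grep (whose `--label` option names standard input); `file.txt` and the sample text are just for illustration:

```shell
# Make a small UTF-16 sample file (iconv prepends a BOM)
printf 'hello query world\n' | iconv -f utf-8 -t utf-16 > file.txt

# Convert on the fly, but keep the filename in grep's output:
# -H forces the filename prefix, --label supplies it for stdin
iconv -f utf-16 -t utf-8 file.txt | grep -H --label=file.txt query
```

With `-H` forcing the prefix, the match prints as `file.txt:hello query world`.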

I tried to do the opposite (convert my query to utf-16) but it seems as though grep doesn't like that. I think it might have to do with endianness, but I'm not sure.

It seems as though grep will convert a query that is utf-16 to utf-8/ascii. Here is what I tried:

grep `echo -n query | iconv -f utf-8 -t utf-16 | sed 's/..//'` test.txt

If test.txt is a utf-16 file this won't work, but it does work if test.txt is ascii. I can only conclude that grep is converting my query to ascii.

EDIT: Here's a really really crazy one that kind of works but doesn't give you very much useful info:

hexdump -e '/1 "%02x"' test.txt | grep -P `echo -n Test | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "%02x"'`

How does it work? Well, it converts your file to hex (without the extra formatting hexdump usually applies) and pipes that into grep. The query is constructed by echoing your search string (without a newline) into iconv, which converts it to utf-16. That is piped into sed to remove the BOM (the first two bytes of a utf-16 file, used to mark its endianness), and then into hexdump so that the query and the input are in the same form.

Unfortunately I think this will end up printing out the ENTIRE file if there is a single match. Also this won't work if the utf-16 in your binary file is stored in a different endianness than your machine.

EDIT2: Got it!!!!

grep -P `echo -n "Test" | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "x%02x"' | sed 's/x/\\\\x/g'` test.txt

This searches for the hex version of the string Test (in utf-16) in the file test.txt

You can explicitly include the nulls (00s) in the search string. The matching lines will still contain nulls, so you may want to redirect the output to a file so you can look at it with a reasonable editor, or pipe it through sed to strip them. To search for "bar" in *.utf16.txt:

grep -Pa "b\x00a\x00r" *.utf16.txt | sed 's/\x00//g'

The "-P" tells grep to accept Perl regexp syntax, in which \x00 matches a null byte, and the -a tells it to treat the file as text even though it looks like binary to grep.
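Typing the `\x00`s by hand gets tedious for longer queries, so they can be interleaved automatically. A sketch, assuming GNU grep's `-P`, GNU sed, and an ASCII-only query; `sample.txt` is an illustrative file:

```shell
# Build a UTF-16LE (no BOM) sample containing "bar"
printf 'foobar\n' | iconv -f utf-8 -t utf-16le > sample.txt

# Interleave a literal \x00 after every character: bar -> b\x00a\x00r
pat=$(printf '%s' 'bar' | sed 's/./&\\x00/g; s/\\x00$//')

# Search, then strip the nulls from the matching line for readability
grep -Pa "$pat" sample.txt | sed 's/\x00//g'
```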

I found the below solution worked best for me, from https://www.splitbits.com/2015/11/11/tip-grep-and-unicode/

Grep does not play well with Unicode, but it can be worked around. For example, to find,

Some Search Term

in a UTF-16 file, use a regular expression that skips over the extra (null) byte in each character,

S.o.m.e. .S.e.a.r.c.h. .T.e.r.m 

Also, tell grep to treat the file as text using '-a'. The final command looks like this,

grep -a 'S.o.m.e. .S.e.a.r.c.h. .T.e.r.m' utf-16-file.txt
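The dotted pattern can likewise be generated instead of typed by hand. A sketch, assuming an ASCII query and UTF-16LE data; `utf-16-file.txt` is an illustrative file, and `LC_ALL=C` makes `.` match the null bytes reliably:

```shell
# Make a UTF-16LE sample containing the search term
printf 'xx Some Search Term xx\n' | iconv -f utf-8 -t utf-16le > utf-16-file.txt

# Put a "." after every character except the last:
# "Some Search Term" -> "S.o.m.e. .S.e.a.r.c.h. .T.e.r.m"
dotted=$(printf '%s' 'Some Search Term' | sed 's/./&./g; s/\.$//')

# Count the matching lines
LC_ALL=C grep -ac "$dotted" utf-16-file.txt
```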

I use this one all the time after dumping the Windows registry, since its output is UTF-16. This is running under Cygwin.

$ regedit /e registry.data.out
$ file registry.data.out
registry.data.out: Little-endian **UTF-16 Unicode text**, with CRLF line terminators

$ sed 's/\x00//g' registry.data.out | egrep "192\.168"
"Port"="192.168.1.5"
"IPSubnetAddress"="192.168.189.0"
"IPSubnetAddress"="192.168.102.0"
[HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\Print\Monitors\Standard TCP/IP Port\Ports\192.168.1.5]
"HostName"="192.168.1.5"
"Port"="192.168.1.5"
"LocationInformation"="http://192.168.1.28:1215/"
"LocationInformation"="http://192.168.1.5:80/WebServices/Device"
"LocationInformation"="http://192.168.1.5:80/WebServices/Device"
"StandaloneDhcpAddress"="192.168.173.1"
"ScopeAddressBackup"="192.168.137.1"
"ScopeAddress"="192.168.137.1"
"DhcpIPAddress"="192.168.1.24"
"DhcpServer"="192.168.1.1"
"0.0.0.0,0.0.0.0,192.168.1.1,-1"=""
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Print\Monitors\Standard TCP/IP Port\Ports\192.168.1.5]
"HostName"="192.168.1.5"
"Port"="192.168.1.5"
"LocationInformation"="http://192.168.1.28:1215/"
"LocationInformation"="http://192.168.1.5:80/WebServices/Device"
"LocationInformation"="http://192.168.1.5:80/WebServices/Device"
"StandaloneDhcpAddress"="192.168.173.1"
"ScopeAddressBackup"="192.168.137.1"
"ScopeAddress"="192.168.137.1"
"DhcpIPAddress"="192.168.1.24"
"DhcpServer"="192.168.1.1"
"0.0.0.0,0.0.0.0,192.168.1.1,-1"=""
"MRU0"="192.168.16.93"
[HKEY_USERS\S-1-5-21-2054485685-3446499333-1556621121-1001\Software\Microsoft\Terminal Server Client\Servers\192.168.16.93]
"A"="192.168.1.23"
"B"="192.168.1.28"
"C"="192.168.1.200:5800"
"192.168.254.190::5901/extra"=hex:02,00
"00"="192.168.254.190:5901"
"ImagePrinterPort"="192.168.1.5"
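The sed trick above doesn't actually depend on regedit or Cygwin; it works on any UTF-16LE file whose content is ASCII. A minimal reproduction with a synthetic file (GNU sed assumed for `\x00`; `registry.sample` is an illustrative name, and `grep -E` is the modern spelling of `egrep`):

```shell
# Synthesize a tiny UTF-16LE file standing in for the registry export
printf '"Port"="192.168.1.5"\n' | iconv -f utf-8 -t utf-16le > registry.sample

# Strip the null bytes, leaving plain ASCII, then search as usual
sed 's/\x00//g' registry.sample | grep -E '192\.168'
```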

I needed to do this recursively, and here's what I came up with:

find -type f | while read l; do iconv -s -f utf-16le -t utf-8 "$l" | nl -s "$l: " | cut -c7- | grep 'somestring'; done

This is absolutely horrible and very slow; I'm certain there's a better way and I hope someone can improve on it -- but I was in a hurry :P

What the pieces do:

find -type f

gives a recursive list of filenames with paths relative to current

while read l; do ... done

Bash loop; for each line of the list of file paths, put the path into $l and do the thing in the loop. (Why I used a shell loop instead of xargs, which would've been much faster: I need to prefix each line of the output with the name of the current file. Couldn't think of a way to do that if I was feeding multiple files at once to iconv, and since I'm going to be doing one file at a time anyway, shell loop is easier syntax/escaping.)

iconv -s -f utf-16le -t utf-8 "$l"

Convert the file named in $l: assume the input file is utf-16 little-endian and convert it to utf-8. The -s makes iconv shut up about any conversion errors (there will be a lot, because some files in this directory structure are not utf-16). The output from this conversion goes to stdout.

nl -s "$l: " | cut -c7-

This is a hack: nl inserts line numbers, but it happens to have a "use this arbitrary string to separate the number from the line" parameter, so I put the filename (followed by colon and space) in that. Then I use cut to strip off the line number, leaving just the filename prefix. (Why I didn't use sed: escaping is much easier this way. With a sed expression I'd have to worry about regular-expression metacharacters in the filenames, of which, in my case, there were a lot. nl is much dumber than sed: it takes the -s parameter entirely literally, and the shell handles the escaping for me.)

So, by the end of this pipeline, I've converted a bunch of files into lines of utf-8, prefixed with the filename, which I then grep. If there are matches, I can tell which file they're in from the prefix.

Caveats

  • This is much, much slower than grep -R, because I'm spawning a new copy of iconv, nl, cut, and grep for every single file. It's horrible.
  • Everything that isn't utf-16le input will come out as complete garbage, so if there's a normal ASCII file that contains 'somestring', this command won't report it -- you need to do a normal grep -R as well as this command (and if you have multiple unicode encoding types, like some big-endian and some little-endian files, you need to adjust this command and run it again for each different encoding).
  • Files whose name happens to contain 'somestring' will show up in the output, even if their contents have no matches.
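As one possible improvement along the lines the author hoped for: GNU grep's `--label` option can replace the `nl`/`cut` hack entirely, and NUL-delimited filenames survive spaces. A sketch, assuming bash and GNU find/grep; `somestring` stands in for the real query:

```shell
# Recursive search of UTF-16LE files, prefixing each match with its filename
# (-s silences iconv's conversion errors on non-UTF-16 files)
find . -type f -print0 | while IFS= read -r -d '' f; do
    iconv -s -f utf-16le -t utf-8 "$f" 2>/dev/null \
        | grep -H --label="$f" 'somestring'
done
```

It is still one iconv per file, and the caveats about non-UTF-16LE files still apply.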

ripgrep

Use ripgrep utility to grep UTF-16 files.

ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for automatically detecting UTF-16 is provided. Other text encodings must be specifically specified with the -E/--encoding flag.)

Example syntax:

rg sometext file

To dump all lines, run: rg -N . file.

The sed statement is more than I can wrap my head around. I have a simplistic, far-from-perfect TCL script that I think does an OK job with my test point of one:

#!/usr/bin/tclsh

# First argument: the string to search for
set insearch [lindex $argv 0]

# Build a pattern with "." between characters, e.g. Test -> T.e.s.t,
# so the dots match the null bytes in UTF-16 text
set search ""
for {set i 0} {$i<[string length $insearch]-1} {incr i} {
    set search "${search}[string range $insearch $i $i]."
}
set search "${search}[string range $insearch $i $i]"

# Remaining arguments: the files to search; print matches with filenames
for {set i 1} {$i<$argc} {incr i} {
    set file [lindex $argv $i]
    if {! [catch {exec grep -a $search $file} results options]} {
        puts "$file: $results"
    }
}

I added this as a comment on the accepted answer above, but am reposting it here to make it easier to read. This allows you to search for text in a bunch of files while also displaying the filenames in which it finds the text. All of these files have a .reg extension, since I'm searching through exported Windows Registry files; just replace .reg with any file extension.

# Define grepreg in bash by pasting at a bash command prompt
grepreg ()
{
    find -name '*.reg' -exec echo {} \; -exec iconv -f utf-16 -t utf-8 {} \; | grep "$1\|\.reg"
}

# Sample usage
grepreg SampleTextToSearch

You can use the following Ruby one-liner:

ruby -e "puts File.open('file.txt', mode:'rb:BOM|UTF-16LE').readlines.grep(Regexp.new 'PATTERN'.encode(Encoding::UTF_16LE))"

For simplicity, this can be defined as a shell function:

grep-utf16() { ruby -e "puts File.open('$2', mode:'rb:BOM|UTF-16LE').readlines.grep(Regexp.new '$1'.encode(Encoding::UTF_16LE))"; }

It can then be used much like grep:

grep-utf16 PATTERN file.txt

Source: How to use Ruby's readlines.grep for UTF-16 files?
