Why don't two binaries of programs with only comments changed exactly match in gcc?

大憨熊 submitted on 2019-11-29 20:54:36
cyphar

It's because the file names are different (although the strings output is the same). If you edit a single file in place (rather than keeping two differently named files), you'll notice that the output binaries no longer differ. As both Jens and I said, GCC dumps a whole load of metadata into the binaries it builds, including the exact source filename (and AFAICS so does clang).

Try this:

$ mkdir -p subdir
$ cp code.c code2.c
$ cp code.c subdir/code.c
$ gcc code.c -o a
$ gcc code2.c -o b
$ gcc subdir/code.c -o a2
$ diff a b
Binary files a and b differ
$ diff a2 b
Binary files a2 and b differ
$ diff -s a a2
Files a and a2 are identical

This explains why your md5sums don't change between builds but do differ between differently named files. If you want, you can do what Jens suggested and compare the output of strings for each binary; you'll notice that the filenames are embedded in the binary. If you want to "fix" this, you can strip the binaries, which removes the metadata:

$ strip a a2 b
$ diff -s a b
Files a and b are identical
$ diff -s a2 b
Files a2 and b are identical
$ diff -s a a2
Files a and a2 are identical
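Note that strip rewrites the files in place, so the unstripped builds are gone afterwards. If you want to keep them, a minimal sketch along these lines runs the same check on temporary copies. The function name compare_stripped is made up for illustration, and it assumes strip, cmp and mktemp are on your PATH:

```shell
# Hypothetical helper: strip temporary copies of two binaries and compare them,
# leaving the original files untouched.
# Exit status: 0 = identical after strip, 1 = still different, 2 = trouble.
compare_stripped() {
    [ "$#" -eq 2 ] || { echo "usage: compare_stripped BIN1 BIN2" >&2; return 2; }
    cs_dir=$(mktemp -d) || return 2
    if cp "$1" "$cs_dir/first" && cp "$2" "$cs_dir/second" &&
       strip "$cs_dir/first" "$cs_dir/second"; then
        if cmp -s "$cs_dir/first" "$cs_dir/second"; then
            echo "identical after strip"
            cs_status=0
        else
            echo "still different after strip"
            cs_status=1
        fi
    else
        cs_status=2
    fi
    rm -rf "$cs_dir"
    return "$cs_status"
}
```

Usage would be something like `compare_stripped a b`. If it still reports a difference, the cause is something other than the strippable metadata.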

The most common reasons are file names and time stamps added by the compiler (usually in the debug-info sections of the ELF file).

Try running

 $ strings -a program > x
 ...recompile program...
 $ strings -a program > y
 $ diff x y

and you might see the reason. I once used this to find out why the same source produced different code when compiled in different directories. It turned out that the __FILE__ macro expanded to an absolute file name, which differed between the two trees.
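To see the __FILE__ effect in isolation, here is a small sketch. The function name file_macro_demo and the source file file_macro.c are made up for illustration, and CC is assumed to default to gcc. It compiles the same one-line program twice, once via a relative path and once via an absolute path, and prints what __FILE__ expanded to in each build:

```shell
# Hypothetical helper: show that __FILE__ follows the path given to the compiler.
# Assumes a C compiler is available as $CC (default: gcc).
file_macro_demo() {
    fm_cc=${CC:-gcc}
    fm_dir=$(mktemp -d) || return 1
    if (
        cd "$fm_dir" || exit 1
        # Tiny test program that just prints __FILE__.
        printf '%s\n' \
            '#include <stdio.h>' \
            'int main(void) { puts(__FILE__); return 0; }' > file_macro.c
        "$fm_cc" file_macro.c -o rel || exit 1          # relative path
        "$fm_cc" "$PWD/file_macro.c" -o abs || exit 1   # absolute path
        ./rel
        ./abs
    ); then
        fm_status=0
    else
        fm_status=1
    fi
    rm -rf "$fm_dir"
    return "$fm_status"
}
```

Running `file_macro_demo` should print the bare file name first and the full temporary path second. If your build needs __FILE__ but you want stable output, newer GCC releases (8 and later, as far as I know) have -fmacro-prefix-map=old=new for rewriting that prefix.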

LSerni

Note: remember that the source file name goes into the unstripped binary, so two programs coming from differently named source files will have different hashes.

In similar situations, should the above not apply, you can try:

  • running strip against the binary to remove some fat. If the stripped binaries are the same, then the difference was in metadata that isn't essential to the program's operation.
  • generating intermediate assembly output to verify that the difference is not in the actual CPU instructions (or, failing that, to pinpoint where the difference actually is).
  • using strings, or dumping both programs to hex and running a diff on the two hex dumps. Once you have located the difference(s), see whether there's some rhyme or reason to them (PID, timestamps, source file timestamp...). For example, you might have a routine that stores the compile-time timestamp for diagnostic purposes.
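The assembly and hex-dump checks above can be sketched as two small shell functions. The names asm_diff and hexdump_diff are made up for illustration. The sketch assumes od, diff and mktemp are available, and that the compiler is reachable as $CC (default: gcc):

```shell
# Hypothetical helpers; exit status follows diff:
# 0 = same, 1 = different, 2 = trouble.

# Dump both files as hex bytes (one offset column, then the bytes) and diff.
hexdump_diff() {
    [ "$#" -eq 2 ] || { echo "usage: hexdump_diff FILE1 FILE2" >&2; return 2; }
    hd_dir=$(mktemp -d) || return 2
    if od -A x -t x1 -v "$1" > "$hd_dir/first.hex" &&
       od -A x -t x1 -v "$2" > "$hd_dir/second.hex"; then
        if diff "$hd_dir/first.hex" "$hd_dir/second.hex"; then
            hd_status=0
        else
            hd_status=1
        fi
    else
        hd_status=2
    fi
    rm -rf "$hd_dir"
    return "$hd_status"
}

# Compile both sources to assembly only (-S) and diff the results.
# GCC typically records the source name in a .file directive, so expect
# that line to show up when the two sources have different names.
asm_diff() {
    [ "$#" -eq 2 ] || { echo "usage: asm_diff SRC1 SRC2" >&2; return 2; }
    ad_cc=${CC:-gcc}
    ad_dir=$(mktemp -d) || return 2
    if "$ad_cc" -S "$1" -o "$ad_dir/first.s" &&
       "$ad_cc" -S "$2" -o "$ad_dir/second.s"; then
        if diff "$ad_dir/first.s" "$ad_dir/second.s"; then
            ad_status=0
        else
            ad_status=1
        fi
    else
        ad_status=2
    fi
    rm -rf "$ad_dir"
    return "$ad_status"
}
```

For example, `asm_diff code.c code2.c` shows whether anything besides bookkeeping lines differs in the generated assembly, and `hexdump_diff a b | head` gives the first offsets where the binaries diverge. Keep in mind that if one binary has extra bytes inserted, everything after that point shifts and the hex diff gets noisy.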