Question
Git's blob object file format is blob <size string>\0<data>, where <size string> is the length of the data in bytes, written as ASCII decimal.
The blob-identifying SHA-1 hash is calculated not from the blob contents alone, but from the header-augmented blob data (as described above).
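To make the header-augmented hashing concrete, here is a minimal sketch in Python. The function name git_blob_id is made up for this illustration and is not part of Git; the sketch assumes the SHA-1 object format described above.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the SHA-1 object id Git would assign to a blob holding `data`.

    Hypothetical helper for illustration: it hashes the header
    "blob <size>\\0" followed by the raw contents.
    """
    # The size is the byte length of the contents, as ASCII decimal.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def plain_sha1(data: bytes) -> str:
    """SHA-1 of the contents alone, for comparison."""
    return hashlib.sha1(data).hexdigest()


if __name__ == "__main__":
    content = b"hello world\n"
    print("git blob id:", git_blob_id(content))
    print("plain sha1 :", plain_sha1(content))
```

The first value should match what git hash-object prints for a file with the same contents, while the second is the "pure data" hash the question is asking about; the two differ for every input because of the header.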
As a purist, I do not like that architecture. It mixes a universal property of the data (its SHA-1 hash) with a Git-specific header.
Another advantage of pure-data blob storage is that the files can be added to the index using "copy-on-write" instead of copying the whole file. The required space could be halved and some operations could become faster.
So, why did Git developers choose to use the header-based format instead of the pure data format?
P.S. AFAIK in the early days of Git the SHA-1 hash was based on the compressed data.
Answer 1:
AFAIK in the early days of Git the SHA-1 hash was based on the compressed data.
Yes, and that led to all kinds of "optimizations", like commit 65c2e0c (git 0.99, June 2005):
Find size of SHA1 object without inflating everything.
But that new format, illustrated in "How does git compute file hashes?", can be traced back to:
git diff, in commit 051308f (git 1.4.0-rc1, May 2006)
git fast-import, started in commit db5e523 (git 1.5.0, Aug. 2006)
Each time, the length of the data is needed to do anything with the data itself.
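As a rough illustration of why a size in the header is convenient, the sketch below reads the type and length of a zlib-compressed object by inflating only its first few bytes. It is an assumption-laden toy, not Git's actual C code: make_loose_object and read_header are hypothetical names, and the 64-byte probe is an arbitrary choice that comfortably covers the header.

```python
import zlib
from typing import Tuple


def make_loose_object(data: bytes) -> bytes:
    """Build a zlib-compressed 'blob <size>\\0<data>' payload (illustrative)."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return zlib.compress(header + data)


def read_header(compressed: bytes, probe: int = 64) -> Tuple[str, int]:
    """Return (type, size) while inflating at most `probe` bytes of output."""
    inflater = zlib.decompressobj()
    # max_length caps the decompressed output, so the bulk of the
    # object is never inflated here.
    head = inflater.decompress(compressed, probe)
    kind, _, rest = head.partition(b" ")
    size, nul, _ = rest.partition(b"\0")
    if not nul:
        raise ValueError("header terminator not found within probe window")
    return kind.decode("ascii"), int(size)


if __name__ == "__main__":
    obj = make_loose_object(b"x" * 1_000_000)
    print(read_header(obj))
```

With a headerless format, learning the uncompressed length would require inflating the whole object or storing the length somewhere else.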
Source: https://stackoverflow.com/questions/34425353/why-does-git-store-and-hash-blob-size-in-the-blob-file