How does Git create unique commit hashes, mainly the first few characters?

Submitted by 烈酒焚心 on 2019-11-29 02:48:50

Question


I find it hard to wrap my head around how Git creates fully unique hashes that aren't allowed to be the same even in the first 4 characters. I'm able to refer to commits in Git Bash using only the first four characters. Is it specifically decided in the algorithm that the first characters are "ultra"-unique and will never conflict with other similar hashes, or does the algorithm generate every part of the hash in the same way?


Answer 1:


Git uses the following information to generate the SHA-1 of a commit:

  • The source tree of the commit (which unravels to all the subtrees and blobs)
  • The parent commit's SHA-1
  • The author info (with timestamp)
  • The committer info (yes, author and committer are different; this also carries a timestamp)
  • The commit message

(For the complete explanation, look here.)
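To make the list above concrete, here is a minimal sketch in Python of how those fields end up in the hash. Git hashes the object type, the body size, a NUL byte, and then the body. The helper names (git_object_id, commit_body), the parent id, the identities, and the timestamps are all made up for illustration; only the empty-tree id is a real Git value.

```python
import hashlib


def git_object_id(obj_type: bytes, body: bytes) -> str:
    """SHA-1 over '<type> <size>' + NUL + body, which is how Git names an object."""
    header = obj_type + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree: str, parent: str, author: str, committer: str, message: str) -> bytes:
    """Assemble the text of a commit object from the fields listed above."""
    lines = [
        f"tree {tree}",
        f"parent {parent}",
        f"author {author}",
        f"committer {committer}",
        "",  # blank line separates the headers from the message
        message,
    ]
    return "\n".join(lines).encode("utf-8")


# Hypothetical inputs: the parent id, names, and timestamps are invented.
tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # id of the empty tree
parent = "0123456789abcdef0123456789abcdef01234567"
author = "A U Thor <author@example.com> 1453000000 +0000"

first = git_object_id(b"commit", commit_body(
    tree, parent, author,
    "C O Mitter <committer@example.com> 1453000000 +0000",
    "Example message\n"))

# Same commit, but committed 60 seconds later: the id changes completely.
second = git_object_id(b"commit", commit_body(
    tree, parent, author,
    "C O Mitter <committer@example.com> 1453000060 +0000",
    "Example message\n"))

print(first)
print(second)
```

Every hex digit comes out of the same SHA-1 computation, so the first characters are no more "special" than the last ones.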

Git does NOT guarantee that the first 4 characters will be unique. In chapter 7 of the Pro Git Book it is written:

Git can figure out a short, unique abbreviation for your SHA-1 values. If you pass --abbrev-commit to the git log command, the output will use shorter values but keep them unique; it defaults to using seven characters but makes them longer if necessary to keep the SHA-1 unambiguous:

So Git just makes the abbreviation as long as necessary to remain unique. They even note that:

Generally, eight to ten characters are more than enough to be unique within a project.

As an example, the Linux kernel, which is a pretty large project with over 450k commits and 3.6 million objects, has no two objects whose SHA-1s overlap more than the first 11 characters.
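The "make it longer if necessary" rule can be sketched in a few lines of Python. This is not Git's actual implementation (Git looks prefixes up in its object database); shortest_unique_abbrev is a hypothetical helper, and the object ids are invented so that two of them share their first seven characters.

```python
def shortest_unique_abbrev(target, all_ids, min_len=7):
    """Return the shortest prefix of target (at least min_len long) that no other id shares."""
    others = [oid for oid in all_ids if oid != target]
    for length in range(min_len, len(target) + 1):
        prefix = target[:length]
        if not any(oid.startswith(prefix) for oid in others):
            return prefix
    return target


# Invented ids: the first two collide on their first 7 hex digits.
object_ids = [
    "1234567a" + "0" * 32,
    "1234567b" + "0" * 32,
    "89abcdef" + "0" * 32,
]

for oid in object_ids:
    print(shortest_unique_abbrev(oid, object_ids))
```

The first two ids need eight characters to be told apart, while the third is already unique at the default seven.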

In other words, Git simply relies on how improbable it is for two objects to have the exact same SHA-1, or even the same first X characters of one.
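To put rough numbers on that improbability, here is a small birthday-bound estimate in Python. It assumes SHA-1 output behaves like uniformly random hex digits, which is an idealization, and prefix_collision_probability is a hypothetical helper name. The 3.6 million figure is the Linux kernel object count quoted above.

```python
import math


def prefix_collision_probability(num_objects: int, prefix_len: int) -> float:
    """Approximate chance that at least two objects share the same hex prefix."""
    space = 16 ** prefix_len                      # number of possible prefixes
    pairs = num_objects * (num_objects - 1) / 2   # number of object pairs
    # 1 - exp(-pairs / space), computed with expm1 to stay accurate near zero
    return -math.expm1(-pairs / space)


# A tiny repository with 4-character prefixes, then a kernel-sized one.
print(prefix_collision_probability(1_000, 4))
for prefix_len in (7, 10, 11, 12, 40):
    print(prefix_len, prefix_collision_probability(3_600_000, prefix_len))
```

Under this model a 4-character prefix is almost certain to be ambiguous somewhere once a repository holds about a thousand objects, whereas the full 40 characters give a probability that is negligible for any realistic repository size.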




Answer 2:


Apr. 2017: Beware that after the whole shattered.io episode (where Google achieved a SHA-1 collision), the 20-byte format won't be there forever.

A first step toward that is to replace unsigned char sha1[20], which is hard-coded all over the Git codebase, with a generic object whose definition might change in the future (SHA2?, Blake2, ...).

See commit e86ab2c (21 Feb 2017) by brian m. carlson (bk2204).

Convert the remaining uses of unsigned char [20] to struct object_id.

That is an example of an ongoing effort started with commit 5f7817c (13 Mar 2015) by brian m. carlson (bk2204), for v2.5.0-rc0, in cache.h:

/* The length in bytes and in hex digits of an object name (SHA-1 value). */
#define GIT_SHA1_RAWSZ 20
#define GIT_SHA1_HEXSZ (2 * GIT_SHA1_RAWSZ)

struct object_id {
    unsigned char hash[GIT_SHA1_RAWSZ];
};
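As a rough Python analogy of what that struct buys you (not a translation of Git's code), the sketch below wraps the raw bytes in a small type so that callers never hard-code the length themselves. ObjectId, from_hex, and to_hex are hypothetical names; the two constants mirror the C defines above.

```python
from dataclasses import dataclass

GIT_SHA1_RAWSZ = 20                  # length of the raw hash in bytes
GIT_SHA1_HEXSZ = 2 * GIT_SHA1_RAWSZ  # length of the hex form in characters


@dataclass(frozen=True)
class ObjectId:
    """Stand-in for struct object_id: the length check lives in one place."""
    raw: bytes

    def __post_init__(self):
        if len(self.raw) != GIT_SHA1_RAWSZ:
            raise ValueError(f"expected {GIT_SHA1_RAWSZ} raw bytes, got {len(self.raw)}")

    @classmethod
    def from_hex(cls, hex_name: str) -> "ObjectId":
        return cls(bytes.fromhex(hex_name))

    def to_hex(self) -> str:
        return self.raw.hex()


oid = ObjectId.from_hex("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
print(len(oid.raw), len(oid.to_hex()))  # prints "20 40"
```

If the hash ever changes, only the constants and the wrapper need to know, which is the point of the conversion effort described above.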

And don't forget that, even with SHA-1, the first 4 characters are no longer enough to guarantee uniqueness, as I explain in "How much of a git sha is generally considered necessary to uniquely identify a change in a given codebase?".


Update Dec. 2017 with Git 2.16 (Q1 2018): this effort to support an alternative SHA is underway: see "Why doesn't Git use more modern SHA?".

You will be able to use another hash: SHA1 is no longer the only one for Git.

Update 2018-2019: the choice has been made in Git 2.19+: SHA-256.
See "hash-function-transition".

This is not yet active (meaning Git 2.21 still uses SHA-1), but work is under way to support SHA-256 in the future.
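For a feel of what the switch means for object names, here is a short Python sketch that hashes the same object with both algorithms. It assumes the "<type> <size>" plus NUL framing stays the same under SHA-256, which is my reading of the hash-function-transition document rather than something verified against a SHA-256 repository; object_id is a hypothetical helper.

```python
import hashlib


def object_id(obj_type: bytes, body: bytes, algo: str = "sha1") -> str:
    """Name an object with the given hash, assuming the usual Git header framing."""
    header = obj_type + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.new(algo, header + body).hexdigest()


body = b"hello world\n"
sha1_name = object_id(b"blob", body, "sha1")
sha256_name = object_id(b"blob", body, "sha256")

print(len(sha1_name), sha1_name)      # 40 hex digits
print(len(sha256_name), sha256_name)  # 64 hex digits
```

Abbreviated names work the same way in either case: they are just prefixes of a longer string, so the uniqueness caveats about short prefixes still apply.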



Source: https://stackoverflow.com/questions/34764195/how-does-git-create-unique-commit-hashes-mainly-the-first-few-characters
