Git objects SHA-1 are file contents or file names?

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-07 10:01:32

Question


I am confused about how a file's actual contents are stored in .git.

For example, "Version 1" is the actual text content of test.txt. When I commit it to the repo (first commit), Git returns a SHA-1 for that file, which is located at .git\objects\0c\15af113a95643d7c244332b0e0b287184cd049.

When I open the file 15af113a95643d7c244332b0e0b287184cd049 in a text editor, it's all garbage, something like this:

x+)JMU074f040031QÐKÏ,ÉLÏË/Je¨}ºõw[Éœ„ÇR­ ñ·Î}úyGª*±8#³¨,1%>9?¯$5¯D¯¤¢„áôÏ3%³þú>š~}Ž÷*ë²-¶ç¡êÊòR“KâKòãs+‹sô

But I'm not sure whether this garbage is the encrypted form of the text "Version 1", or whether that text is represented by the SHA-1 15af113a95643d7c244332b0e0b287184cd049.


Answer 1:


The correct answer to the question in the subject line:

Git objects SHA-1 are file contents or file names?

is probably "neither", since you were referring to the contents of the loose object file, rather than the original file—and even if you were referring to the original file, that's still not quite right.

A loose object, in Git, is a plain file. The name of the file is constructed from the object's hash ID. The object's hash ID, in turn, is constructed by computing a hash of the object's contents with a prefix header attached.

The prefixed header depends on the object type. There are four types: blob, commit, tag, and tree. The header is a zero-terminated byte string: the type name as an ASCII (or, equivalently, UTF-8) byte string, followed by a space, followed by the decimal representation of the object's size in bytes, followed by an ASCII NUL (b'\x00' in Python, if you prefer modern Python notation, or '\0' if you prefer C).

After the header come the actual object contents. So, for a file containing the byte string b'hello\n', the data to be hashed consist of b'blob 6\0hello\n':

$ echo 'hello' | git hash-object -t blob --stdin
ce013625030ba8dba906f756967f9e9ca394464a
$ python3
[...]
>>> import hashlib
>>> s = b'blob 6\0hello\n'
>>> hashlib.sha1(s).hexdigest()
'ce013625030ba8dba906f756967f9e9ca394464a'

Hence, the file name that would be used to store this file is (derived from) ce013625030ba8dba906f756967f9e9ca394464a. As a loose object, it becomes .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a.
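The header-plus-contents rule and the path derivation can be wrapped into two small helpers. This is only a minimal sketch: the function names git_object_id and loose_object_path are made up for illustration and are not part of Git, and it assumes a repository that uses SHA-1 object names.

```python
import hashlib

def git_object_id(obj_type: str, body: bytes) -> str:
    """Return the SHA-1 of b'<type> <size>' + NUL + body, as hex."""
    # Header: type name, a space, the decimal size, then a NUL byte.
    header = obj_type.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def loose_object_path(object_id: str, git_dir: str = ".git") -> str:
    """First two hex digits name the directory, the remaining 38 name the file."""
    return f"{git_dir}/objects/{object_id[:2]}/{object_id[2:]}"

blob_id = git_object_id("blob", b"hello\n")
print(blob_id)                     # ce013625030ba8dba906f756967f9e9ca394464a
print(loose_object_path(blob_id))  # .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
```

Note that the file's name (test.txt or anything else) never enters the computation: two files with identical contents get the same blob ID.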

The contents of that file, however, are the zlib-compressed form of b'blob 6\0hello\n' (apparently at level=1: Python's default level is currently 6, and the result does not match at that level. It's not clear whether Git's zlib deflate exactly matches Python's, but using level 1 did work here):

$ echo 'hello' | git hash-object -w -t blob --stdin
ce013625030ba8dba906f756967f9e9ca394464a
$ vis .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
x\^AK\M-J\M-IOR0c\M-HH\M-M\M-I\M-I\M-g\^B\000\^]\M-E\^D\^T$

(note that the final $ is the shell prompt again; now back to Python3)

>>> import zlib
>>> zlib.compress(s, 1)
b'x\x01K\xca\xc9OR0c\xc8H\xcd\xc9\xc9\xe7\x02\x00\x1d\xc5\x04\x14'
>>> import vis
>>> print(vis.vis(zlib.compress(s, 1)))
x\^AK\M-J\M-IOR0c\M-HH\M-M\M-I\M-I\M-g\^B\^@\^]\M-E\^D\^T

where vis.py is:

def vischr(byte):
    "encode characters the way vis(1) does by default"
    if byte in b' \t\n':
        return chr(byte)
    # control chars: \^X; del: \^?
    if byte < 32 or byte == 127:
        return r'\^' + chr(byte ^ 64)
    # printable characters, 32..126
    if byte < 128:
        return chr(byte)
    # meta characters: prefix with \M^ or \M-
    byte -= 128
    if byte < 32 or byte == 127:
        return r'\M^' + chr(byte ^ 64)
    return r'\M-' + chr(byte)

def vis(bytestr):
    "same as vis(1)"
    return ''.join(vischr(c) for c in bytestr)

(vis produces an invertible but printable encoding of binary files; it was my 1993-ish answer to problems with cat -v).

Note that the names of files stored in a Git repository (under a commit) appear only as path name components stored in individual tree objects. Computing the hash ID of a tree object is nontrivial; I have Python code that does this in my public "scripts" repository under githash.py.
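To give a rough idea of where file names end up, here is a simplified sketch of how a tree object's body can be assembled and hashed. It is not the githash.py code mentioned above, and the names tree_body and git_object_id are hypothetical. It assumes SHA-1 object IDs, a single flat tree whose entries are given as (mode, name, hex ID) tuples, and the usual entry layout of mode, space, name, NUL, then the 20 raw bytes of the entry's ID.

```python
import hashlib

def git_object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over the '<type> <size>' header, a NUL byte, and the body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def tree_body(entries) -> bytes:
    """Build a tree body from (mode, name, hex_id) tuples.

    Example entry: ("100644", "test.txt", "<40 hex digits>").
    """
    def sort_key(entry):
        mode, name, _ = entry
        # Git orders entries by name bytes, treating subtrees
        # (mode 40000) as if their names ended in "/".
        return name.encode("utf-8") + (b"/" if mode == "40000" else b"")

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        # The entry's ID is stored as 20 raw bytes, not as hex text.
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(hex_id)
    return body

# The empty tree has a well-known ID.
print(git_object_id("tree", tree_body([])))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904

# A tree with one file: the name appears here, not in the blob.
blob_id = git_object_id("blob", b"hello\n")
print(git_object_id("tree", tree_body([("100644", "hello.txt", blob_id)])))
```

Renaming hello.txt changes the tree's ID but leaves the blob's ID untouched, which is why the blob hash is neither the file name nor a hash of it.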




Answer 2:


Git Magic mentions:

By the way, the files within .git/objects are compressed with zlib so you should not stare at them directly. Filter them through zpipe -d, or type (using git cat-file):

$ git cat-file -p 0c15af113a95643d7c244332b0e0b287184cd049

With zpipe:

$ ./zpipe -d < .git/objects/0c/15af113a95643d7c244332b0e0b287184cd049

Note: for zpipe, I had to compile zpipe.c first:

sudo apt-get install zlib1g-dev
cd /usr/share/doc/zlib1g-dev/examples
sudo gunzip zpipe.c.gz
sudo gcc -o zpipe zpipe.c -lz

Then:

$ /usr/share/doc/zlib1g-dev/examples/zpipe -d <

You will get a result like:

vonc@VONCAVN7:/mnt/d/git/seec$ /usr/share/doc/zlib1g-dev/examples/zpipe -d < .git/objects/0d/b6225927ef60e21138a9762c41ea0db714ca0d
blob 2142 <full content there...>

You see a header composed of the type and content size, followed by the actual content.
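If compiling zpipe is inconvenient, the same inspection can be approximated in Python with the standard zlib module. The sketch below is illustrative only: parse_loose_object and read_loose_object are hypothetical helper names, and it assumes the file is a loose object (not a packfile) laid out as a zlib stream holding the header, a NUL byte, and then the content.

```python
import zlib

def parse_loose_object(raw: bytes):
    """Decompress loose-object bytes and split them into (type, size, body)."""
    data = zlib.decompress(raw)
    # Everything before the first NUL is the header, e.g. b"blob 6".
    header, _, body = data.partition(b"\0")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type.decode("ascii"), int(size), body

def read_loose_object(path: str):
    """Read a file under .git/objects/ and parse it."""
    with open(path, "rb") as f:
        return parse_loose_object(f.read())

# In-memory round trip, so the example runs without a repository.
raw = zlib.compress(b"blob 6\0hello\n", 1)
print(parse_loose_object(raw))  # ('blob', 6, b'hello\n')
```

With a real repository, read_loose_object would be called with a path such as the .git/objects/0c/15af... file from the question.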

See "Understanding Git Internals" from Jeff Kunkle, slide 8, for an illustration of a blob's actual content.



Source: https://stackoverflow.com/questions/44475891/git-objects-sha-1-are-file-contents-or-file-names
