What does the git index contain EXACTLY?

谎友^ 2020-11-22 17:28

What does the Git index exactly contain, and what command can I use to view the content of the index?


Update

Thanks for all your answe

6 Answers
  •  谎友^
    谎友^ (original poster)
    2020-11-22 17:32

    Bit by bit analysis

    I've decided to do a little testing to better understand the format and research some of the fields in more detail.

    The results below are the same for Git versions 1.8.5.2 and 2.3.

    I have marked points which I'm not sure / haven't found with TODO: please feel free to complement those points.

    As others mentioned, the index is stored under .git/index, not as a standard tree object, and its format is binary and documented at: https://github.com/git/git/blob/master/Documentation/technical/index-format.txt

    The major structs that define the index are in cache.h, because the index is a cache for creating commits.

    Setup

    When we start a test repository with:

    git init
    echo a > b
    git add b
    tree --charset=ascii
    

    The .git directory looks like:

    .git/objects/
    |-- 78
    |   `-- 981922613b2afb6025042ff6bd878ac1994e85
    |-- info
    `-- pack
    

    And if we get the content of the only object:

    git cat-file -p 78981922613b2afb6025042ff6bd878ac1994e85
    

    We get a. This indicates that:

    • the index points to the file contents, since git add b created a blob object
    • it stores the metadata in the index file, not in a tree object, since there was only a single object: the blob (on regular Git objects, blob metadata is stored on the tree)
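    The object name can be reproduced without Git. The following is a minimal sketch, assuming the usual loose blob encoding (the ASCII header "blob <size>", a NUL byte, then the raw content); the function name git_blob_sha1 is made up for illustration.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Return the SHA-1 object name Git would assign to a blob with this content."""
    # A blob is hashed as: b"blob " + decimal size + NUL + raw content.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# `echo a > b` writes the two bytes "a\n".
print(git_blob_sha1(b"a\n"))
```

    For the two bytes written by echo a > b, this should print the same ID as the single object under .git/objects.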

    hd analysis

    Now let's look at the index itself:

    hd .git/index
    

    Gives:

    00000000  44 49 52 43 00 00 00 02  00 00 00 01 54 09 76 e6  |DIRC.... ....T.v.|
    00000010  1d 81 6f c6 54 09 76 e6  1d 81 6f c6 00 00 08 05  |..o.T.v. ..o.....|
    00000020  00 e4 2e 76 00 00 81 a4  00 00 03 e8 00 00 03 e8  |...v.... ........|
    00000030  00 00 00 02 78 98 19 22  61 3b 2a fb 60 25 04 2f  |....x.." a;*.`%./|
    00000040  f6 bd 87 8a c1 99 4e 85  00 01 62 00 ee 33 c0 3a  |......N. ..b..3.:|
    00000050  be 41 4b 1f d7 1d 33 a9  da d4 93 9a 09 ab 49 94  |.AK...3. ......I.|
    00000060
    

    From this dump we can deduce the following layout (offsets in hex):

      | 0           | 4            | 8           | C              |
      |-------------|--------------|-------------|----------------|
    0 | DIRC        | Version      | File count  | ctime       ...| 0
      | ...         | mtime                      | device         |
    2 | inode       | mode         | UID         | GID            | 2
      | File size   | Entry SHA-1                              ...|
    4 | ...                        | Flags       | Index SHA-1 ...| 4
      | ...                                                       |
    

    First comes the header, defined by struct cache_header:

    • 44 49 52 43: DIRC. TODO: why is this necessary?

    • 00 00 00 02: format version: 2. The index format has evolved over time, and versions up to 4 currently exist. The index format should not be an issue when collaborating between different computers through GitHub, because bare repositories don't store the index: it is generated at clone time.

    • 00 00 00 01: count of files on the index: just one, b.
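    The three header fields can be decoded in a few lines. This is only a sketch: the 12 bytes are copied from the hexdump above, the variable names are mine, and the fields are read as big-endian, as the dump suggests.

```python
import struct

# First 12 bytes of the hexdump: signature, version, entry count.
HEADER = bytes.fromhex("44495243" "00000002" "00000001")

# ">" selects big-endian with no padding: 4 raw bytes, then two 32-bit unsigned ints.
signature, version, entry_count = struct.unpack(">4sII", HEADER)

print(signature, version, entry_count)  # b'DIRC' 2 1
```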

    Next starts a list of index entries, defined by struct cache_entry. Here we have just one. It contains:

    • a bunch of file metadata: 8 byte ctime, 8 byte mtime, then 4 byte: device, inode, mode, UID and GID.

      Note how:

      • ctime and mtime are the same (54 09 76 e6 1d 81 6f c6) as expected since we haven't modified the file

        The first 4 bytes are seconds since the Unix epoch. Converting the hex value to decimal and passing it to date:

        date --date="@$(printf "%d" "0x540976e6")"
        

        Gives:

        Fri Sep  5 10:40:06 CEST 2014
        

        Which is when I made this example.

        The second 4 bytes are nanoseconds.

      • UID and GID are 00 00 03 e8, which is 1000 in decimal: a common value for single user setups.

      All of this metadata, most of which is not present in tree objects, allows Git to check if a file has changed quickly without comparing the entire contents.

    • at the beginning of the hexdump line at offset 0x30: 00 00 00 02: file size: 2 bytes (a and \n from echo)

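    • 78 98 19 22 ... c1 99 4e 85: 20 byte SHA-1 of the blob object that holds the file content. It is the same ID as the only object in .git/objects above (78981922613b2afb6025042ff6bd878ac1994e85), so it is computed over the file content, not over the preceding fields of the entry. Consistent with this, my experiments with the assume valid flag show that the flags that follow are not included in this SHA-1.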

    • 2 byte flags: 00 01

      • 1 bit: assume valid flag. My investigations indicate that this poorly named flag is where git update-index --assume-unchanged stores its state: https://stackoverflow.com/a/28657085/895245

      • 1 bit extended flag. Determines if the extended flags are present or not. Must be 0 on version 2 which does not have extended flags.

      • 2 bit stage flag used during merge. Stages are documented in man git-merge:

        • 0: regular file, not in a merge conflict
        • 1: base
        • 2: ours
        • 3: theirs

        During a merge conflict, all stages from 1-3 are stored in the index to allow operations like git checkout --ours.

        If you git add, then a stage 0 is added to the index for the path, and Git will know that the conflict has been marked as solved. TODO: check this.

      • 12 bit length of the path that follows: 0x001, i.e. 1 byte, since the path is b

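    • 2 byte extended flags. Only present if the "extended flag" bit was set in the basic flags, which requires index version 3 or later. They are absent in this version 2 dump, where the path follows the basic flags directly. TODO.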

    • 62 (ASCII b): variable length path. Length determined in the previous flags, here just 1 byte, b.
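    The entry fields listed above can be decoded together. What follows is a sketch, not Git's own parser: the 64 bytes are copied from the hexdump (offsets 0x0c to 0x4b), it assumes a version 2 entry without extended flags, and parse_entry and the dictionary keys are names I made up.

```python
import struct
from datetime import datetime, timezone

# The single index entry from the hexdump, field by field.
ENTRY = bytes.fromhex(
    "540976e61d816fc6"  # ctime: seconds, nanoseconds
    "540976e61d816fc6"  # mtime: seconds, nanoseconds
    "00000805"          # device
    "00e42e76"          # inode
    "000081a4"          # mode
    "000003e8"          # UID
    "000003e8"          # GID
    "00000002"          # file size
    "78981922613b2afb6025042ff6bd878ac1994e85"  # blob SHA-1
    "0001"              # flags
    "62"                # path: "b"
    "00"                # zero padding
)

# Ten 32-bit ints, a 20-byte hash, a 16-bit flags field: 62 bytes, big-endian.
FIXED = struct.Struct(">10I20sH")


def parse_entry(raw: bytes) -> dict:
    """Decode one version 2 index entry (no extended flags assumed)."""
    (ctime_s, ctime_ns, mtime_s, mtime_ns, dev, ino, mode,
     uid, gid, size, sha1, flags) = FIXED.unpack_from(raw)
    name_len = flags & 0x0FFF  # low 12 bits: path length
    return {
        "ctime": datetime.fromtimestamp(ctime_s, tz=timezone.utc),
        "mtime": datetime.fromtimestamp(mtime_s, tz=timezone.utc),
        "mode": oct(mode),
        "uid": uid,
        "gid": gid,
        "size": size,
        "sha1": sha1.hex(),
        "assume_valid": bool(flags & 0x8000),  # highest bit
        "extended": bool(flags & 0x4000),      # next bit
        "stage": (flags >> 12) & 0x3,          # 2-bit merge stage
        "path": raw[FIXED.size:FIXED.size + name_len].decode(),
    }


for key, value in parse_entry(ENTRY).items():
    print(f"{key}: {value}")
```

    The timestamps are printed in UTC, so the ctime shows as 08:40:06 rather than the 10:40:06 CEST given by date above.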

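    Then comes a 00: 1-8 bytes of zero padding so that the path is null-terminated and the entry length is a multiple of 8 bytes. Here the fixed fields take 62 bytes and the path 1 byte, so a single 00 brings the entry to 64 bytes. This padding only exists before index version 4.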
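    The padding rule can be written as a small calculation. This sketch assumes a version 2 entry with 62 bytes of fixed fields and no extended flags; padded_entry_size is a hypothetical helper, and the rounding expression is my reading of the rule, not a quote from Git's source.

```python
FIXED_FIELDS = 62  # stat data, size, SHA-1 and flags of a version 2 entry


def padded_entry_size(name_len: int) -> int:
    """Total on-disk size of an entry: at least one NUL after the path,
    rounded up to the next multiple of 8 bytes."""
    return (FIXED_FIELDS + name_len + 8) & ~7


for name_len in (1, 2, 3):
    total = padded_entry_size(name_len)
    padding = total - FIXED_FIELDS - name_len
    print(f"path length {name_len}: entry {total} bytes, {padding} padding byte(s)")
```

    For the 1 byte path b this gives 64 bytes with one padding byte, which matches the dump: the entry starts at offset 0x0c and the checksum starts at 0x4c.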

    No extensions were used. Git knows this because there would not be enough space left in the file for the checksum.

    Finally there is a 20 byte checksum ee 33 c0 3a .. 09 ab 49 94 over the content of the index.
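    The trailing checksum can be checked with a few lines. This is a sketch that assumes a SHA-1 repository, where the last 20 bytes are the SHA-1 of everything before them; index_checksum_ok is a name I made up, and the demonstration data is synthetic, not a real index.

```python
import hashlib


def index_checksum_ok(data: bytes) -> bool:
    """Return True if the last 20 bytes are the SHA-1 of all preceding bytes."""
    body, trailer = data[:-20], data[-20:]
    return hashlib.sha1(body).digest() == trailer


# Synthetic demonstration: a made-up body with a correct and a corrupted trailer.
body = b"DIRC" + bytes(8)
good = body + hashlib.sha1(body).digest()
bad = good[:-1] + bytes([good[-1] ^ 0xFF])

print(index_checksum_ok(good), index_checksum_ok(bad))  # True False
```

    To try it on a real repository, read the file in binary mode, for example index_checksum_ok(open(".git/index", "rb").read()).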
