What is the file format of a git commit object data structure?

半阙折子戏 2020-12-01 16:27

Context: I was hoping to be able to search through my git commit messages and commits without having to go through the puzzlingly complex git grep command, so I decided to s

3 answers
  •  情歌与酒
    2020-12-01 16:56

    Create a minimal example and reverse engineer the format

    Create a simple repository and, while the objects are still loose, i.e. before anything packs or prunes them (git gc, git config gc.auto, git-prune-packed ...), unpack a commit object with one of the methods from: How to DEFLATE with a command line tool to extract a git object?

    # Fix the author and committer identities and dates so the commit hashes are reproducible.
    export GIT_AUTHOR_DATE="1970-01-01T00:00:00+0000"
    export GIT_AUTHOR_EMAIL="author@example.com"
    export GIT_AUTHOR_NAME="Author Name"
    export GIT_COMMITTER_DATE="2000-01-01T00:00:00+0000"
    export GIT_COMMITTER_EMAIL="committer@example.com"
    export GIT_COMMITTER_NAME="Committer Name"
    
    git init
    
    # First commit.
    echo
    touch a
    git add a
    git commit -m 'First message'
    python3 -c "import zlib,sys;sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))" \
      <.git/objects/45/3a2378ba0eb310df8741aa26d1c861ac4c512f | hd
    
    # Second commit.
    echo
    touch b
    git add b
    git commit -m 'Second message'
    python3 -c "import zlib,sys;sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))" \
      <.git/objects/74/8e6f7e22cac87acec8c26ee690b4ff0388cbf5 | hd
    

    The output is:

    Initialized empty Git repository in /home/ciro/test/git/.git/
    
    [master (root-commit) 453a237] First message
     Author: Author Name <author@example.com>
     1 file changed, 0 insertions(+), 0 deletions(-)
     create mode 100644 a
    00000000  63 6f 6d 6d 69 74 20 31  37 34 00 74 72 65 65 20  |commit 174.tree |
    00000010  34 39 36 64 36 34 32 38  62 39 63 66 39 32 39 38  |496d6428b9cf9298|
    00000020  31 64 63 39 34 39 35 32  31 31 65 36 65 31 31 32  |1dc9495211e6e112|
    00000030  30 66 62 36 66 32 62 61  0a 61 75 74 68 6f 72 20  |0fb6f2ba.author |
    00000040  41 75 74 68 6f 72 20 4e  61 6d 65 20 3c 61 75 74  |Author Name <aut|
    00000050  68 6f 72 40 65 78 61 6d  70 6c 65 2e 63 6f 6d 3e  |hor@example.com>|
    00000060  20 30 20 2b 30 30 30 30  0a 63 6f 6d 6d 69 74 74  | 0 +0000.committ|
    00000070  65 72 20 43 6f 6d 6d 69  74 74 65 72 20 4e 61 6d  |er Committer Nam|
    00000080  65 20 3c 63 6f 6d 6d 69  74 74 65 72 40 65 78 61  |e <committer@exa|
    00000090  6d 70 6c 65 2e 63 6f 6d  3e 20 39 34 36 36 38 34  |mple.com> 946684|
    000000a0  38 30 30 20 2b 30 30 30  30 0a 0a 46 69 72 73 74  |800 +0000..First|
    000000b0  20 6d 65 73 73 61 67 65  0a                       | message.|
    000000ba
    
    [master 748e6f7] Second message
     Author: Author Name <author@example.com>
     1 file changed, 0 insertions(+), 0 deletions(-)
     create mode 100644 b
    00000000  63 6f 6d 6d 69 74 20 32  32 33 00 74 72 65 65 20  |commit 223.tree |
    00000010  32 39 36 65 35 36 30 32  33 63 64 63 30 33 34 64  |296e56023cdc034d|
    00000020  32 37 33 35 66 65 65 38  63 30 64 38 35 61 36 35  |2735fee8c0d85a65|
    00000030  39 64 31 62 30 37 66 34  0a 70 61 72 65 6e 74 20  |9d1b07f4.parent |
    00000040  34 35 33 61 32 33 37 38  62 61 30 65 62 33 31 30  |453a2378ba0eb310|
    00000050  64 66 38 37 34 31 61 61  32 36 64 31 63 38 36 31  |df8741aa26d1c861|
    00000060  61 63 34 63 35 31 32 66  0a 61 75 74 68 6f 72 20  |ac4c512f.author |
    00000070  41 75 74 68 6f 72 20 4e  61 6d 65 20 3c 61 75 74  |Author Name <aut|
    00000080  68 6f 72 40 65 78 61 6d  70 6c 65 2e 63 6f 6d 3e  |hor@example.com>|
    00000090  20 30 20 2b 30 30 30 30  0a 63 6f 6d 6d 69 74 74  | 0 +0000.committ|
    000000a0  65 72 20 43 6f 6d 6d 69  74 74 65 72 20 4e 61 6d  |er Committer Nam|
    000000b0  65 20 3c 63 6f 6d 6d 69  74 74 65 72 40 65 78 61  |e <committer@exa|
    000000c0  6d 70 6c 65 2e 63 6f 6d  3e 20 39 34 36 36 38 34  |mple.com> 946684|
    000000d0  38 30 30 20 2b 30 30 30  30 0a 0a 53 65 63 6f 6e  |800 +0000..Secon|
    000000e0  64 20 6d 65 73 73 61 67  65 0a                    |d message.|
    000000eb
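
    The Python one-liner used above can be expanded into a small function. This is only a sketch in Python 3: the names loose_object_path and read_loose_object are made up for illustration, it handles loose objects only (not packfiles), and the demo writes a throwaway object under a made-up ID rather than touching a real repository.

```python
import os
import tempfile
import zlib


def loose_object_path(git_dir, sha):
    """Path of a loose object: the first two hex digits name the directory."""
    return os.path.join(git_dir, "objects", sha[:2], sha[2:])


def read_loose_object(git_dir, sha):
    """Return (type, content) for a loose (non-packed) object."""
    with open(loose_object_path(git_dir, sha), "rb") as f:
        raw = zlib.decompress(f.read())
    # The decompressed bytes start with "{type} {size}", then a NUL byte.
    header, _, content = raw.partition(b"\0")
    obj_type, size = header.split(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type.decode("ascii"), content


# Demo on a throwaway directory, so no real repository is needed.
# The object ID below is made up; it is not the hash of this content,
# and the content is not a complete commit, only enough to show the framing.
demo_sha = "ab" + "c" * 38
with tempfile.TemporaryDirectory() as demo_git_dir:
    path = loose_object_path(demo_git_dir, demo_sha)
    os.makedirs(os.path.dirname(path))
    with open(path, "wb") as f:
        f.write(zlib.compress(b"commit 14\0First message\n"))
    demo_type, demo_content = read_loose_object(demo_git_dir, demo_sha)

print(demo_type, demo_content)
```

    Against the repository created above, read_loose_object(".git", "453a2378ba0eb310df8741aa26d1c861ac4c512f") would be expected to return the same bytes that the first hex dump shows after the NUL.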
    

    Then we deduce that the format is as follows:

    • Top level:

      commit {size}\0{content}
      

      where {size} is the number of bytes in {content}, written in ASCII decimal (174 and 223 in the dumps above).

      All object types follow this same pattern, with the type name in place of commit.

    • {content}:

      tree {tree_sha}
      {parents}
      author {author_name} <{author_email}> {author_date_seconds} {author_date_timezone}
      committer {committer_name} <{committer_email}> {committer_date_seconds} {committer_date_timezone}
      
      {commit message}
      

      where:

      • {tree_sha}: SHA of the tree object this commit points to, written as hexadecimal text (40 characters in the dumps above).

        This represents the top-level Git repo directory.

        That SHA comes from the format of the tree object: What is the internal format of a git tree object?

      • {parents}: optional list of parent commit objects of form:

        parent {parent1_sha}
        parent {parent2_sha}
        ...
        

        The list can be empty if there are no parents, e.g. for the first commit in a repo.

        Regular merge commits have two parents.

        More than two parents are possible with an octopus merge (git merge with several branches, which selects the octopus strategy, -s octopus), but this is not a common workflow. Here is an example: https://github.com/cirosantilli/test-octopus-100k

      • {author_name}: e.g.: Ciro Santilli. Cannot contain <, \n

      • {author_email}: e.g.: cirosantilli@mail.com. Cannot contain >, \n

      • {author_date_seconds}: seconds since the Unix epoch (1970-01-01T00:00:00 UTC), e.g. 946684800 is the first second of the year 2000 in UTC

      • {author_date_timezone}: e.g.: +0000 is UTC

      • committer fields: analogous to author fields

      • {commit message}: arbitrary.
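
    The layout described above can be exercised with a short Python 3 sketch. It rebuilds the content of the first commit from the hex dump, adds the header, and parses the fields back out. The function names (frame, parse_person, parse_commit, to_datetime) are invented for this example, the parser only knows the header lines listed above, and it assumes the commit is UTF-8 text.

```python
import hashlib
from datetime import datetime, timedelta, timezone

# Content of the first commit, transcribed from the hex dump above.
CONTENT = (
    b"tree 496d6428b9cf92981dc9495211e6e1120fb6f2ba\n"
    b"author Author Name <author@example.com> 0 +0000\n"
    b"committer Committer Name <committer@example.com> 946684800 +0000\n"
    b"\n"
    b"First message\n"
)


def frame(content, obj_type=b"commit"):
    """Prepend the '{type} {size}' header and the NUL separator."""
    return obj_type + b" " + str(len(content)).encode("ascii") + b"\0" + content


def parse_person(value):
    """Split '{name} <{email}> {seconds} {timezone}' into its four fields."""
    name, _, rest = value.partition(" <")
    email, _, rest = rest.partition("> ")
    seconds, tz = rest.split(" ")
    return {"name": name, "email": email, "seconds": int(seconds), "tz": tz}


def parse_commit(content):
    """Parse tree, parents, author, committer and message (assumes UTF-8)."""
    head, _, message = content.decode("utf-8").partition("\n\n")
    commit = {"parents": [], "message": message}
    for line in head.split("\n"):
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)
        elif key in ("author", "committer"):
            commit[key] = parse_person(value)
        else:
            commit[key] = value
    return commit


def to_datetime(person):
    """Combine the epoch seconds and the +HHMM / -HHMM offset."""
    sign = -1 if person["tz"][0] == "-" else 1
    offset = timedelta(hours=int(person["tz"][1:3]), minutes=int(person["tz"][3:5]))
    return datetime.fromtimestamp(person["seconds"], timezone(sign * offset))


framed = frame(CONTENT)
commit = parse_commit(CONTENT)
print(framed[:11])                    # header, including the NUL byte
print(commit["tree"], commit["parents"])
print(to_datetime(commit["committer"]).isoformat())
# Git names an object by the SHA-1 of the framed bytes, so if the transcription
# above is exact this should print the ID seen in the .git/objects path.
print(hashlib.sha1(framed).hexdigest())
```

    The header comes out as commit 174, matching the first dump, and the parent list is empty because this is the root commit.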

    I've made a minimal Python script that generates a git repo with a few commits at: https://github.com/cirosantilli/test-git-web-interface/blob/864d809c36b8f3b232d5b0668917060e8bcba3e8/other-test-repos/util.py#L83

    I've used that for fun things like:

    • Who is the user with the longest streak on GitHub?
    • https://www.quora.com/Which-GitHub-repo-has-the-most-commits/answer/Ciro-Santilli
    • https://github.com/isaacs/github/issues/1344

    Here is an analogous analysis of the tag object format: What is the format of a git tag object and how to calculate its SHA?
