I know that the history in Git is stored in a data structure called a DAG. I've heard about DFS and know it's somewhat related.
I'm curious, how do programs such
Note: Git 2.18 (Q2 2018) does now pre-compute and store information necessary for ancestry traversal in a separate file to optimize graph walking.
That notion of a commit graph changes how 'git log --graph' works.
As mentioned here:
git config --global core.commitGraph true
git config --global gc.writeCommitGraph true
cd /path/to/repo
git commit-graph write
See commit 7547b95, commit 3d5df01, commit 049d51a, commit 177722b, commit 4f2542b, commit 1b70dfd, commit 2a2e32b (10 Apr 2018), and commit f237c8b, commit 08fd81c, commit 4ce58ee, commit ae30d7b, commit b84f767, commit cfe8321, commit f2af9f5 (02 Apr 2018) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit b10edb2, 08 May 2018)
You now have the command git commit-graph: Write and verify Git commit graph files.
Write a commit graph file based on the commits found in packfiles.
Includes all commits from the existing commit graph file.
The design document states:
Git walks the commit graph for many reasons, including:
- Listing and filtering commit history.
- Computing merge bases.
These operations can become slow as the commit count grows. The merge base calculation shows up in many user-facing commands, such as 'merge-base' or 'status' and can take minutes to compute depending on history shape.
There are two main costs here:
- Decompressing and parsing commits.
- Walking the entire graph to satisfy topological order constraints.
The commit graph file is a supplemental data structure that accelerates commit graph walks. If a user downgrades or disables the 'core.commitGraph' config setting, then the existing ODB is sufficient.
The file is stored as "commit-graph" either in the .git/objects/info directory or in the info directory of an alternate.
The commit graph file stores the commit graph structure along with some extra metadata to speed up graph walks.
By listing commit OIDs in lexicographic order, we can identify an integer position for each commit and refer to the parents of a commit using those integer positions.
We use binary search to find initial commits and then use the integer positions for fast lookups during the walk.
You can see the test use cases:
git log --oneline $BRANCH
git log --topo-order $BRANCH
git log --graph $COMPARE..$BRANCH
git branch -vv
git merge-base -a $BRANCH $COMPARE
This will improve git log performance.
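The integer-position idea quoted from the design document can be illustrated with a small sketch. This is not Git's file format or its C code: the function names (`build_graph_index`, `lookup`, `ancestors`) are made up for illustration, and it assumes every parent OID is itself present in the input map.

```python
from bisect import bisect_left

def build_graph_index(parents_by_oid):
    """Return (sorted_oids, parent_positions) for a {oid: [parent oids]} map."""
    oids = sorted(parents_by_oid)                      # lexicographic OID order
    position = {oid: i for i, oid in enumerate(oids)}  # OID -> integer position
    parent_positions = [
        [position[p] for p in parents_by_oid[oid]] for oid in oids
    ]
    return oids, parent_positions

def lookup(oids, oid):
    """Binary-search the sorted OID list; return the position or None."""
    i = bisect_left(oids, oid)
    return i if i < len(oids) and oids[i] == oid else None

def ancestors(oids, parent_positions, start_oid):
    """Walk from start_oid; after the first lookup only integers are used."""
    start = lookup(oids, start_oid)
    if start is None:
        return set()
    seen, stack = {start}, [start]
    while stack:
        for p in parent_positions[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return {oids[i] for i in seen}
```

Only the starting commit needs a binary search; every later step follows integer positions, so the walk never has to decompress or parse a commit object.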
Git 2.19 (Q3 2018) will take care of the lock file:
See commit 33286dc (10 May 2018), commit 1472978, commit 7adf526, commit 04bc8d1, commit d7c1ec3, commit f9b8908, commit 819807b, commit e2838d8, commit 3afc679, commit 3258c66 (01 May 2018), and commit 83073cc, commit 8fb572a (25 Apr 2018) by Derrick Stolee (derrickstolee).
Helped-by: Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit a856e7d, 25 Jun 2018)
commit-graph: fix UX issue when .lock file exists
We use the lockfile API to avoid multiple Git processes from writing to the commit-graph file in the .git/objects/info directory.
In some cases, this directory may not exist, so we check for its existence.
The existing code does the following when acquiring the lock:
- Try to acquire the lock.
- If it fails, try to create the .git/objects/info directory.
- Try to acquire the lock, failing if necessary.
The problem is that if the lockfile exists, then the mkdir fails, giving an error that doesn't help the user:
"fatal: cannot mkdir .git/objects/info: File exists"
While technically this honors the lockfile, it does not help the user.
Instead, do the following:
- Check for existence of .git/objects/info; create if necessary.
- Try to acquire the lock, failing if necessary.
The new output looks like:
fatal: Unable to create '<dir>/.git/objects/info/commit-graph.lock': File exists. Another git process seems to be running in this repository, e.g. an editor opened by 'git commit'. Please make sure all processes are terminated then try again. If it still fails, a git process may have crashed in this repository earlier: remove the file manually to continue.
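The corrected order of operations (directory first, then the lock) can be sketched as follows. This is only an analogy in Python, not Git's lockfile API: `acquire_lock` is a hypothetical helper, and the error text is abbreviated.

```python
import os

def acquire_lock(info_dir, name="commit-graph"):
    """Create info_dir if needed, then take <name>.lock exclusively."""
    # Step 1: make sure the directory exists; an existing one is not an error.
    os.makedirs(info_dir, exist_ok=True)
    lock_path = os.path.join(info_dir, name + ".lock")
    try:
        # Step 2: O_EXCL makes creation fail if the lock file already exists,
        # so the only "File exists" error left is about the lock itself.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        raise RuntimeError(
            f"Unable to create '{lock_path}': File exists."
        ) from None
    return fd, lock_path
```

Because the directory check no longer runs as a reaction to a failed lock attempt, a stale lock file can no longer surface as a misleading mkdir error.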
Note: The commit-graph facility did not work when in-core objects that are promoted from unknown type to commit (e.g. a commit that is accessed via a tag that refers to it) were involved, which has been corrected with Git 2.21 (Feb. 2019)
See commit 4468d44 (27 Jan 2019) by SZEDER Gábor (szeder).
(Merged by Junio C Hamano -- gitster -- in commit 2ed3de4, 05 Feb 2019)
That algorithm is being refactored in Git 2.23 (Q3 2019).
See commit 238def5, commit f998d54, commit 014e344, commit b2c8306, commit 4c9efe8, commit ef5b83f, commit c9905be, commit 10bd0be, commit 5af8039, commit e103f72 (12 Jun 2019), and commit c794405 (09 May 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit e116894, 09 Jul 2019)
Commit 10bd0be explains the change of scope.
With Git 2.24 (Q4 2019), the code to write commit-graph over given commit object names has been made a bit more robust.
See commit 7c5c9b9, commit 39d8831, commit 9916073 (05 Aug 2019) by SZEDER Gábor (szeder).
(Merged by Junio C Hamano -- gitster -- in commit 6ba06b5, 22 Aug 2019)
And, still with Git 2.24 (Q4 2019), the code to parse and use the commit-graph file has been made more robust against corrupted input.
See commit 806278d, commit 16749b8, commit 23424ea (05 Sep 2019) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 80693e3, 07 Oct 2019)
t/t5318: introduce failing 'git commit-graph write' tests
When invoking 'git commit-graph' in a corrupt repository, one can cause a segfault when ancestral commits are corrupt in one way or another.
This is due to two function calls in the 'commit-graph.c' code that may return NULL, but are not checked for NULL-ness before dereferencing.
Hence:
commit-graph.c: handle commit parsing errors
To write a commit graph chunk, 'write_graph_chunk_data()' takes a list of commits to write and parses each one before writing the necessary data, and continuing on to the next commit in the list.
Since the majority of these commits are not parsed ahead of time (an exception is made for the last commit in the list, which is parsed early within 'copy_oids_to_commits'), it is possible that calling 'parse_commit_no_graph()' on them may return an error.
Failing to catch these errors before de-referencing later calls can result in an undefined memory access and a SIGSEGV. One such example of this is 'get_commit_tree_oid()', which expects a parsed object as its input (in this case, the commit-graph code passes '*list').
If '*list' causes a parse error, the subsequent call will fail.
Prevent such an issue by checking the return value of 'parse_commit_no_graph()' to avoid passing an unparsed object to a function which expects a parsed object, thus preventing a segfault.
With Git 2.26 (Q1 2020), the code to compute the commit-graph has been taught to use a more robust way to tell if two object directories refer to the same thing.
See commit a7df60c, commit ad2dd5b, commit 13c2499 (03 Feb 2020), commit 0bd52e2 (04 Feb 2020), and commit 1793280 (30 Jan 2020) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 53c3be2, 14 Feb 2020)
commit-graph.h: store an odb in 'struct write_commit_graph_context'
Signed-off-by: Taylor Blau
There are lots of places in commit-graph.h where a function either has (or almost has) a full 'struct object_directory *', accesses '->path', and then throws away the rest of the struct.
This can cause headaches when comparing the locations of object directories across alternates (e.g., in the case of deciding if two commit-graph layers can be merged).
These paths are normalized with 'normalize_path_copy()', which mitigates some comparison issues, but not all.
Replace usage of 'char *object_dir' with 'odb->path' by storing a 'struct object_directory *' in the 'write_commit_graph_context' structure.
This is an intermediate step towards getting rid of all path normalization in 'commit-graph.c'.
Resolving a user-provided '--object-dir' argument now requires that we compare it to the known alternates for equality.
Prior to this patch, an unknown '--object-dir' argument would silently exit with status zero.
This can clearly lead to unintended behavior, such as verifying commit-graphs that aren't in a repository's own object store (or one of its alternates), or causing a typo to mask a legitimate commit-graph verification failure.
Make this error non-silent by 'die()'-ing when the given '--object-dir' does not match any known alternate object store.
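The behaviour change described in that commit message (match the requested directory against the known object stores, and fail loudly otherwise) can be sketched like this. It is only an illustration: `find_object_dir` is a hypothetical helper, comparing with `os.path.realpath` is my assumption rather than what Git does internally, and the error wording is illustrative.

```python
import os

def find_object_dir(requested, known_object_dirs):
    """Return the known object directory matching `requested`, or exit loudly."""
    wanted = os.path.realpath(requested)
    for candidate in known_object_dirs:
        # Compare resolved paths so that "objects" and "./objects" match.
        if os.path.realpath(candidate) == wanted:
            return candidate
    # Analogous to die(): an unknown directory is an error, not a silent no-op.
    raise SystemExit(
        f"fatal: could not find object directory matching {requested}"
    )
```

With this shape, a typo in the directory name stops the command with a non-zero status instead of letting a verification step appear to pass.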
With Git 2.28 (Q3 2020), 'commit-graph write --stdin-commits' is optimized.
See commit 2f00c35, commit 1f1304d, commit 0ec2d0f, commit 5b6653e, commit 630cd51, commit d335ce8 (13 May 2020), commit fa8953c (18 May 2020), and commit 1fe1084 (05 May 2020) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit dc57a9b, 09 Jun 2020)
commit-graph: drop COMMIT_GRAPH_WRITE_CHECK_OIDS flag
Helped-by: Jeff King
Signed-off-by: Taylor Blau
Since 7c5c9b9c57 ("commit-graph: error out on invalid commit oids in 'write --stdin-commits'", 2019-08-05, Git v2.24.0-rc0 -- merge listed in batch #1), the commit-graph builtin dies on receiving non-commit OIDs as input to '--stdin-commits'.
This behavior can be cumbersome to work around in, say, the case of piping 'git for-each-ref' to 'git commit-graph write --stdin-commits' if the caller does not want to cull out non-commits themselves. In this situation, it would be ideal if 'git commit-graph write' wrote the graph containing the inputs that did pertain to commits, and silently ignored the remainder of the input.
Some options have been proposed to the effect of '--[no-]check-oids' which would allow callers to have the commit-graph builtin do just that.
After some discussion, it is difficult to imagine a caller who wouldn't want to pass '--no-check-oids', suggesting that we should get rid of the behavior of complaining about non-commit inputs altogether.
If callers do wish to retain this behavior, they can easily work around this change by doing the following:
git for-each-ref --format='%(objectname) %(objecttype) %(*objecttype)' |
awk '
  !/commit/ { print "not-a-commit:"$1 }
   /commit/ { print $1 }
' |
git commit-graph write --stdin-commits
To make it so that valid OIDs that refer to non-existent objects are indeed an error after loosening the error handling, perform an extra lookup to make sure that object indeed exists before sending it to the commit-graph internals.
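For readers less used to awk, the filter in that workaround can be restated in Python. This is a sketch of the same text transformation only: the function name is hypothetical, and it assumes input lines in the 'objectname objecttype *objecttype' layout produced by the for-each-ref format above.

```python
def stdin_commits_input(ref_lines):
    """Turn 'oid type [peeled-type]' lines into input for --stdin-commits."""
    out = []
    for line in ref_lines:
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        oid, types = fields[0], fields[1:]
        if "commit" in types:
            out.append(oid)               # a commit, or a tag peeling to one
        else:
            # Deliberately produce an invalid OID so the writer rejects it,
            # which restores the old "complain about non-commits" behaviour.
            out.append("not-a-commit:" + oid)
    return out
```

A tag that points at a commit keeps its OID because its peeled type is 'commit'; anything else is rewritten into a string that cannot be parsed as an object name.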
This is tested with Git 2.28 (Q3 2020).
See commit 94fbd91 (01 Jun 2020), and commit 6334c5f (03 Jun 2020) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit abacefe, 18 Jun 2020)
t5318: test that '--stdin-commits' respects '--[no-]progress'
Signed-off-by: Taylor Blau
Acked-by: Derrick Stolee
The following lines were not covered in a recent line-coverage test against Git:
builtin/commit-graph.c
5b6653e5 244) progress = start_delayed_progress(
5b6653e5 268) stop_progress(&progress);
These statements are executed when both '--stdin-commits' and '--progress' are passed. Introduce a trio of tests that exercise various combinations of these options to ensure that these lines are covered.
More importantly, this is exercising a (somewhat) previously-ignored feature of '--stdin-commits', which is that it respects '--progress'.
Prior to 5b6653e523 ("[builtin/commit-graph.c](https://github.com/git/git/blob/94fbd9149a2d59b0dca18448ef9d3e0607a7a19d/builtin/commit-graph.c): dereference tags in builtin", 2020-05-13, Git v2.28.0 -- merge listed in batch #2), dereferencing input from '--stdin-commits' was done inside of commit-graph.c.
Now that an additional progress meter may be generated from outside of commit-graph.c, add a corresponding test to make sure that it also respects '--[no-]progress'.
The other location that generates progress meter output (from d335ce8f24 ("[commit-graph.c](https://github.com/git/git/blob/94fbd9149a2d59b0dca18448ef9d3e0607a7a19d/commit-graph.c): show progress of finding reachable commits", 2020-05-13, Git v2.28.0 -- merge listed in batch #2)) is already covered by any test that passes '--reachable'.
With Git 2.29 (Q4 2020), in_merge_bases_many(), a way to see if a commit is reachable from any commit in a set of commits, was totally broken when the commit-graph feature was in use, which has been corrected.
See commit 8791bf1 (02 Oct 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit c01b041, 05 Oct 2020)
commit-reach: fix in_merge_bases_many bug
Reported-by: Srinidhi Kaushik
Helped-by: Johannes Schindelin
Signed-off-by: Derrick Stolee
Way back in f9b8908b ("[commit.c](https://github.com/git/git/blob/8791bf18414a37205127e184c04cad53a43aeff1/commit.c): use generation numbers for in_merge_bases()", 2018-05-01, Git v2.19.0-rc0 -- merge listed in batch #1), a heuristic was used to short-circuit the in_merge_bases() walk.
This works just fine as long as the caller is checking only two commits, but when there are multiple, there is a possibility that this heuristic is very wrong.
Some code moves since then have changed this method to repo_in_merge_bases_many() inside commit-reach.c. The heuristic computes the minimum generation number of the "reference" list, then compares this number to the generation number of the "commit".
In a recent topic, a test was added that used in_merge_bases_many() to test if a commit was reachable from a number of commits pulled from a reflog. However, this highlighted the problem: if any of the reference commits have a smaller generation number than the given commit, then the walk is skipped _even if there exist some with higher generation number_.
This heuristic is wrong! It must check the MAXIMUM generation number of the reference commits, not the MINIMUM.
The fix itself is to swap min_generation with a max_generation in repo_in_merge_bases_many().
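The difference between the two cutoffs is easy to reproduce on a toy history. The sketch below is not Git's C implementation: the function signature, the dictionaries, and the `use_max` switch are illustrative assumptions, and generation numbers follow the usual definition (1 for a root, otherwise one more than the largest parent generation).

```python
def in_merge_bases_many(commit, references, parents, generation, use_max=True):
    """True if `commit` is reachable from any commit in `references`."""
    gens = [generation[r] for r in references]
    # use_max=False reproduces the buggy minimum-based short-circuit.
    cutoff = max(gens) if use_max else min(gens)
    if generation[commit] > cutoff:
        return False  # no reference is high enough to have `commit` as ancestor
    seen, stack = set(references), list(references)
    while stack:
        current = stack.pop()
        if current == commit:
            return True
        for parent in parents[current]:
            # A commit with a smaller generation cannot reach `commit`.
            if parent not in seen and generation[parent] >= generation[commit]:
                seen.add(parent)
                stack.append(parent)
    return False
```

Take a chain A <- B <- C plus an unrelated root D, and ask whether B is reachable from [C, D]. The minimum generation among the references is 1 (from D), which is below B's generation of 2, so the minimum-based check skips the walk and answers "no" even though C clearly reaches B; the maximum-based check performs the walk and finds it.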
I tried looking around Git or hg's code but it's very hard to follow and get a general idea of what's going on.
For hg, did you try to follow the code in hg itself, or in graphlog?
Because the code of graphlog is pretty short. You can find it in hgext/graphlog.py, and really the important part is the top ~200 lines; the rest is the extension's bootstrapping and finding the revision graph selected. The code generation function is ascii, with its last parameter being the result of a call to asciiedge (the call itself is performed on the last line of generate, the function being provided to generate by graphlog).
First, one obtains a list of commits (as with git rev-list), and parents of each commit. A "column reservation list" is kept in memory.
For each commit then:
Example showing output of git-forest on aufs2-util (with an extra commit to have more than one branch).
With lookahead, one can anticipate how far down the merge point will be and squeeze the wood between two columns to give a more aesthetically pleasing result.
This particular problem isn't that hard, compared to graph display in general. Because you want to keep the nodes in the order they were committed the problem gets much simpler.
Also note that the display model is grid based, rows are commits and columns are edges into the past/future.
While I didn't read the git source you probably just walk the list of commits, starting from the newest, and maintain a list of open edges into the past. Following the edges naturally leads to splitting/merging columns and you end up with the kind of tree git/hg display.
When merging edges you want to avoid crossing other edges, so you'll have to try to order your columns ahead of time. This is actually the only part that may not be straightforward. For example one could do a two-pass algorithm, making up a column order for the edges in the first pass and doing the drawing in the second pass.
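The "list of open edges" idea can be sketched in a few lines. This is a deliberately simplified illustration, not what git or hg actually do: `render` is a hypothetical function, it assumes the commits arrive newest first with children before parents, and it draws only one row per commit, with no connector rows ('/' and '\') and no attempt at the column-ordering pass.

```python
def render(commits):
    """commits: list of (oid, [parent oids]), newest first. Returns text rows."""
    columns = []  # columns[i] = oid that the open edge in column i leads to
    rows = []
    for oid, parents in commits:
        if oid not in columns:
            columns.append(oid)          # a new branch tip opens a column
        col = columns.index(oid)
        cells = ["*" if i == col else "|" for i in range(len(columns))]
        rows.append(" ".join(cells) + " " + oid)
        # Replace this commit's edge with edges to parents nobody waits for yet;
        # a parent already expected in another column merges into that column.
        new_parents = [p for p in parents if p not in columns]
        columns[col:col + 1] = new_parents
    return rows

if __name__ == "__main__":
    history = [("M", ["B", "C"]), ("C", ["A"]), ("B", ["A"]), ("A", [])]
    print("\n".join(render(history)))
```

For the small history in the `__main__` block (a merge M of two branches B and C that both start at A), this prints:

```
* M
| * C
* | B
* A
```

The merge opens a second column, and the two columns collapse into one again when both edges lead to the same parent.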