I am trying to extract git logs from a few repositories like this:
git log --pretty=format:%H\\t%ae\\t%an\\t%at\\t%s --numstat
For larger r
There is another avenue to improve git log performance, and it builds upon the commit graphs mentioned in the previous answer.
Git 2.27 (Q2 2020) introduces an extension to the commit-graph to make it efficient to check for the paths that were modified at each commit using Bloom filters.
See commit caf388c (09 Apr 2020), and commit e369698 (30 Mar 2020) by Derrick Stolee (derrickstolee).
See commit d5b873c, commit a759bfa, commit 42e50e7, commit a56b946, commit d38e07b, commit 1217c03, commit 76ffbca (06 Apr 2020), and commit 3d11275, commit f97b932, commit ed591fe, commit f1294ea, commit f52207a, commit 3be7efc (30 Mar 2020) by Garima Singh (singhgarima).
See commit d21ee7d (30 Mar 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 9b6606f, 01 May 2020)
revision.c: use Bloom filters to speed up path based revision walks
Helped-by: Derrick Stolee
Helped-by: SZEDER Gábor
Helped-by: Jonathan Tan
Signed-off-by: Garima Singh
Revision walk will now use Bloom filters for commits to speed up revision walks for a particular path (for computing history for that path), if they are present in the commit-graph file.
We load the Bloom filters during the prepare_revision_walk step, currently only when dealing with a single pathspec.
Extending it to work with multiple pathspecs can be explored and built on top of this series in the future.
While comparing trees in rev_compare_trees(), if the Bloom filter says that the file is not different between the two trees, we don't need to compute the expensive diff.
This is where we get our performance gains.
The other response of the Bloom filter is 'maybe', in which case we fall back to the full diff calculation to determine if the path was changed in the commit.
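To make the two possible answers concrete, here is a minimal sketch of the idea, not Git's actual implementation. The class name, the SHA-256-based hashing, and the `path_changed` helper are all assumptions made for illustration; the 10 bits per entry comes from the commit messages quoted here, and 7 hash functions is assumed as the default.

```python
import hashlib


class ChangedPathFilter:
    """Toy Bloom filter over the paths changed by one commit (illustrative only)."""

    def __init__(self, paths, bits_per_entry=10, num_hashes=7):
        self.num_hashes = num_hashes
        self.num_bits = max(8, len(paths) * bits_per_entry)
        self.bits = 0
        for path in paths:
            for pos in self._positions(path):
                self.bits |= 1 << pos

    def _positions(self, path):
        # Assumption: SHA-256 with a seed prefix stands in for Git's real hashing.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{path}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def maybe_changed(self, path):
        # False means "definitely not changed"; True only means "maybe".
        return all((self.bits >> pos) & 1 for pos in self._positions(path))


def path_changed(commit_filter, path, full_diff):
    """Consult the filter first, and run the expensive diff only on 'maybe'."""
    if not commit_filter.maybe_changed(path):
        return False  # definite "no": the diff is skipped entirely
    return full_diff(path)  # "maybe": fall back to the real comparison
```

A Bloom filter never gives a false negative, so skipping the diff on a "no" is safe. A false positive only costs one extra diff, which then returns the correct answer.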
We do not try to use Bloom filters when the '--walk-reflogs' option is specified.
The '--walk-reflogs' option does not walk the commit ancestry chain like the rest of the options.
Incorporating the performance gains when walking reflog entries would add more complexity, and can be explored in a later series.
Performance Gains: We tested the performance of git log -- <path> on the git repo, the linux repo, and some internal large repos, with a variety of paths of varying depths.
On the git and linux repos:
- we observed a 2x to 5x speed up.
On a large internal repo with files seated 6-10 levels deep in the tree:
- we observed 10x to 20x speed ups, with some paths going up to 28 times faster.
But: Git 2.27 (Q2 2020) also fixes a leak noticed by a fuzzer.
See commit fbda77c (04 May 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 95875e0, 08 May 2020)
commit-graph: avoid memory leaks
Signed-off-by: Jonathan Tan
Reviewed-by: Derrick Stolee
A fuzzer running on the entry point provided by fuzz-commit-graph.c revealed a memory leak when parse_commit_graph() creates a struct bloom_filter_settings and then returns early due to error.
Fix that error by always freeing that struct first (if it exists) before returning early due to error.
While making that change, I also noticed another possible memory leak - when the BLOOMDATA chunk is provided but not BLOOMINDEXES.
Also fix that error.
Git 2.27 (Q2 2020) improves the Bloom filter again:
See commit b928e48 (11 May 2020) by SZEDER Gábor (szeder).
See commit 2f6775f, commit 65c1a28, commit 8809328, commit 891c17c (11 May 2020), and commit 54c337b, commit eb591e4 (01 May 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 4b1e5e5, 14 May 2020)
bloom: de-duplicate directory entries
Signed-off-by: Derrick Stolee
When computing a changed-path Bloom filter, we need to take the files that changed from the diff computation and extract the parent directories. That way, a directory pathspec such as "Documentation" could match commits that change "Documentation/git.txt".
However, the current code does a poor job of this process.
The paths are added to a hashmap, but we do not check if an entry already exists with that path.
This can create many duplicate entries and cause the filter to have a much larger length than it should.
This means that the filter is more sparse than intended, which helps the false positive rate, but wastes a lot of space.
Properly use hashmap_get() before hashmap_add().
Also be sure to include a comparison function so these can be matched correctly.
This has an effect on a test in t0095-bloom.sh.
This makes sense, there are ten changes inside "smallDir" so the total number of paths in the filter should be 11.
This would result in 11 * 10 bits required, and with 8 bits per byte, this results in 14 bytes.
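The arithmetic above can be reproduced with a small sketch. The file names under "smallDir" and the helper names are hypothetical; only the figures of 10 bits per entry and 8 bits per byte come from the commit message. The sketch is a toy model of the de-duplication, not the code in bloom.c.

```python
from pathlib import PurePosixPath

BITS_PER_ENTRY = 10  # figure quoted in the commit message


def filter_entries(changed_files, dedupe=True):
    """List each changed file plus its parent directories as filter entries."""
    entries = []
    seen = set()
    for name in changed_files:
        path = PurePosixPath(name)
        candidates = [str(path)] + [str(d) for d in path.parents if str(d) != "."]
        for candidate in candidates:
            if dedupe and candidate in seen:
                continue  # already present: do not add it a second time
            seen.add(candidate)
            entries.append(candidate)
    return entries


def filter_bytes(num_entries):
    """Round the required number of bits up to whole bytes."""
    return (num_entries * BITS_PER_ENTRY + 7) // 8


# Hypothetical file names: ten changes inside "smallDir".
files = [f"smallDir/file{i}" for i in range(10)]

deduped = filter_entries(files)                   # 10 files + "smallDir" once
duplicated = filter_entries(files, dedupe=False)  # "smallDir" repeated 10 times

print(len(deduped), filter_bytes(len(deduped)))        # 11 14
print(len(duplicated), filter_bytes(len(duplicated)))  # 20 25
```

In this toy model, skipping the existence check nearly doubles the entry count, which is the kind of wasted space the fix removes.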
With Git 2.28 (Q3 2020), "git log -L..." now takes advantage of the "which paths are touched by this commit?" info stored in the commit-graph system.
For that, the changed-path Bloom filter is used.
See commit f32dde8 (11 May 2020) by Derrick Stolee (derrickstolee).
See commit 002933f, commit 3cb9d2b, commit 48da94b, commit d554672 (11 May 2020) by SZEDER Gábor (szeder).
(Merged by Junio C Hamano -- gitster -- in commit c3a0282, 09 Jun 2020)
line-log: integrate with changed-path Bloom filters
Signed-off-by: Derrick Stolee
The previous changes to the line-log machinery focused on making the first result appear faster. This was achieved by no longer walking the entire commit history before returning the early results.
There is still another way to improve the performance: walk most commits much faster. Let's use the changed-path Bloom filters to reduce time spent computing diffs.
Since the line-log computation requires opening blobs and checking the content-diff, there is still a lot of necessary computation that cannot be replaced with changed-path Bloom filters.
The part that we can reduce is most effective when checking the history of a file that is deep in several directories and those directories are modified frequently.
In this case, the computation to check if a commit is TREESAME to its first parent takes a large fraction of the time.
That is ripe for improvement with changed-path Bloom filters.
We must ensure that prepare_to_use_bloom_filters() is called in revision.c so that the bloom_filter_settings are loaded into the struct rev_info from the commit-graph.
Of course, some cases are still forbidden, but in the line-log case the pathspec is provided in a different way than normal.
Since multiple paths and segments could be requested, we compute the struct bloom_key data dynamically during the commit walk. This could likely be improved, but adds code complexity that is not valuable at this time.
There are two cases to care about: merge commits and "ordinary" commits.
- Merge commits have multiple parents, but if we are TREESAME to our first parent in every range, then pass the blame for all ranges to the first parent.
- Ordinary commits have the same condition, but each is done slightly differently in the process_ranges_[merge|ordinary]_commit() methods.
By checking if the changed-path Bloom filter can guarantee TREESAME, we can avoid that tree-diff cost. If the filter says "probably changed", then we need to run the tree-diff and then the blob-diff if there was a real edit.
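The "TREESAME in every range" rule can be sketched as a small walk along first parents. Everything here is hypothetical: the function name, the commit ids, and the use of plain sets as stand-ins for the per-commit filters (with one deliberate false positive). It only illustrates when the tree-diff can be skipped, not how line-log.c is written.

```python
def walk_ranges(commits, range_paths, tree_diff):
    """Return the commits that touch a tracked path, and how many diffs ran.

    commits: list of (commit_id, maybe_changed) pairs, newest first, where
    maybe_changed(path) plays the role of the changed-path filter.
    """
    blamed = []
    diffs_run = 0
    for commit_id, maybe_changed in commits:
        if not any(maybe_changed(path) for path in range_paths):
            # The filter guarantees TREESAME for every range:
            # pass the blame to the first parent without any diff.
            continue
        diffs_run += 1  # "probably changed": the tree-diff is needed
        if tree_diff(commit_id, range_paths):
            blamed.append(commit_id)
    return blamed, diffs_run


# Hypothetical history. The filter of c1 reports a false positive for fs/namei.c.
real_changes = {"c3": {"fs/namei.c"}, "c2": {"MAINTAINERS"}, "c1": {"fs/io_uring.c"}}
filter_hits = {
    "c3": {"fs/namei.c"},
    "c2": {"MAINTAINERS"},
    "c1": {"fs/io_uring.c", "fs/namei.c"},
}
commits = [(cid, filter_hits[cid].__contains__) for cid in ("c3", "c2", "c1")]


def tree_diff(commit_id, paths):
    return any(path in real_changes[commit_id] for path in paths)


blamed, diffs_run = walk_ranges(commits, ["fs/namei.c"], tree_diff)
print(blamed, diffs_run)  # ['c3'] 2
```

Commit c2 is skipped without a diff, while the false positive on c1 costs one extra diff but does not change the result.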
The Linux kernel repository is a good testing ground for the performance improvements claimed here.
There are two different cases to test:
- The first is the "entire history" case, where we output the entire history to /dev/null to see how long it would take to compute the full line-log history.
- The second is the "first result" case, where we find how long it takes to show the first value, which is an indicator of how quickly a user would see responses when waiting at a terminal.
To test, I selected the paths that were changed most frequently in the top 10,000 commits using this command (stolen from StackOverflow):
git log --pretty=format: --name-only -n 10000 | sort | \
    uniq -c | sort -rg | head -10
which results in
121 MAINTAINERS
63 fs/namei.c
60 arch/x86/kvm/cpuid.c
59 fs/io_uring.c
58 arch/x86/kvm/vmx/vmx.c
51 arch/x86/kvm/x86.c
45 arch/x86/kvm/svm.c
42 fs/btrfs/disk-io.c
42 Documentation/scsi/index.rst
(along with a bogus first result).
It appears that the path arch/x86/kvm/svm.c was renamed, so we ignore that entry. This leaves the following results for the real command time:
|                              | Entire History  | First Result    |
| Path                         | Before | After  | Before | After  |
|------------------------------|--------|--------|--------|--------|
| MAINTAINERS                  | 4.26 s | 3.87 s | 0.41 s | 0.39 s |
| fs/namei.c                   | 1.99 s | 0.99 s | 0.42 s | 0.21 s |
| arch/x86/kvm/cpuid.c         | 5.28 s | 1.12 s | 0.16 s | 0.09 s |
| fs/io_uring.c                | 4.34 s | 0.99 s | 0.94 s | 0.27 s |
| arch/x86/kvm/vmx/vmx.c       | 5.01 s | 1.34 s | 0.21 s | 0.12 s |
| arch/x86/kvm/x86.c           | 2.24 s | 1.18 s | 0.21 s | 0.14 s |
| fs/btrfs/disk-io.c           | 1.82 s | 1.01 s | 0.06 s | 0.05 s |
| Documentation/scsi/index.rst | 3.30 s | 0.89 s | 1.46 s | 0.03 s |
It is worth noting that the least speedup comes for the MAINTAINERS file which is:
- edited frequently,
- low in the directory hierarchy, and
- quite a large file.
All of those points lead to spending more time doing the blob diff and less time doing the tree diff.
Still, we see some improvement in that case and significant improvement in other cases.
A 2-4x speedup is likely the more typical case as opposed to the small 5% change for that file.
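As a quick check of these ratios, the "Entire History" columns of the table can be turned into speedup factors. The variable names are illustrative; the timings are copied from the table.

```python
# (before, after) timings in seconds for the "Entire History" case.
timings = {
    "MAINTAINERS": (4.26, 3.87),
    "fs/namei.c": (1.99, 0.99),
    "arch/x86/kvm/cpuid.c": (5.28, 1.12),
    "fs/io_uring.c": (4.34, 0.99),
    "arch/x86/kvm/vmx/vmx.c": (5.01, 1.34),
    "arch/x86/kvm/x86.c": (2.24, 1.18),
    "fs/btrfs/disk-io.c": (1.82, 1.01),
    "Documentation/scsi/index.rst": (3.30, 0.89),
}

# Speedup factor = time before / time after.
speedups = {path: before / after for path, (before, after) in timings.items()}

for path, factor in sorted(speedups.items(), key=lambda item: item[1]):
    print(f"{path:30s} {factor:.2f}x")
```

With these numbers, MAINTAINERS comes out at about 1.1x, while the other paths land between roughly 1.8x and 4.7x.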
With Git 2.29 (Q4 2020), the changed-path Bloom filter is improved using ideas from an independent implementation.
See commit 7fbfe07, commit bb4d60e, commit 5cfa438, commit 2ad4f1a, commit fa79653, commit 0ee3cb8, commit 1df15f8, commit 6141cdf, commit cb9daf1, commit 35a9f1e (05 Jun 2020) by SZEDER Gábor (szeder).
(Merged by Junio C Hamano -- gitster -- in commit de6dda0, 30 Jul 2020)
commit-graph: simplify parse_commit_graph() #1
Signed-off-by: SZEDER Gábor
Signed-off-by: Derrick Stolee
While we iterate over all entries of the Chunk Lookup table we make sure that we don't attempt to read past the end of the mmap-ed commit-graph file, and check in each iteration that the chunk ID and offset we are about to read is still within the mmap-ed memory region. However, these checks in each iteration are not really necessary, because the number of chunks in the commit-graph file is already known before this loop from the just parsed commit-graph header.
So let's check that the commit-graph file is large enough for all entries in the Chunk Lookup table before we start iterating over those entries, and drop those per-iteration checks.
While at it, take into account the size of everything that is necessary to have a valid commit-graph file, i.e. the size of the header, the size of the mandatory OID Fanout chunk, and the size of the signature in the trailer as well.
Note that this necessitates the change of the error message as well.
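The single up-front check can be sketched as follows. The constants are assumptions about the commit-graph file layout (8-byte header, 12-byte lookup entries, a 256-entry fanout of 4-byte counters, a 20-byte SHA-1 trailer), and the function names are hypothetical; this is not the C code from commit-graph.c.

```python
HEADER_SIZE = 8           # assumption: signature, versions, chunk count, base count
LOOKUP_ENTRY_SIZE = 12    # assumption: 4-byte chunk ID + 8-byte offset
FANOUT_SIZE = 256 * 4     # assumption: mandatory OID Fanout chunk
HASH_SIZE = 20            # assumption: SHA-1 checksum in the trailer


def min_graph_size(num_chunks):
    """Smallest size a file with num_chunks lookup entries could have."""
    # The lookup table has one extra entry for the terminating label.
    lookup_size = (num_chunks + 1) * LOOKUP_ENTRY_SIZE
    return HEADER_SIZE + lookup_size + FANOUT_SIZE + HASH_SIZE


def check_graph_size(file_size, num_chunks):
    """Validate once, before iterating, instead of on every loop iteration."""
    needed = min_graph_size(num_chunks)
    if file_size < needed:
        raise ValueError(
            f"commit-graph file is too small: {file_size} bytes, need at least {needed}"
        )


print(min_graph_size(3))  # 1100
```

Because the chunk count is known from the header, one comparison before the loop covers every lookup entry that the loop will later read.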
And commit-graph:
The Chunk Lookup table stores the chunks' starting offset in the commit-graph file, not their sizes.
Consequently, the size of a chunk can only be calculated by subtracting its offset from the offset of the subsequent chunk (or that of the terminating label).
This is currently implemented in a bit complicated way: as we iterate over the entries of the Chunk Lookup table, we check the id of each chunk and store its starting offset, then we check the id of the last seen chunk and calculate its size using its previously saved offset.
At the moment there is only one chunk for which we calculate its size, but this patch series will add more, and the repeated chunk id checks are not that pretty.
Instead let's read ahead the offset of the next chunk on each iteration, so we can calculate the size of each chunk right away, right where we store its starting offset.
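The read-ahead idea can be illustrated with a short sketch. The 12-byte entry layout (4-byte chunk ID followed by an 8-byte big-endian offset) is an assumption about the file format, and the offsets in the example table are made up; the function name is hypothetical.

```python
import struct

# Assumption: each lookup entry is a 4-byte chunk ID plus an 8-byte big-endian offset.
ENTRY = struct.Struct(">4sQ")


def chunk_sizes(lookup_table, num_chunks):
    """Map each chunk ID to (offset, size) by reading ahead to the next entry."""
    chunks = {}
    for i in range(num_chunks):
        chunk_id, offset = ENTRY.unpack_from(lookup_table, i * ENTRY.size)
        # Read ahead: the next entry (or the terminating label) marks where this chunk ends.
        _, next_offset = ENTRY.unpack_from(lookup_table, (i + 1) * ENTRY.size)
        chunks[chunk_id.decode("ascii")] = (offset, next_offset - offset)
    return chunks


# Made-up table: three chunks followed by the terminating label.
table = b"".join(
    ENTRY.pack(chunk_id, offset)
    for chunk_id, offset in [
        (b"OIDF", 56),
        (b"OIDL", 1080),
        (b"CDAT", 1280),
        (b"\0\0\0\0", 1640),
    ]
)

print(chunk_sizes(table, 3))
# {'OIDF': (56, 1024), 'OIDL': (1080, 200), 'CDAT': (1280, 360)}
```

Each size is available in the same iteration that stores the chunk's starting offset, so no "last seen chunk" bookkeeping is needed.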