I have a 10 GB repo on a Linux machine, stored on NFS. The first git status takes 36 minutes and subsequent runs take 8 minutes, so Git apparently depends on the OS for file caching. Git commands such as commit and status that walk or pack/repack the whole repo take a very long time on a repo this size. I'm not sure if you have used git status on such a large repo, but has anyone come across this issue?
I have tried git gc, git clean, and git repack, but the time taken is still almost the same.
Will submodules or other approaches like breaking the repo into smaller ones help? If so, which approach is best for splitting a larger repo? Is there any other way to improve the time taken by git commands on a large repo?
To be more precise, Git depends on the efficiency of the lstat(2) system call, so tweaking your client’s “attribute cache timeout” might do the trick.
The manual for git-update-index — essentially a manual mode for git-status — describes what you can do to alleviate this: use the --assume-unchanged flag to suppress the normal behavior and manually update the paths that you have changed. You might even program your editor to unset this flag every time you save a file.
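As a sketch of that workflow (the path big/generated.txt is just an example name, not from the question):

```shell
# Tell Git to stop lstat()-ing this path on every status/commit
git update-index --assume-unchanged big/generated.txt

# Flagged paths show a lowercase letter in the -v listing
git ls-files -v big/generated.txt

# Clear the flag again once you actually change the file
git update-index --no-assume-unchanged big/generated.txt
```

The flag only suppresses the stat check; if you edit a flagged file, Git will not notice until you clear the flag.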
The alternative, as you suggest, is to reduce the size of your checkout (the size of the packfiles doesn’t really come into play here). The options are a sparse checkout, submodules, or Google’s repo tool.
(There’s a mailing list thread about using Git with NFS, but it doesn’t answer many questions.)
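For the sparse-checkout route, the classic setup (before the dedicated git sparse-checkout command arrived in Git 2.25) looks roughly like this; the src/ path is only an example:

```shell
# Enable sparse checkout for this working tree
git config core.sparseCheckout true

# List the paths you actually want checked out
echo "src/" >> .git/info/sparse-checkout

# Re-apply the checkout rules to the working tree
git read-tree -mu HEAD
```

This shrinks the working tree git status has to scan without changing the repository history.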
I'm also seeing this problem on a large project shared over NFS.
It took me some time to discover the -uno flag, which can be given to both git commit and git status.
This flag disables the search for untracked files, which reduces the number of NFS operations significantly. To discover untracked files, Git has to look in every subdirectory, so if you have many subdirectories this will hurt you. Disabling the search eliminates all of those NFS operations.
Combine this with the core.preloadindex setting and you can get reasonable performance even on NFS.
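Concretely, the two tweaks together are:

```shell
# Skip the recursive scan for untracked files (fewer NFS round-trips);
# note that genuinely untracked files will simply not be reported
git status -uno

# Let Git stat index entries in parallel threads
git config core.preloadindex true
```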
Try git gc. Also, git clean may help.
UPDATE - Not sure where the downvote came from, but the git manual specifically states:
Runs a number of housekeeping tasks within the current repository, such as compressing file revisions (to reduce disk space and increase performance) and removing unreachable objects which may have been created from prior invocations of git add.
Users are encouraged to run this task on a regular basis within each repository to maintain good disk space utilization and good operating performance.
I always notice a difference after running git gc when git status is slow!
UPDATE II - Not sure how I missed this, but the OP already tried git gc and git clean. I swear that wasn't originally there, but I don't see any changes in the edits. Sorry for that!
If your git repo makes heavy use of submodules, you can greatly speed up the performance of git status by editing the config file in the .git directory and setting ignore = dirty
on any particularly large/heavy submodules. For example:
[submodule "mysubmodule"]
    url = ssh://mysubmoduleURL
    ignore = dirty
You'll lose the convenience of a reminder that there are unstaged changes in any of the submodules that you may have forgotten about, but you'll still retain the main convenience of knowing when the submodules are out of sync with the main repo. Plus, you can still change your working directory to the submodule itself and use git status within it as per usual to see more information. See this question for more details about what "dirty" means.
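The same setting can be applied without editing the file by hand (using the mysubmodule name from the example above):

```shell
# Equivalent to adding "ignore = dirty" under [submodule "mysubmodule"]
git config submodule.mysubmodule.ignore dirty

# Or skip dirty-submodule checks for a single invocation only
git status --ignore-submodules=dirty
```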
The performance of git status should improve with Git 2.13 (Q2 2017).
See commit 950a234 (14 Apr 2017) by Jeff Hostetler (jeffhostetler).
(Merged by Junio C Hamano -- gitster -- in commit 8b6bba6, 24 Apr 2017)

string-list: use ALLOC_GROW macro when reallocing string_list

Use the ALLOC_GROW() macro when reallocing a string_list array rather than simply increasing it by 32. This is a performance optimization.

During status on a very large repo with many changes, a significant percentage of the total run time is spent reallocing the wt_status.changes array.

This change decreases the time in wt_status_collect_changes_worktree() from 125 seconds to 45 seconds on my very large repository.
Plus, Git 2.17 (Q2 2018) will introduce a new trace, for measuring where the time is spent in the index-heavy operations.
See commit ca54d9b (27 Jan 2018) by Nguyễn Thái Ngọc Duy (pclouds).
(Merged by Junio C Hamano -- gitster -- in commit 090dbea, 15 Feb 2018)
trace: measure where the time is spent in the index-heavy operations

All the known heavy code blocks are measured (except object database access). This should help identify whether an optimization is effective or not.
An unoptimized git status would give something like the following:
0.001791141 s: read cache ...
0.004011363 s: preload index
0.000516161 s: refresh index
0.003139257 s: git command: ... 'status' '--porcelain=2'
0.006788129 s: diff-files
0.002090267 s: diff-index
0.001885735 s: initialize name hash
0.032013138 s: read directory
0.051781209 s: git command: './git' 'status'
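A trace like the one above can be produced with the GIT_TRACE_PERFORMANCE environment variable (the performance trace itself predates 2.17; the 2.17 work added the index-heavy measurements to it):

```shell
# Write timing output to stderr alongside the normal status output
GIT_TRACE_PERFORMANCE=1 git status

# Or send the timings to a file for later inspection
GIT_TRACE_PERFORMANCE=/tmp/git-perf.log git status
```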
The same Git 2.17 (Q2 2018) improves git status with:
commit f39a757, commit 3ca1897, commit fd9b544, commit d7d1b49 (09 Jan 2018) by Jeff Hostetler (jeffhostetler).
(Merged by Junio C Hamano -- gitster -- in commit 4094e47, 08 Mar 2018)
"git status" can spend a lot of cycles to compute the relation between the current branch and its upstream, which can now be disabled with the "--no-ahead-behind" option.

commit ebbed3b (25 Feb 2018) by Derrick Stolee (derrickstolee).

revision.c: reduce object database queries

In mark_parents_uninteresting(), we check for the existence of an object file to see if we should treat a commit as parsed. The result is to set the "parsed" bit on the commit.

Modify the condition to only check has_object_file() if the result would change the parsed bit.

When a local branch is different from its upstream ref, "git status" will compute ahead/behind counts. This uses paint_down_to_common() and hits mark_parents_uninteresting().

On a copy of the Linux repo with a local instance of "master" behind the remote branch "origin/master" by ~60,000 commits, we find the performance of "git status" went from 1.42 seconds to 1.32 seconds, for a relative difference of -7.0%.
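With Git 2.17 or later, the ahead/behind computation can be skipped per invocation (the status.aheadBehind config knob, if your Git has it, makes this the default):

```shell
# Skip the ahead/behind walk for this run only (Git 2.17+)
git status --no-ahead-behind
```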
git config --global core.preloadIndex true
Did the job for me. Check the official git-config documentation for core.preloadIndex.
In our codebase, where we have somewhere in the range of 20-30 submodules, git status --ignore-submodules sped things up for me drastically. Do note that this will not report on the status of submodules.
Something that hasn't been mentioned yet is to activate the filesystem cache on Windows machines (Linux filesystems are completely different, and Git was optimized for them, so this probably only helps on Windows).
git config core.fscache true
As a last resort, if Git is still slow, you can turn off the modification time inspection that Git needs to find out which files have changed.
git config core.ignoreStat true
BUT: changed files then have to be staged explicitly by the developer with git add; Git won't detect the changes itself.
Source: https://stackoverflow.com/questions/4994772/ways-to-improve-git-status-performance