Is repacking a repository useful for large binaries?

Submitted by 梦想的初衷 on 2019-12-08 05:36:12

Question


I'm trying to convert a large history from Perforce to Git, and one folder (now git branch) contains a significant number of large binary files. My problem is that I'm running out of memory while running git gc --aggressive.

My primary question here is whether repacking the repository is likely to have any meaningful effect on large binaries. Compressing them another 20% would be great. 0.2% isn't worth my effort. If not, I'll have them skipped over as suggested here.

For background, I successfully used git p4 to create the repository in a state I'm happy with, but this uses git fast-import behind the scenes so I want to optimize the repository before making it official, and indeed making any commits automatically triggered a slow gc --auto. It's currently ~35GB in a bare state.

The binaries in question seem to be, conceptually, the vendor firmware used in embedded devices. I think there are approximately 25 in the 400-700MB range and maybe a couple hundred more in the 20-50MB range. They might be disk images, but I'm unsure of that. There's a variety of versions and file types over time, and I see .zip, tgz, and .simg files frequently. As such, I'd expect the raw code to have significant overlap, but I'm not sure how similar the actual files appear at this point, as I believe these formats have already been compressed, right?

These binaries are contained in one (old) branch that will be used extremely rarely (to the point that questioning version control at all is valid, but that's out of scope). Certainly the performance of that branch does not need to be great, but I'd like the rest of the repository to be reasonable.

Other suggestions for optimal packing or memory management are welcome. I admit I don't really understand the various git options being discussed on the linked question. Nor do I really understand what the --window and --depth flags are doing in git repack. But the primary question is whether the repacking of the binaries themselves is doing anything meaningful.


Answer 1:


My primary question here is whether repacking the repository is likely to have any meaningful effect on large binaries.

That depends on their contents. For the files you've outlined specifically:

I see .zip, tgz, and .simg files frequently.

Zipfiles and tgz (gzipped tar archive) files are already compressed and have terrible (i.e., high) Shannon entropy values—terrible for Git that is—and will not compress against each other. The .simg files are probably (I have to guess here) Singularity disk image files; whether and how they are compressed, I don't know, but I would assume they are. (An easy test is to feed one to a compressor, e.g., gzip, and see if it shrinks.)
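That quick compressibility test can be scripted. A minimal sketch follows; the file here is fabricated from random bytes purely so the snippet runs anywhere, so substitute one of your real .simg/.zip/.tgz files:

```shell
# Fabricate a stand-in for an already-compressed firmware image.
head -c 1000000 /dev/urandom > sample.simg
before=$(wc -c < sample.simg)
after=$(gzip -c sample.simg | wc -c)
echo "before=$before after=$after"
# Already-compressed (high-entropy) data barely shrinks, so "after" stays near
# (or above) "before"; raw firmware that shrank a lot here would also pack well in Git.
```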

As such, I'd expect the raw code to have significant overlap, but I'm not sure how similar the actual files appear at this point, as I believe these formats have already been compressed, right?

Precisely. Storing them uncompressed in Git would thus, paradoxically, result in far greater compression in the end. (But the packing could require significant amounts of memory.)

If [this is probably futile], I'll have them skipped over as suggested here.

That would be my first impulse here. :-)

I admit I don't really understand the various git options being discussed on the linked question. Nor do I really understand what the --window and --depth flags are doing in git repack.

The various limits are confusing (and profuse). It's also important to realize that they live in .git/config, which is not a committed file, so new clones won't pick them up. The .gitattributes file, by contrast, is copied on clone, so new clones will continue to avoid packing unpackable files; that makes it the better approach here.
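A sketch of that .gitattributes approach; the patterns are assumptions based on the extensions mentioned in the question, and `-delta` tells Git not to attempt delta compression for matching paths:

```shell
# Because .gitattributes is committed, every clone inherits these rules.
cat >> .gitattributes <<'EOF'
*.zip  -delta
*.tgz  -delta
*.simg -delta
EOF
cat .gitattributes
```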

(If you care to dive into the details, you will find some in the Git technical documentation. It does not discuss precisely what the window sizes are about, but they control how much memory Git uses to memory-map object data when selecting objects that might compress well against each other. There are two: one for each individual mmap of one pack file, and one for the total aggregate mmap across all pack files.

Not mentioned at your link is core.deltaBaseCacheLimit, which caps the memory used to hold delta bases; to understand this you need to grok delta compression and delta chains,1 and read that same technical documentation. Note also that Git will, by default, not attempt to pack any file object whose size exceeds core.bigFileThreshold.

The various pack.* controls are a bit more complex: packing is multi-threaded to take advantage of all your CPUs when possible, and each thread can use a lot of memory. Limiting the number of threads limits total memory use: if one thread will use 256 MB, 8 threads are likely to use 8 × 256 = 2048 MB, i.e., 2 GB. The bitmaps mainly speed up fetching from busy servers.)
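For concreteness, here is how some of those knobs are set per repository. The values are illustrative guesses for a memory-constrained machine, not recommendations, and the throwaway repo exists only so the commands run anywhere:

```shell
git init -q limits-demo && cd limits-demo   # stand-in for your converted repo
git config pack.threads 2                   # fewer threads => less aggregate memory
git config pack.windowMemory 256m           # per-thread cap on delta-search window memory
git config core.deltaBaseCacheLimit 128m    # cap on memory holding delta bases
git config core.bigFileThreshold 50m        # blobs above this are never deltified
git config pack.windowMemory                # reads the value back: 256m
cd ..
```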


1They're not that complicated: a delta chain occurs when one object says "take object XYZ and apply these changes", but object XYZ itself says "take object PreXYZ and apply these changes". Object PreXYZ can also take another object, and so on. The delta base is the object at the bottom of this list.




Answer 2:


Other suggestions for optimal packing or memory management are welcome.

Git 2.20 (Q4 2018) will have one: when there are too many packfiles in a repository (which is not recommended), looking up an object requires consulting many pack .idx files; a new mechanism consolidates all of these .idx files into a single file.

See commit 6a22d52, commit e9ab2ed, commit 454ea2e, commit 0bff526, commit 29e2016, commit fe86c3b, commit c39b02a, commit 2cf489a, commit 6d68e6a (20 Aug 2018), commit ceab693 (12 Jul 2018) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 49f210f, 17 Sep 2018)

pack-objects: consider packs in multi-pack-index

When running 'git pack-objects --local', we want to avoid packing objects that are in an alternate.
Currently, we check for these objects using the packed_git_mru list, which excludes the pack-files covered by a multi-pack-index.

There is a new setting:

core.multiPackIndex::

Use the multi-pack-index file to track multiple packfiles using a single index.

And that multi-pack index is explained here and in Documentation/technical/multi-pack-index.txt:

Multi-Pack-Index (MIDX) Design Notes

The Git object directory contains a 'pack' directory containing:

  • packfiles (with suffix ".pack") and
  • pack-indexes (with suffix ".idx").

The pack-indexes provide a way to look up objects and navigate to their offsets within the pack, but they must come in pairs with the packfiles.
This pairing depends on the file names: a pack-index differs from its pack-file only in suffix.

While the pack-indexes provide fast lookup per packfile, this performance degrades as the number of packfiles increases, because abbreviations need to inspect every packfile and we are more likely to have a miss on our most-recently-used packfile.

For some large repositories, repacking into a single packfile is not feasible due to storage space or excessive repack times.

The multi-pack-index (MIDX for short) stores a list of objects and their offsets into multiple packfiles.
It contains:

  • A list of packfile names.
  • A sorted list of object IDs.
  • A list of metadata for the ith object ID including:
    • A value j referring to the jth packfile.
    • An offset within the jth packfile for the object.
  • If large offsets are required, we use another list of large offsets similar to version 2 pack-indexes.

Thus, we can provide O(log N) lookup time for any number of packfiles.
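Building and checking a multi-pack-index can be sketched as follows (Git 2.20 or later). The demo repository and inline identity are fabricated so the commands are runnable as-is:

```shell
git init -q midx-demo && cd midx-demo
git -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m init            # identity set inline for portability
git repack -q -d                               # ensure at least one packfile exists
git config core.multiPackIndex true
git multi-pack-index write                     # writes .git/objects/pack/multi-pack-index
git multi-pack-index verify                    # checks the MIDX against the packs
cd ..
```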


Git 2.23 (Q3 2019) adds two commands, with "git multi-pack-index" learning the expire and repack subcommands.

See commit 3612c23 (01 Jul 2019), and commit b526d8c, commit 10bfa3f, commit d274331, commit ce1e4a1, commit 2af890b, commit 19575c7, commit d01bf2e, commit dba6175, commit cff9711, commit 81efa16, commit 8434e85 (10 Jun 2019) by Derrick Stolee (derrickstolee).
Helped-by: Johannes Schindelin (dscho).
(Merged by Junio C Hamano -- gitster -- in commit 4308d81, 19 Jul 2019)

multi-pack-index: prepare for/implement 'expire' subcommand

The multi-pack-index tracks objects in a collection of pack-files.
Only one copy of each object is indexed, using the modified time of the pack-files to determine tie-breakers.
It is possible to have a pack-file with no referenced objects because all objects have a duplicate in a newer pack-file.

Introduce a new 'expire' subcommand to the multi-pack-index builtin.
This subcommand will delete these unused pack-files and rewrite the multi-pack-index to no longer refer to those files.

The 'git multi-pack-index expire' subcommand:

  • looks at the existing multi-pack-index,
  • counts the number of objects referenced in each pack-file,
  • deletes the pack-files with no referenced objects, and
  • rewrites the multi-pack-index to no longer reference those packs.

Documentation:

expire:

Delete the pack-files that are tracked by the MIDX file, but have no objects referenced by the MIDX. Rewrite the MIDX file afterward to remove all references to these pack-files.

And:

multi-pack-index: prepare/implement 'repack' subcommand

In an environment where the multi-pack-index is useful, it is due to many pack-files and an inability to repack the object store into a single pack-file. However, it is likely that many of these pack-files are rather small, and could be repacked into a slightly larger pack-file without too much effort.
It may also be important to ensure the object store is highly available and the repack operation does not interrupt concurrent git commands.

Introduce a 'repack' subcommand to 'git multi-pack-index' that takes a '--batch-size' option.

The subcommand will inspect the multi-pack-index for referenced pack-files whose size is smaller than the batch size, until collecting a list of pack-files whose sizes sum to larger than the batch size.
Then, a new pack-file will be created containing the objects from those pack-files that are referenced by the multi-pack-index.

The resulting pack is likely to actually be smaller than the batch size due to compression and the fact that there may be objects in the pack-files that have duplicate copies in other pack-files.

The 'git multi-pack-index repack' command can take a batch size of zero, which creates a new pack-file containing all objects in the multi-pack-index.

Using a batch size of zero is very similar to a standard 'git repack' command, except that we do not delete the old packs and instead rely on the new multi-pack-index to prevent new processes from reading the old packs.
This does not disrupt other Git processes that are currently reading the old packs based on the old multi-pack-index.

The first 'repack' command will create one new pack-file, and an 'expire' command after that will delete the old pack-files, as they no longer contain any referenced objects in the multi-pack-index.
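That repack-then-expire sequence can be sketched end to end (Git 2.23 or later). Everything below is a fabricated demo repository, not your conversion; each loop iteration deliberately leaves a small extra pack so that expire has something to delete:

```shell
git init -q expire-demo && cd expire-demo
for i in 1 2 3; do
  echo "firmware rev $i" > fw.bin
  git add fw.bin
  git -c user.name=t -c user.email=t@example.com commit -q -m "rev $i"
  git repack -q -d                     # incremental: one small pack per iteration
done
git config core.multiPackIndex true
git multi-pack-index write
git multi-pack-index repack --batch-size=0   # one new pack with every MIDX object
git multi-pack-index expire                  # old packs now hold no referenced objects
ls .git/objects/pack/*.pack
cd ..
```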

Documentation:

repack:

Create a new pack-file containing objects in small pack-files referenced by the multi-pack-index.
If the size given by the --batch-size=<size> argument is zero, then create a pack containing all objects referenced by the multi-pack-index.

For a non-zero batch size:

  • examine packs from oldest to newest, and
  • compute each pack's "expected size" by counting the number of objects in the pack that the multi-pack-index references, dividing by the total number of objects in the pack, and multiplying by the pack size.

We select packs with expected size below the batch size until the set of packs has a total expected size of at least the batch size.

  • If the total size does not reach the batch size, then do nothing.
  • If a new pack-file is created, rewrite the multi-pack-index to reference the new pack-file.
    A later run of 'git multi-pack-index expire' will delete the pack-files that were part of this batch.
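A worked example of that expected-size computation, with made-up numbers:

```shell
pack_size=1000000       # pack's on-disk size in bytes
total_objects=500       # all objects stored in the pack
midx_objects=100        # objects in the pack that the multi-pack-index references
# expected size = (midx_objects / total_objects) * pack_size,
# computed multiply-first to stay in integer arithmetic:
expected=$((pack_size * midx_objects / total_objects))
echo "expected=$expected"               # expected=200000
batch_size=300000
if [ "$expected" -lt "$batch_size" ]; then
  echo "pack joins the repack batch"
fi
```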

With Git 2.25 (Q1 2020), the code to generate multi-pack index learned to show (or not to show) progress indicators.

That can be useful for large binaries.

See commit 680cba2, commit 64d80e7, commit ad60096, commit 8dc18f8, commit 840cef0, commit efbc3ae (21 Oct 2019) by William Baker (wjbaker101).
(Merged by Junio C Hamano -- gitster -- in commit 8f1119b, 10 Nov 2019)

multi-pack-index: add [--[no-]progress] option.

Signed-off-by: William Baker

Add the --[no-]progress option to git multi-pack-index.
Pass the MIDX_PROGRESS flag to the subcommand functions when progress should be displayed by multi-pack-index.

The progress feature was added to 'verify' in 144d703 ("multi-pack-index: report progress during 'verify'", 2018-09-13, Git v2.20.0-rc0 -- merge listed in batch #3) but some subcommands were not updated to display progress, and the ability to opt-out was overlooked.



Source: https://stackoverflow.com/questions/50996930/is-repacking-a-repository-useful-for-large-binaries
