How to speed up / parallelize downloads of git submodules using git clone --recursive?

Posted by ぃ、小莉子 on 2019-11-27 22:10:35
Anthon

When I run your command, it takes 338 seconds of wall-clock time to download the 68 MB.

With the following Python program, which requires GNU parallel to be installed,

#! /usr/bin/env python
# coding: utf-8

from __future__ import print_function

import os
import subprocess

jobs=16

modules_file = '.gitmodules'

packages = []

if not os.path.exists('Whonix/' + modules_file):
    subprocess.call(['git', 'clone', 'https://github.com/Whonix/Whonix'])

os.chdir('Whonix')

# get list of packages from .gitmodules file
with open(modules_file) as ifp:
    for line in ifp:
        if not line.startswith('[submodule '):
            continue
        package = line.split(' "', 1)[1].split('"', 1)[0]
        #print(package)
        packages.append(package)

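def doit():
    # feed the package names to GNU parallel on stdin, one per line
    p = subprocess.Popen(['parallel', '-N1', '-j{0}'.format(jobs),
                          'git', 'submodule', 'update', '--init',
                          ':::'],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE,
                         # text mode, so communicate() accepts and returns
                         # str on both Python 2 and Python 3
                         universal_newlines=True)
    res = p.communicate('\n'.join(packages))
    print(res[0])
    if res[1]:
        print("error", res[1])
    # parallel's exit status is non-zero if any job failed
    print('parallel exit value', p.returncode)
    return p.returncode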

# sometimes one of the updates interferes with the others and generates lock
# errors, so we retry
for x in range(10):
    if doit() == 0:
        print('zero exit from git after {0} times'.format(x+1))
        break
else:
    print('could not get a zero exit from git after {0} times'.format(
          x+1))

that time is reduced to 45 seconds (on the same system; I did not do multiple runs to average out fluctuations).
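For readers who do not have GNU parallel, the same idea can be sketched with the Python 3 standard library alone. This is a rough sketch and not the program timed above: the function names are made up for illustration, it reads the `path = ...` entries from `.gitmodules` (an assumption, since the program above uses the submodule names instead), and it retries each submodule separately instead of rerunning the whole batch.

```python
import re
import subprocess
from concurrent.futures import ThreadPoolExecutor


def parse_submodule_paths(gitmodules_text):
    """Return the 'path = ...' values found in the text of a .gitmodules file."""
    paths = []
    for line in gitmodules_text.splitlines():
        match = re.match(r'\s*path\s*=\s*(.+?)\s*$', line)
        if match:
            paths.append(match.group(1))
    return paths


def update_one(path, retries=10, run=subprocess.call):
    """Run 'git submodule update --init -- <path>' until it exits with 0.

    Returns the number of attempts needed, or None if every attempt failed.
    'run' is injectable (hypothetical hook) so the retry logic can be tested
    without touching the network.
    """
    for attempt in range(1, retries + 1):
        if run(['git', 'submodule', 'update', '--init', '--', path]) == 0:
            return attempt
    return None


def update_all(paths, jobs=16, retries=10, run=subprocess.call):
    """Update all submodules with up to 'jobs' git processes at a time."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        attempts = list(pool.map(lambda p: update_one(p, retries, run), paths))
    return dict(zip(paths, attempts))
```

It would be called from inside the cloned superproject, for example `update_all(parse_submodule_paths(open('.gitmodules').read()))`. Whether the per-submodule retry is enough to get past the lock errors mentioned above has not been measured here.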

To check if things were OK, I "compared" the checked out files with:

find Whonix -name ".git" -prune -o -type f -print0 | xargs -0 md5sum > /tmp/md5.sum

in the one directory and

md5sum -c /tmp/md5.sum

in the other directory and vice versa.
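The same comparison can be sketched in Python when `md5sum` is not at hand. This is only an approximation of the `find`/`md5sum` pipeline above: `tree_checksums` and `differing_files` are illustrative names, and the sketch reads each file fully into memory, which is fine for a checkout of this size.

```python
import hashlib
import os


def tree_checksums(root):
    """Map each regular file under root (relative path) to its MD5 hex digest.

    Anything named '.git' is skipped, whether it is a directory (top-level
    repository) or a plain file (as found inside submodule checkouts).
    """
    sums = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # prune .git directories so os.walk never descends into them
        dirnames[:] = [d for d in dirnames if d != '.git']
        for name in filenames:
            full = os.path.join(dirpath, name)
            # like 'find -type f': ignore symlinks and the .git pointer file
            if name == '.git' or os.path.islink(full):
                continue
            with open(full, 'rb') as fh:
                sums[os.path.relpath(full, root)] = hashlib.md5(fh.read()).hexdigest()
    return sums


def differing_files(root_a, root_b):
    """Return sorted relative paths that are missing on one side or differ."""
    a, b = tree_checksums(root_a), tree_checksums(root_b)
    return sorted(p for p in set(a) | set(b) if a.get(p) != b.get(p))
```

An empty list from `differing_files('serial/Whonix', 'parallel/Whonix')` (hypothetical directory names) would correspond to both `md5sum -c` runs reporting no mismatches.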

With git 2.8 (Q1 2016), you will be able to initiate the fetch of submodules... in parallel!

See commit fbf7164 (16 Dec 2015) by Jonathan Nieder (artagnon).
See commit 62104ba, commit fe85ee6, commit c553c72, commit bfb6b53, commit b4e04fb, commit 1079c4b (16 Dec 2015) by Stefan Beller (stefanbeller).
(Merged by Junio C Hamano -- gitster -- in commit 187c0d3, 12 Jan 2016)

Add a framework to spawn a group of processes in parallel, and use it to run "git fetch --recurse-submodules" in parallel.

For that, git fetch has the new option:

-j, --jobs=<n>

Number of parallel children to be used for fetching submodules.
Each will fetch from different submodules, such that fetching many submodules will be faster.
By default submodules will be fetched one at a time.

Example:

git fetch --recurse-submodules -j2
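If the fetch is driven from a script, it may be worth checking the installed git version first, since the option only exists from git 2.8 on. The following is a minimal sketch under that assumption; the helper names are invented for illustration, and the fallback simply drops the `--jobs` option.

```python
import re
import subprocess


def parse_git_version(version_output):
    """Extract (major, minor) from the output of 'git --version'."""
    match = re.search(r'(\d+)\.(\d+)', version_output)
    if match is None:
        raise ValueError('unrecognised version string: %r' % version_output)
    return int(match.group(1)), int(match.group(2))


def supports_parallel_fetch(version):
    """True if this (major, minor) git version has 'git fetch --jobs'."""
    return version >= (2, 8)


def fetch_command(jobs):
    """Build the argument list for a parallel submodule fetch."""
    if jobs < 1:
        raise ValueError('jobs must be at least 1')
    return ['git', 'fetch', '--recurse-submodules', '--jobs={0}'.format(jobs)]


def fetch_submodules(repo_dir, jobs=2):
    """Fetch in repo_dir, in parallel when the installed git allows it."""
    version_output = subprocess.check_output(['git', '--version']).decode()
    if supports_parallel_fetch(parse_git_version(version_output)):
        cmd = fetch_command(jobs)
    else:
        # older git: fetch the submodules one at a time
        cmd = ['git', 'fetch', '--recurse-submodules']
    return subprocess.call(cmd, cwd=repo_dir)
```

`fetch_submodules('Whonix', jobs=2)` would then run the equivalent of the example command above inside the `Whonix` checkout.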

The bulk of this new feature is in commit c553c72 (16 Dec 2015) by Stefan Beller (stefanbeller).

run-command: add an asynchronous parallel child processor

This allows running external commands in parallel with ordered output on stderr.

If we run external commands in parallel we cannot pipe the output directly to our stdout/err as it would mix up. So each process's output will flow through a pipe, which we buffer. One subprocess can be directly piped to our stdout/err for low-latency feedback to the user.
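The buffering idea in that commit message can be illustrated with a small Python sketch. This is not git's implementation (which is written in C); it only shows the principle of capturing each child's output through its own pipe and emitting the buffers one after another, so lines from different children never interleave. The function names are hypothetical, and the low-latency shortcut of piping one child straight through is left out.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def _run_buffered(cmd):
    """Run one command, capturing stdout and stderr together in a buffer."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            universal_newlines=True)
    text, _ = proc.communicate()
    return proc.returncode, text


def run_parallel_ordered(commands, jobs=4, out=sys.stdout):
    """Run commands with up to 'jobs' children at once.

    Each child's complete output is written to 'out' in the order the
    commands were given, regardless of which child finishes first.
    Returns the list of exit codes in the same order.
    """
    codes = []
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # pool.map yields results in submission order, not completion order
        for code, text in pool.map(_run_buffered, commands):
            out.write(text)
            codes.append(code)
    return codes
```

The price of this simple version is latency: nothing is shown for a command until it has finished, which is the drawback the commit addresses by letting one child write directly to the terminal.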
