Parallel Pip install

时光取名叫无心 2020-12-24 03:08

Our Django project is getting huge. We have hundreds of apps and use a ton of third-party Python packages, many of which need to have C code compiled. Our deployments are taking a long time.

6 Answers
  • 2020-12-24 03:43

    Building on Fatal's answer, the following code does parallel Pip download, then quickly installs the packages.

    First, we download packages in parallel into a distribution ("dist") directory. This is easily run in parallel with no conflicts. Each package name is printed before download, which helps with debugging. For extra help, change the -P9 to -P1 to download sequentially.

    After download, the next command tells Pip to install/update packages. Files are not downloaded, they're fetched from the fast local directory.

    It's compatible with Pip 1.7 (the current version at the time of writing) and also with Pip 1.5.

    To install only a subset of packages, replace the 'cat requirements.txt' statement with your custom command, e.g. 'egrep -v github requirements.txt'.

    cat requirements.txt | xargs -t -n1 -P9 pip install -q --download ./dist
    
    pip install --no-index --find-links=./dist -r ./requirements.txt
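
    Note that the --download flag was removed in later versions of Pip (8.0+); on modern Pip the same two-phase approach looks roughly like this (a sketch, not verified against every Pip version):

    ```shell
    # Phase 1: fetch all requirements in parallel (9 jobs) into ./dist.
    # -t prints each command before running it, which helps with debugging.
    cat requirements.txt | xargs -t -n1 -P9 pip download -q -d ./dist

    # Phase 2: install from the local directory only, never touching the network.
    pip install --no-index --find-links=./dist -r requirements.txt
    ```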
    
  • 2020-12-24 03:43

    Will it help if you have your build system (e.g. Jenkins) build and install everything into a build-specific virtual environment directory? When the build succeeds, you make the virtual environment relocatable, tarball it, and push the resulting tarball to your "released-tarballs" storage. At deploy time, you grab the latest tarball, unpack it on the destination host, and it's ready to execute. So if it takes 2 seconds to download the tarball and 0.5 seconds to unpack it on the destination host, your deployment will take 2.5 seconds.

    The advantage of this approach is that all package installations happen at build time, not at deploy time.

    Caveat: your build system worker that builds/compiles/installs things into a virtual env must use the same architecture as the target hardware. Also, your production box provisioning system will need to take care of the various C library dependencies that some Python packages may have (e.g. PIL requires libjpeg to be installed before it can compile JPEG-related code, and things will break if libjpeg is not installed on the target box).

    It works well for us.

    Making a virtual env relocatable:

    virtualenv --relocatable /build/output/dir/build-1123423
    

    In this example build-1123423 is a build-specific virtual env directory.
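
    The whole build-then-deploy cycle described above can be sketched as follows (the paths, the build number, and the manage.py entry point are all placeholders):

    ```shell
    # Build host: create the env, install everything, make it relocatable, pack it.
    virtualenv /build/output/dir/build-1123423
    /build/output/dir/build-1123423/bin/pip install -r requirements.txt
    virtualenv --relocatable /build/output/dir/build-1123423
    tar -C /build/output/dir -czf build-1123423.tar.gz build-1123423

    # Deploy host: unpack and run -- no pip and no compiler needed here.
    tar -C /srv/app -xzf build-1123423.tar.gz
    /srv/app/build-1123423/bin/python manage.py check
    ```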

  • 2020-12-24 03:46

    Parallel pip installation

    This example uses xargs to parallelize the build process by approximately 4x. You can increase the parallelization factor with max-procs below (keep it approximately equal to your number of cores).

    If you're trying to speed up, e.g., a machine-imaging process that you run over and over, it may be easier (and will certainly consume less bandwidth) to image directly from the result rather than repeat the installs each time, or to build your image using pip -t or virtualenv.

    Download and install packages in parallel, four at a time:

    xargs --max-args=1 --max-procs=4 sudo pip install < requires.txt
    

    Note: xargs has different parameter names on different Linux distributions. Check your distribution's man page for specifics.

    Same thing inlined using a here-doc:

    cat << EOF | xargs --max-args=1 --max-procs=4 sudo pip install
    awscli
    bottle
    paste
    boto
    wheel
    twine
    markdown
    python-slugify
    python-bcrypt
    arrow
    redis
    psutil
    requests
    requests-aws
    EOF
    

    Warning: there is a remote possibility that the speed of this method might confuse package manifests (depending on your distribution) if multiple pips try to install the same dependency at exactly the same time, but it's very unlikely if you're only doing 4 at a time. If it does happen, it's easily fixed with pip install --force-reinstall depname.

  • 2020-12-24 03:46

    I came across a similar issue and I ended up with the below:

    cat requirements.txt | sed -e '/^\s*#.*$/d' -e '/^\s*$/d' | xargs -n 1 python -m pip install
    

    That will read requirements.txt line by line and execute pip for each entry. I can't properly recall where I got the answer from, so apologies for that, but I found some justification below:

    1. How sed works: https://howto.lintel.in/truncate-empty-lines-using-sed/
    2. Another similar answer but with git: https://stackoverflow.com/a/46494462/7127519

    Hope this helps with alternatives. I posted this solution here https://stackoverflow.com/a/63534476/7127519, so maybe there is some help there.
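
    To see what the sed filter actually passes through to pip, you can run that stage alone on a small sample file (GNU sed assumed for the \s escapes; the file contents here are made up):

    ```shell
    printf '# a comment\n\nflask\n   \nrequests>=2\n' > /tmp/sample-reqs.txt

    # Drop comment lines and blank/whitespace-only lines.
    sed -e '/^\s*#.*$/d' -e '/^\s*$/d' /tmp/sample-reqs.txt
    # prints:
    # flask
    # requests>=2
    ```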

  • 2020-12-24 03:47

    Have you analyzed the deployment process to see where the time really goes? It surprises me that running multiple parallel pip processes does not speed it up much.

    If the time goes to querying PyPI and finding the packages (in particular when you also download from Github and other sources) then it may be beneficial to set up your own PyPI. You can host PyPI yourself and add the following to your requirements.txt file (docs):

    --extra-index-url YOUR_URL_HERE
    

    or the following if you wish to replace the official PyPI altogether:

    --index-url YOUR_URL_HERE
    

    This may speed up download times as all packages are now found on a nearby machine.
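
    If running a full PyPI mirror is more than you need, a lighter alternative is to pre-build wheels once and serve the directory over plain HTTP (the host name and port below are placeholders):

    ```shell
    # Build wheels for every requirement into ./wheelhouse (done once, on one box).
    pip wheel -r requirements.txt -w ./wheelhouse

    # Serve the directory on the build network (Python 3.7+ for --directory).
    python -m http.server 8080 --directory ./wheelhouse &

    # On the deploy hosts, point pip at it and skip PyPI entirely.
    pip install --no-index --find-links=http://buildhost:8080/ -r requirements.txt
    ```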

    A lot of time also goes into compiling packages with C code, such as PIL. If this turns out to be the bottleneck then it's worth looking into compiling code in multiple processes. You may even be able to share compiled binaries between your machines (but many things would need to be similar, such as operating system, CPU word length, et cetera).
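
    For the compilation bottleneck specifically, one partial trick: packages whose source build is driven by make or cmake will honor MAKEFLAGS, so exporting a job count can parallelize the compile step (this is an assumption about the package's build system; plain setuptools builds ignore it):

    ```shell
    # Let make-based extension builds use all available cores.
    export MAKEFLAGS="-j$(nproc)"

    # Force a source build so compilation actually happens.
    pip install --no-binary :all: Pillow
    ```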

  • 2020-12-24 03:49

    Inspired by Jamieson Becker's answer, I modified an install script to do parallel pip installs and it seems like an improvement. My bash script now contains a snippet like this:

    # Use a bash array so version specifiers like 'tornado>=4' stay
    # quoted correctly and never trigger shell redirection.
    requirements=(
        numpy
        scipy
        Pillow
        feedgenerator
        jinja2
        docutils
        argparse
        pygments
        Typogrify
        Markdown
        jsonschema
        pyzmq
        terminado
        pandas
        spyder
        matplotlib
        statlab
        'ipython[all]>=3'
        ipdb
        'tornado>=4'
        simplepam
        sqlalchemy
        requests
        Flask
        autopep8
        python-dateutil
        pylibmc
        newrelic
        markdown
        elasticsearch
        'docker-py==1.1.0'
        'pycurl==7.19.5'
        'futures==2.2.0'
        'pytz==2014.7'
    )

    echo "requirements=${requirements[*]}"
    # Launch one pip per package in the background, logging to /tmp.
    for i in "${requirements[@]}"; do pip install "$i" > "/tmp/$i.out" 2>&1 & done


    I can at least look for problems manually.
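
    The manual log check can be semi-automated: since each install writes to /tmp/<package>.out, grepping those files after the jobs finish surfaces the failures (a sketch, assuming the naming scheme above):

    ```shell
    # Block until the backgrounded pip installs have finished...
    wait

    # ...then list any logs that mention an error.
    grep -l -i 'error' /tmp/*.out || echo "no errors found in install logs"
    ```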
