Parallel Pip install

时光取名叫无心 2020-12-24 03:08

Our Django project is getting huge. We have hundreds of apps and use a ton of third-party Python packages, many of which need to have C code compiled. Our deployments are taking a long time.

6 Answers
  • 2020-12-24 03:43

    Building on Fatal's answer, the following code does parallel Pip download, then quickly installs the packages.

    First, we download packages in parallel into a distribution ("dist") directory. This is easily run in parallel with no conflicts. Each package name is printed before download, which helps with debugging. For extra help, change the -P9 to -P1 to download sequentially.

    After download, the next command tells Pip to install/update packages. Files are not downloaded, they're fetched from the fast local directory.

    It's compatible with Pip 1.7 (the current version at the time of writing) and also with Pip 1.5.

    To install only a subset of packages, replace the 'cat requirements.txt' statement with your custom command, e.g. 'egrep -v github requirements.txt'.

    cat requirements.txt | xargs -t -n1 -P9 pip install -q --download ./dist
    
    pip install --no-index --find-links=./dist -r ./requirements.txt
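
    Note that the --download flag was removed in later versions of Pip (8.0+); on modern Pip the same two-phase approach looks roughly like this (a sketch, not verified against every Pip version):

    ```shell
    # Phase 1: fetch all requirements in parallel (9 jobs) into ./dist.
    # -t prints each command before running it, which helps with debugging.
    cat requirements.txt | xargs -t -n1 -P9 pip download -q -d ./dist

    # Phase 2: install from the local directory only, never touching the network.
    pip install --no-index --find-links=./dist -r requirements.txt
    ```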
    
  • 2020-12-24 03:43

    Will it help if you have your build system (e.g. Jenkins) build and install everything into a build-specific virtual environment directory? When the build succeeds, you make the virtual environment relocatable, tarball it, and push the resulting tarball to your "released-tarballs" storage. At deploy time, you grab the latest tarball, unpack it on the destination host, and it's ready to execute. So if it takes 2 seconds to download the tarball and 0.5 seconds to unpack it on the destination host, your deployment will take 2.5 seconds.

    The advantage of this approach is that all package installations happen at build time, not at deploy time.

    Caveat: your build system worker that builds/compiles/installs things into a virtual env must use the same architecture as the target hardware. Also, your production box provisioning system will need to take care of the various C library dependencies that some Python packages may have (e.g. PIL requires libjpeg to be installed before it can compile JPEG-related code, and things will break if libjpeg is not installed on the target box).

    It works well for us.

    Making a virtual env relocatable:

    virtualenv --relocatable /build/output/dir/build-1123423
    

    In this example build-1123423 is a build-specific virtual env directory.
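
    The whole build-then-deploy cycle described above can be sketched as follows (the paths, the build number, and the manage.py entry point are all placeholders):

    ```shell
    # Build host: create the env, install everything, make it relocatable, pack it.
    virtualenv /build/output/dir/build-1123423
    /build/output/dir/build-1123423/bin/pip install -r requirements.txt
    virtualenv --relocatable /build/output/dir/build-1123423
    tar -C /build/output/dir -czf build-1123423.tar.gz build-1123423

    # Deploy host: unpack and run -- no pip and no compiler needed here.
    tar -C /srv/app -xzf build-1123423.tar.gz
    /srv/app/build-1123423/bin/python manage.py check
    ```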

  • 2020-12-24 03:46

    Parallel pip installation

    This example uses xargs to parallelize the build process by approximately 4x. You can increase the parallelization factor with max-procs below (keep it approximately equal to your number of cores).

    If you're trying to speed up, e.g., a machine-imaging process that you run over and over, it may be easier (and will certainly consume less bandwidth) to image directly from the result rather than repeat the installs each time, or to build your image using pip -t or virtualenv.

    Download and install packages in parallel, four at a time:

    xargs --max-args=1 --max-procs=4 sudo pip install < requires.txt
    

    Note: xargs has different parameter names on different Linux distributions. Check your distribution's man page for specifics.

    Same thing inlined using a here-doc:

    cat << EOF | xargs --max-args=1 --max-procs=4 sudo pip install
    awscli
    bottle
    paste
    boto
    wheel
    twine
    markdown
    python-slugify
    python-bcrypt
    arrow
    redis
    psutil
    requests
    requests-aws
    EOF
    

    Warning: there is a remote possibility that the speed of this method might confuse package manifests (depending on your distribution) if multiple pips try to install the same dependency at exactly the same time, but it's very unlikely if you're only doing 4 at a time. If it does happen, it's easily fixed with pip install --force-reinstall depname.

  • 2020-12-24 03:46

    I came across a similar issue and I ended up with the below:

    cat requirements.txt | sed -e '/^\s*#.*$/d' -e '/^\s*$/d' | xargs -n 1 python -m pip install
    

    That will read requirements.txt line by line and execute pip for each entry. I can't properly recall where I got the answer from, so apologies for that, but I found some justification below:

    1. How sed works: https://howto.lintel.in/truncate-empty-lines-using-sed/
    2. Another similar answer but with git: https://stackoverflow.com/a/46494462/7127519

    Hope this helps with alternatives. I posted this solution here https://stackoverflow.com/a/63534476/7127519, so maybe there is some help there.
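
    To see what the sed filter actually passes through to pip, you can run that stage alone on a small sample file (GNU sed assumed for the \s escapes; the file contents here are made up):

    ```shell
    printf '# a comment\n\nflask\n   \nrequests>=2\n' > /tmp/sample-reqs.txt

    # Drop comment lines and blank/whitespace-only lines.
    sed -e '/^\s*#.*$/d' -e '/^\s*$/d' /tmp/sample-reqs.txt
    # prints:
    # flask
    # requests>=2
    ```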

  • 2020-12-24 03:47

    Have you analyzed the deployment process to see where the time really goes? It surprises me that running multiple parallel pip processes does not speed it up much.

    If the time goes to querying PyPI and finding the packages (in particular when you also download from Github and other sources) then it may be beneficial to set up your own PyPI. You can host PyPI yourself and add the following to your requirements.txt file (docs):

    --extra-index-url YOUR_URL_HERE
    

    or the following if you wish to replace the official PyPI altogether:

    --index-url YOUR_URL_HERE
    

    This may speed up download times as all packages are now found on a nearby machine.
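
    If running a full PyPI mirror is more than you need, a lighter alternative is to pre-build wheels once and serve the directory over plain HTTP (the host name and port below are placeholders):

    ```shell
    # Build wheels for every requirement into ./wheelhouse (done once, on one box).
    pip wheel -r requirements.txt -w ./wheelhouse

    # Serve the directory on the build network (Python 3.7+ for --directory).
    python -m http.server 8080 --directory ./wheelhouse &

    # On the deploy hosts, point pip at it and skip PyPI entirely.
    pip install --no-index --find-links=http://buildhost:8080/ -r requirements.txt
    ```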

    A lot of time also goes into compiling packages with C code, such as PIL. If this turns out to be the bottleneck then it's worth looking into compiling code in multiple processes. You may even be able to share compiled binaries between your machines (but many things would need to be similar, such as operating system, CPU word length, et cetera).
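
    For the compilation bottleneck specifically, one partial trick: packages whose source build is driven by make or cmake will honor MAKEFLAGS, so exporting a job count can parallelize the compile step (this is an assumption about the package's build system; plain setuptools builds ignore it):

    ```shell
    # Let make-based extension builds use all available cores.
    export MAKEFLAGS="-j$(nproc)"

    # Force a source build so compilation actually happens.
    pip install --no-binary :all: Pillow
    ```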

  • 2020-12-24 03:49

    Inspired by Jamieson Becker's answer, I modified an install script to do parallel pip installs and it seems like an improvement. My bash script now contains a snippet like this:

    # Use a bash array so version specifiers like 'tornado>=4' stay
    # quoted correctly and never trigger shell redirection.
    requirements=(
        numpy
        scipy
        Pillow
        feedgenerator
        jinja2
        docutils
        argparse
        pygments
        Typogrify
        Markdown
        jsonschema
        pyzmq
        terminado
        pandas
        spyder
        matplotlib
        statlab
        'ipython[all]>=3'
        ipdb
        'tornado>=4'
        simplepam
        sqlalchemy
        requests
        Flask
        autopep8
        python-dateutil
        pylibmc
        newrelic
        markdown
        elasticsearch
        'docker-py==1.1.0'
        'pycurl==7.19.5'
        'futures==2.2.0'
        'pytz==2014.7'
    )

    echo "requirements=${requirements[*]}"
    # Launch one pip per package in the background, logging to /tmp.
    for i in "${requirements[@]}"; do pip install "$i" > "/tmp/$i.out" 2>&1 & done


    I can at least look for problems manually.
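
    The manual log check can be semi-automated: since each install writes to /tmp/<package>.out, grepping those files after the jobs finish surfaces the failures (a sketch, assuming the naming scheme above):

    ```shell
    # Block until the backgrounded pip installs have finished...
    wait

    # ...then list any logs that mention an error.
    grep -l -i 'error' /tmp/*.out || echo "no errors found in install logs"
    ```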
