Parallel download using Curl command line utility

抹茶落季 2020-12-13 19:08

I want to download some pages from a website, and I did it successfully using curl, but I was wondering whether curl can somehow download multiple pages at a time.

8 Answers
  • 2020-12-13 19:25

    My answer is a bit late, but I believe all of the existing answers fall just a little short. The way I do things like this is with xargs, which is capable of running a specified number of commands in subprocesses.

    The one-liner I would use is, simply:

    $ seq 1 10 | xargs -n1 -P2 bash -c 'i=$0; url="http://example.com/?page${i}.html"; curl -O -s "$url"'
    

    This warrants some explanation. The use of -n 1 instructs xargs to process a single input argument at a time. In this example, the numbers 1 ... 10 are each processed separately. And -P 2 tells xargs to keep 2 subprocesses running all the time, each one handling a single argument, until all of the input arguments have been processed.

    You can think of this as MapReduce in the shell. Or perhaps just the Map phase. Regardless, it's an effective way to get a lot of work done while ensuring that you don't fork bomb your machine. It's possible to do something similar with a for loop in a shell, but you end up doing the process management yourself, which starts to seem pretty pointless once you realize how insanely great this use of xargs is.

    Update: I suspect that my example with xargs could be improved (at least on Mac OS X and BSD with the -J flag). With GNU Parallel, the command is a bit less unwieldy as well:

    parallel --jobs 2 curl -O -s 'http://example.com/?page{}.html' ::: {1..10}
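
    A rough equivalent of the xargs approach that skips the bash -c wrapper is the -I substitution flag (available in both GNU and BSD xargs); with the same placeholder URL, something like this should do it:

    seq 1 10 | xargs -I{} -P2 curl -O -s "http://example.com/?page{}.html"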
    
  • 2020-12-13 19:25

    Running a limited number of processes is easy if your system has commands like pidof or pgrep which, given a process name, return the PIDs (counting the PIDs tells you how many are running).

    Something like this:

    #!/bin/sh
    max=4
    # count how many curl processes are currently running
    running_curl() {
        set -- $(pidof curl)
        echo $#
    }
    while [ $# -gt 0 ]; do
        # throttle: wait until we are below the limit before starting another download
        while [ $(running_curl) -ge $max ] ; do
            sleep 1
        done
        curl "$1" --create-dirs -o "${1##*://}" &
        shift
    done
    

    Call it like this:

    script.sh $(for i in `seq 1 10`; do printf "http://example/%s.html " "$i"; done)
    

    The curl line of the script is untested.
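
    If pgrep is available instead of pidof, the counting helper could be written along these lines (equally untested):

    running_curl() {
        pgrep -x curl | wc -l
    }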

  • 2020-12-13 19:30

    Curl can also accelerate the download of a single file by splitting it into parts:

    $ man curl |grep -A2 '\--range'
           -r/--range <range>
                  (HTTP/FTP/SFTP/FILE)  Retrieve a byte range (i.e a partial docu-
                  ment) from a HTTP/1.1, FTP or  SFTP  server  or  a  local  FILE.
    

    Here is a script that will automatically launch curl with the desired number of concurrent processes: https://github.com/axelabs/splitcurl
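
    A rough hand-rolled sketch of the same idea (the URL and the byte offsets below are placeholders and have to match the real file size) would be:

    # fetch two halves of the file in parallel, then stitch them back together
    curl -s -r 0-499999 -o file.part1 "http://example.com/file.bin" &
    curl -s -r 500000-  -o file.part2 "http://example.com/file.bin" &
    wait
    cat file.part1 file.part2 > file.bin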

  • 2020-12-13 19:33

    Well, curl is just a simple UNIX process. You can have as many of these curl processes running in parallel and sending their outputs to different files.

    curl can use the filename part of the URL to generate the local file. Just use the -O option (man curl for details).

    You could use something like the following:

    urls="http://example.com/?page1.html http://example.com/?page2.html" # add more URLs here
    
    for url in $urls; do
       # run the curl job in the background so we can start another job
       # and disable the progress bar (-s)
       echo "fetching $url"
       curl -O -s "$url" &
    done
    wait #wait for all background jobs to terminate
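
    Note that -O derives the local name from the last path segment, so with query-string style URLs like the ones above it may have nothing to name the file after; naming the outputs explicitly avoids that. A rough variation with made-up output names:

    i=1
    for url in $urls; do
       curl -s -o "page${i}.html" "$url" &
       i=$((i+1))
    done
    wait # wait for all background jobs to terminate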
    
  • 2020-12-13 19:33

    As of 7.66.0, the curl utility finally has built-in support for parallel downloads of multiple URLs within a single non-blocking process, which in most cases should be much faster and more resource-efficient than xargs and background spawning:

    curl -Z 'http://httpbin.org/anything/[1-9].{txt,html}' -o '#1.#2'
    

    This will download 18 links in parallel and write them out to 18 different files, also in parallel. The official announcement of this feature from Daniel Stenberg is here: https://daniel.haxx.se/blog/2019/07/22/curl-goez-parallel/
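
    If you want to cap how many transfers run at once, the companion --parallel-max option (default 50) can be added, e.g. limiting the same download to 4 concurrent transfers:

    curl -Z --parallel-max 4 'http://httpbin.org/anything/[1-9].{txt,html}' -o '#1.#2'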

  • 2020-12-13 19:46

    I came up with a solution based on fmt and xargs. The idea is to specify multiple URLs inside braces, http://example.com/page{1,2,3}.html, and run them in parallel with xargs. The following would start downloading in 3 processes:

    seq 1 50 | fmt -w40 | tr ' ' ',' \
    | awk -v url="http://example.com/" '{print url "page{" $1 "}.html"}' \
    | xargs -P3 -n1 curl -O
    

    This generates 4 lines of URLs and sends them to xargs, so the resulting curl invocations look like this:

    curl -O http://example.com/page{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}.html
    curl -O http://example.com/page{17,18,19,20,21,22,23,24,25,26,27,28,29}.html
    curl -O http://example.com/page{30,31,32,33,34,35,36,37,38,39,40,41,42}.html
    curl -O http://example.com/page{43,44,45,46,47,48,49,50}.html
    