Downloading a LOT of files using python


The usual pattern with multiprocessing is to create a job() function that takes arguments and performs some potentially CPU-bound work.

Example: (based on your code)

import urllib2
from multiprocessing import Pool

def job(url):
    # Derive the local file name from the last segment of the URL
    file_name = str(url.split('/')[-1])
    # Download the remote file and write it to disk
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    f.write(u.read())
    f.close()

pool = Pool()
# flist is the list of file paths from your existing code
urls = ["ftp://ftp.sec.gov/{0:s}".format(f) for f in flist]
pool.map(job, urls)

This will do a number of things:

  • Create a multiprocessing pool with as many workers as you have CPUs or CPU cores.
  • Create a list of inputs to the job() function.
  • Map the list of input urls to job() and wait for all jobs to complete.

Python's multiprocessing.Pool.map will take care of splitting up your input across the number of workers in the pool.
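If you want more control over that split, Pool() accepts a processes argument for the worker count and map() accepts a chunksize that batches inputs per worker. A minimal sketch, reusing job() and urls from the example above (the values 4 and 10 are just illustrative assumptions):

from multiprocessing import Pool

pool = Pool(processes=4)           # 4 worker processes instead of the default of one per CPU
pool.map(job, urls, chunksize=10)  # hand each worker 10 URLs per batch to cut inter-process overhead
pool.close()
pool.join()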

Another neat little thing I've found useful for this kind of work is to use the progress library, like this:

from multiprocessing import Pool
from progress.bar import Bar


def job(input):
    pass  # do some work


pool = Pool()
inputs = range(100)
bar = Bar('Processing', max=len(inputs))
# imap yields results as they complete, so the bar advances per finished job
for i in pool.imap(job, inputs):
    bar.next()
bar.finish()

This gives you a nice progress bar in your console as the jobs run, so you have some idea of progress, ETA, and so on.

I also find the requests library very useful here; it offers a much nicer API for dealing with web resources and downloading content.
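As a rough sketch of what the job() above could look like with requests (this assumes the files are reachable over HTTP(S), since requests does not handle ftp:// URLs; the URL in the last line is a placeholder):

import requests

def job(url):
    file_name = url.split('/')[-1]
    # Stream the response so large files are not read into memory all at once
    r = requests.get(url, stream=True)
    r.raise_for_status()
    with open(file_name, 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)

job("https://example.com/somefile.txt")  # placeholder URL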
