Difference between Python urllib.urlretrieve() and wget

Peter Lyons

The answer is quite simple. Python's urllib and urllib2 are nowhere near as mature and robust as they could be. Even better than wget in my experience is cURL. I've written code that downloads gigabytes of files over HTTP with file sizes ranging from 50 KB to over 2 GB. To my knowledge, cURL is the most reliable piece of software on the planet right now for this task. I don't think Python, wget, or even most web browsers can match it in terms of correctness and robustness of implementation. On a modern enough Python, using urllib2 in exactly the right way, downloads can be made pretty reliable, but I still run a cURL subprocess, and that is absolutely rock solid.
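For what it's worth, "the exact right way" with urllib2 mostly means streaming the response to disk in chunks instead of reading it all into memory at once. Here is a minimal sketch of that pattern, assuming Python 2 with urllib2; the URL, output filename, chunk size, and timeout are placeholder values, not something from the original answer:

import urllib2

url = 'http://example.com/really_big_file.html'
# Stream the body to disk in 64 KB chunks so the whole file never sits in memory.
response = urllib2.urlopen(url, timeout=30)  # raises urllib2.URLError on network failure
with open('local_outputfile.html', 'wb') as out:
    while True:
        chunk = response.read(64 * 1024)
        if not chunk:
            break
        out.write(chunk)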

Another way to state this is that cURL does one thing only and it does it better than any other software because it has had much more development and refinement. Python's urllib2 is serviceable and convenient and works well enough for small to average workloads, but cURL is way ahead in terms of reliability.

Also, cURL has numerous options for tuning its reliability behavior, including retry counts, timeout values, and so on.
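As an illustration, here is a rough sketch of shelling out to curl from Python with a few of those reliability options turned on; the specific flag values, URL, and filename are illustrative assumptions, not recommendations from the answer:

import subprocess

exit_code = subprocess.call([
    'curl',
    '--fail',                   # treat HTTP 4xx/5xx responses as errors (non-zero exit)
    '--location',               # follow redirects
    '--retry', '5',             # retry transient failures up to 5 times
    '--connect-timeout', '10',  # give up on connecting after 10 seconds
    '--max-time', '3600',       # abort the whole transfer after an hour
    '-o', 'local_outputfile.html',
    'http://example.com/really_big_file.html',
])
if exit_code != 0:
    raise RuntimeError('curl failed with exit code %d' % exit_code)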

If you are using:

page = urllib.urlopen('http://example.com/really_big_file.html').read()

you are creating a 500 MB string in memory, which may well tax your machine, make it slow, and cause the connection to time out. In that case you should be using:

(filename, headers) = urllib.urlretrieve('http://...', 'local_outputfile.html')

which streams the download to a local file and won't tax the interpreter.

It is worth noting that urllib.urlretrieve() is built on urllib.urlopen(), which is deprecated since Python 2.6 in favor of urllib2.urlopen(); in Python 3, the equivalent urllib.request.urlretrieve() is documented as a legacy interface.
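For reference, on Python 3 the same download-to-file call lives in urllib.request; a minimal sketch, with the URL and filename as placeholders:

from urllib.request import urlretrieve

# Streams the response straight to the named file; returns the local filename and headers.
filename, headers = urlretrieve('http://example.com/really_big_file.html',
                                'local_outputfile.html')
print(filename, headers.get('Content-Length'))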
