urllib

Parallel fetching of files

你离开我真会死。 Posted on 2019-11-27 18:03:26
In order to download files, I'm creating a urlopen object (urllib2 class) and reading it in chunks. I would like to connect to the server several times and download the file in six different sessions; that way, the download should be faster. Many download managers have this feature. I thought about specifying the part of the file I would like to download in each session, and somehow processing all the sessions at the same time. I'm not sure how I can achieve this.

Sounds like you want to use one of the flavors of HTTP Range that are available. edit: Updated link to point to the w3.org stored copy.
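A minimal sketch of a ranged download in six pieces, assuming the server honors the Range header and reports Content-Length; the URL and filename are placeholders, and each fetch_range call could be moved into its own thread for true parallelism:

    import urllib2

    def fetch_range(url, start, end):
        # Ask the server for bytes start..end (inclusive) of the resource
        req = urllib2.Request(url)
        req.add_header('Range', 'bytes=%d-%d' % (start, end))
        return urllib2.urlopen(req).read()

    url = 'http://example.com/big.bin'  # placeholder URL
    size = int(urllib2.urlopen(url).info().getheader('Content-Length'))
    part = size // 6
    chunks = []
    for i in range(6):
        end = size - 1 if i == 5 else (i + 1) * part - 1
        chunks.append(fetch_range(url, i * part, end))
    with open('big.bin', 'wb') as f:
        f.write(''.join(chunks))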

How to route urllib requests through the TOR network? [duplicate]

心已入冬 Posted on 2019-11-27 17:45:27
This question already has an answer here: How to make urllib2 requests through Tor in Python? (12 answers)

How to route urllib requests through the TOR network? This works for me (using urllib2, haven't tried urllib):

    def req(url):
        proxy_support = urllib2.ProxyHandler({"http": "127.0.0.1:8118"})
        opener = urllib2.build_opener(proxy_support)
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]
        return opener.open(url).read()

    print req('http://google.com')

Tor works as a proxy, right? So ask yourself "How do I use proxies in urllib?" Now, when I look at the docs, the first thing I see is urllib.urlopen...
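For plain urllib (rather than urllib2), a sketch under the same assumption that an HTTP proxy in front of Tor (e.g. Privoxy) is listening on 127.0.0.1:8118:

    import urllib

    # Route the request through the local HTTP proxy in front of Tor
    proxies = {'http': 'http://127.0.0.1:8118'}
    print urllib.urlopen('http://google.com', proxies=proxies).read()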

Catch specific HTTP error in Python

不打扰是莪最后的温柔 Posted on 2019-11-27 17:40:57
I want to catch a specific HTTP error, not every member of the entire family. What I was trying to do is:

    import urllib2
    try:
        urllib2.urlopen("some url")
    except urllib2.HTTPError:
        <whatever>

but what I end up with is catching any kind of HTTP error. I want to catch it only if the specified webpage doesn't exist, which is probably HTTP error 404, but I don't know how to catch only error 404 and let the system run the default handler for other events. Any suggestions?

Tim Pietzcker: Just catch urllib2.HTTPError, handle it, and if it's not error 404, simply use raise to re-raise the exception.
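A short sketch of that pattern; the URL is a placeholder:

    import urllib2

    try:
        urllib2.urlopen('http://example.com/missing')  # placeholder URL
    except urllib2.HTTPError as e:
        if e.code == 404:
            print 'page does not exist'
        else:
            raise  # re-raise anything that is not a 404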

Python, opposite function urllib.urlencode

非 Y 不嫁゛ Posted on 2019-11-27 17:22:10
How can I convert data back into a dict after processing it with urllib.urlencode? urllib.urldecode does not exist.

As the docs for urlencode say, the urlparse module provides the functions parse_qs() and parse_qsl(), which are used to parse query strings into Python data structures. (In older Python releases, they were in the cgi module.) So, for example:

    >>> import urllib
    >>> import urlparse
    >>> d = {'a': 'b', 'c': 'd'}
    >>> s = urllib.urlencode(d)
    >>> s
    'a=b&c=d'
    >>> d1 = urlparse.parse_qs(s)
    >>> d1
    {'a': ['b'], 'c': ['d']}

The obvious difference between the original dictionary d and the "round-tripped" one d1 is that the latter's values are lists, since a query string may carry more than one value for the same key.
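If each key is known to appear only once, a flat dict can be recovered with parse_qsl, which returns key/value pairs instead of lists; a small sketch continuing the session above:

    >>> dict(urlparse.parse_qsl(s))
    {'a': 'b', 'c': 'd'}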

In Python, how do I use urllib to see if a website is 404 or 200?

孤街醉人 Posted on 2019-11-27 17:21:54
How do I get the status code of the response through urllib? The getcode() method (added in Python 2.6) returns the HTTP status code that was sent with the response, or None if the URL is not an HTTP URL.

    >>> a = urllib.urlopen('http://www.google.com/asdfsf')
    >>> a.getcode()
    404
    >>> a = urllib.urlopen('http://www.google.com/')
    >>> a.getcode()
    200

Joe Holloway: You can use urllib2 as well:

    import urllib2

    req = urllib2.Request('http://www.python.org/fish.html')
    try:
        resp = urllib2.urlopen(req)
    except urllib2.HTTPError as e:
        if e.code == 404:
            pass  # do something...
        else:
            pass  # ...
    except urllib2.URLError as e:
        pass  # not an HTTP error...

Get size of a file before downloading in Python

不想你离开。 Posted on 2019-11-27 17:10:07
I'm downloading an entire directory from a web server. It works OK, but I can't figure out how to get the file size before the download, to compare whether it was updated on the server or not. Can this be done as if I were downloading the file from an FTP server?

    import urllib
    import re

    url = "http://www.someurl.com"

    # Download the page locally
    f = urllib.urlopen(url)
    html = f.read()
    f.close()

    f = open("temp.htm", "w")
    f.write(html)
    f.close()

    # List only the .txt / .zip files
    fnames = re.findall(r'^.*<a href="(\w+(?:\.txt|\.zip)?)".*$', html, re.MULTILINE)

    for fname in fnames:
        print fname, "..."
        f = urllib...
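One way to get the size without downloading is to read the Content-Length response header; a sketch, assuming the server reports it (the URL is a placeholder):

    import urllib

    remote = urllib.urlopen('http://www.someurl.com/file.zip')  # placeholder URL
    size = remote.info().getheader('Content-Length')  # None if not reported
    print 'remote size:', size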

Why can't I get Python's urlopen() method to work on Windows?

家住魔仙堡 Posted on 2019-11-27 16:08:38
Why isn't this simple Python code working?

    import urllib
    file = urllib.urlopen('http://www.google.com')
    print file.read()

This is the error that I get:

    Traceback (most recent call last):
      File "C:\workspace\GarchUpdate\src\Practice.py", line 26, in <module>
        file = urllib.urlopen('http://www.google.com')
      File "C:\Python26\lib\urllib.py", line 87, in urlopen
        return opener.open(url)
      File "C:\Python26\lib\urllib.py", line 206, in open
        return getattr(self, name)(url)
      File "C:\Python26\lib\urllib.py", line 345, in open_http
        h.endheaders()
      File "C:\Python26\lib\httplib.py", line 892, in endheaders
        ...

Python crawler: the urllib module and the UA anti-crawler mechanism

℡╲_俬逩灬. Posted on 2019-11-27 15:47:41
Method: use the urlencode function together with urllib.request.urlopen().

    import urllib.request
    import urllib.parse

    url = 'https://www.sogou.com/web?'
    # Wrap the parameters carried in the GET request's URL into a dict
    param = {'query': '周杰伦'}
    # Percent-encode the non-ASCII characters in the URL
    param = urllib.parse.urlencode(param)
    # Splice the encoded values back onto the URL
    url += param
    response = urllib.request.urlopen(url=url)
    data = response.read()
    with open('./周杰伦1.html', 'wb') as fp:
        fp.write(data)
    print('写入文件完毕')  # "finished writing the file"

To open the browser's developer tools, press F12 or right-click and choose Inspect. The Network tab is a packet-capture tool: refresh the page and you can see the page's resources and the request header information, including the UA. Clicking any request in the capture tool shows all of its request and response information. The parts used most are Headers and Response: Response Headers holds the response header information, and Request Headers holds the request information.

Anti-crawler mechanism: the site inspects the UA of each request, and if it finds that the UA belongs to a crawler program, it refuses to serve the page data.
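A minimal sketch of getting past such a UA check by sending a browser-like User-Agent; the UA string here is an arbitrary example, not one from the original post:

    import urllib.request

    url = 'https://www.sogou.com/web?query=test'
    # Pretend to be an ordinary browser so the site's UA check passes
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    req = urllib.request.Request(url=url, headers=headers)
    page = urllib.request.urlopen(req).read()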

Parse the HTML code for a whole webpage scrolled down

可紊 Posted on 2019-11-27 14:33:39
    from bs4 import BeautifulSoup
    import urllib, sys

    reload(sys)
    sys.setdefaultencoding("utf-8")

    r = urllib.urlopen('https://twitter.com/ndtv').read()
    soup = BeautifulSoup(r)

This gives me only part of the page, not the whole web page scrolled down to the end, which is what I want.

EDIT:

    from selenium import webdriver
    from selenium.common.exceptions import StaleElementReferenceException, TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from bs4 import BeautifulSoup
    import ...
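A sketch of the usual approach: drive a real browser, scroll to the bottom a few times, and only then hand the rendered HTML to BeautifulSoup. The scroll count and delay are arbitrary choices, and a local Firefox driver is assumed:

    import time
    from selenium import webdriver
    from bs4 import BeautifulSoup

    driver = webdriver.Firefox()  # assumes a local Firefox/geckodriver setup
    driver.get('https://twitter.com/ndtv')
    for _ in range(10):  # arbitrary number of scrolls
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(2)  # give the page time to load the next batch of content
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()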

How to use urllib to download image from web

淺唱寂寞╮ Posted on 2019-11-27 14:33:04
I'm trying to download an image using this code:

    from urllib import urlretrieve
    urlretrieve('http://gdimitriou.eu/wp-content/uploads/2008/04/google-image-search.jpg',
                'google-image-search.jpg')

It worked: the image was downloaded and can be opened by any image viewer.

However, the code below is not working. The downloaded image is only 2 KB and can't be opened by any image viewer:

    from urllib import urlretrieve
    urlretrieve('http://upload.wikimedia.org/wikipedia/en/4/44/Zindagi1976.jpg',
                'Zindagi1976.jpg')

Here is the result, in HTML format:

    ERROR: The requested URL could not be retrieved
    While...
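The 2 KB file is almost certainly the server's HTML error page; some hosts reject urllib's default Python user agent. A sketch of retrying with a browser-like UA, offered as an assumption about the cause rather than a confirmed fix:

    import urllib

    class BrowserOpener(urllib.FancyURLopener):
        # Overriding `version` changes the User-Agent header urllib sends
        version = 'Mozilla/5.0'

    BrowserOpener().retrieve(
        'http://upload.wikimedia.org/wikipedia/en/4/44/Zindagi1976.jpg',
        'Zindagi1976.jpg')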