urllib2

python get headers only using urllib2

ⅰ亾dé卋堺 submitted on 2019-12-01 11:53:54
Question: I have to implement a function that gets headers only (without doing a GET or POST) using urllib2. Here is my function:

def getheadersonly(url, redirections=True):
    if not redirections:
        class MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
            def http_error_302(self, req, fp, code, msg, headers):
                return urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)
            http_error_301 = http_error_303 = http_error_307 = http_error_302
    cookieprocessor = urllib2.HTTPCookieProcessor()
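The excerpt above is cut off, but the usual way to fetch only headers is to issue an HTTP HEAD request. In Python 2 this meant overriding Request.get_method; in Python 3, where urllib2 became urllib.request, the Request constructor takes a method argument. A minimal sketch, assuming Python 3 and a placeholder URL:

```python
import urllib.request

# Build a request whose method is HEAD, so the server sends headers only.
req = urllib.request.Request("http://example.com/", method="HEAD")
print(req.get_method())  # HEAD

# Opening it would return a response whose .headers holds the header block
# and whose body is empty (not executed here to avoid a network call):
# with urllib.request.urlopen(req) as resp:
#     print(resp.headers)
```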

Python's urllib2 doesn't work on some sites

本小妞迷上赌 submitted on 2019-12-01 11:29:25
I found that you can't read from some sites using Python's urllib2 (or urllib). An example:

urllib2.urlopen("http://www.dafont.com/").read()  # Returns ''

These sites work when you visit them with a browser. I can even scrape them using PHP (I didn't try other languages). I have seen other sites with the same issue, but I can't remember the URLs at the moment. My questions are: What is the cause of this issue? Are there any workarounds?

Answer: I believe it gets blocked by the User-Agent. You can change the User-Agent using the following sample code:

USERAGENT = 'something'
HEADERS = {'User-Agent': USERAGENT}
req
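The answer's snippet is truncated; spelled out in Python 3 terms (urllib.request rather than urllib2), the workaround looks like this, with the User-Agent string illustrative:

```python
import urllib.request

# Some servers return an empty body to the default urllib User-Agent.
# Sending a browser-like User-Agent header is the usual workaround.
url = "http://www.dafont.com/"  # site from the question
headers = {"User-Agent": "Mozilla/5.0"}
req = urllib.request.Request(url, headers=headers)
print(req.get_header("User-agent"))  # Mozilla/5.0

# html = urllib.request.urlopen(req).read()  # network call, not executed here
```

Note that urllib stores header names capitalized ("User-agent"), which is why get_header is queried with that spelling.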

Python, gevent, urllib2.urlopen.read(), download accelerator

ぃ、小莉子 submitted on 2019-12-01 08:41:16
I am attempting to build a download accelerator for Linux. My program uses gevent, os, and urllib2. It receives a URL and attempts to download the file concurrently. All of my code is valid. My only problem is that urllib2.urlopen().read() blocks, which prevents me from running the .read() calls concurrently. This is the exception that gets thrown:

Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/gevent/greenlet.py", line 405, in run
    result = self._run(*self.args, **self.kwargs)
  File "gevent_concurrent_downloader.py", line 94, in childTasklet
    _tempRead = handle.read
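The standard fixes here are to monkey-patch the socket module (so urllib's reads yield to other greenlets) and to give each worker its own ranged request rather than sharing one response object. A hedged sketch in Python 3 terms, with the URL and chunk boundaries illustrative:

```python
# Under gevent, run `from gevent import monkey; monkey.patch_all()` before
# urllib is imported, so socket I/O becomes cooperative instead of blocking.
import urllib.request

def make_range_request(url, start, end):
    """Build a request for bytes [start, end] of the resource only."""
    req = urllib.request.Request(url)
    req.add_header("Range", "bytes=%d-%d" % (start, end))
    return req

# Each greenlet would open its own ranged request and write its slice to disk.
req = make_range_request("http://example.com/file.bin", 0, 1023)
print(req.get_header("Range"))  # bytes=0-1023
```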

(Repost) How do you download and install the urllib2 package for Python 3.x?

柔情痞子 submitted on 2019-12-01 08:28:36
Python 3.x does not need the urllib2 package installed: urllib and urllib2 were merged into a single package. So the question becomes: in Python 3.x, how do you use urllib2.urlopen()? Answer:

import urllib.request
resp = urllib.request.urlopen("http://www.baidu.com")

Source: https://www.cnblogs.com/zdlfb/p/6130724.html
Source: https://www.cnblogs.com/yourwit/p/11673096.html
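For reference, the commonly used urllib2 names map onto urllib.request and urllib.error in Python 3; a quick (non-exhaustive) sketch:

```python
import urllib.request
import urllib.error

# Python 2 -> Python 3 equivalents:
#   urllib2.urlopen    -> urllib.request.urlopen
#   urllib2.Request    -> urllib.request.Request
#   urllib2.HTTPError  -> urllib.error.HTTPError
#   urllib2.URLError   -> urllib.error.URLError
req = urllib.request.Request("http://www.baidu.com")
print(req.full_url)  # http://www.baidu.com
```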

Installing python modules through proxy

那年仲夏 submitted on 2019-12-01 08:09:54
I want to install a couple of Python packages that use easy_install. They use the urllib2 module in their setup scripts. I tried using the company proxy to let easy_install download the required packages. To test the proxy connection I tried the following code. I don't need to supply any credentials for the proxy in IE.

proxy = urllib2.ProxyHandler({"http": "http://mycompanyproxy-as-in-IE:8080"})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
site = urllib2.urlopen("http://google.com")

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27
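One point worth noting: install_opener only affects the current Python process, so an easy_install run launched separately will not see it. Exporting the proxy through the environment works for both, since urllib consults the <scheme>_proxy variables. A sketch with a hypothetical proxy host:

```python
import os
import urllib.request

# Point both urllib and child processes (easy_install/pip) at the proxy.
os.environ["http_proxy"] = "http://proxy.example.com:8080"   # hypothetical host
os.environ["https_proxy"] = "http://proxy.example.com:8080"

# urllib.request.getproxies() picks these environment variables up.
proxies = urllib.request.getproxies()
print(proxies["http"])  # http://proxy.example.com:8080
```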

urllib2/requests and HTTP relative path

断了今生、忘了曾经 submitted on 2019-12-01 06:37:33
Question: How can I force the urllib2/requests modules to use relative paths instead of full/absolute URLs? When I send a request using urllib2/requests, I see in my proxy that it resolves to:

GET https://xxxx/path/to/something HTTP/1.1

Unfortunately, the server I'm sending it to cannot understand that request and gives me a weird 302. I know the absolute form is in the RFC; it just doesn't work, and I'm trying to fix it in Python code. I don't have access to that server. A relative path works well:

GET /path/to/something
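urllib2 and requests switch to the absolute-URI request form when they talk through a proxy; to control the request line directly you can drop down to http.client (httplib in Python 2), which sends the origin-form line. A sketch, with the host a placeholder for the redacted server:

```python
import http.client

# http.client issues the origin-form request line ("GET /path/to/something"),
# unlike urllib2/requests behind a proxy, which send the absolute URI.
conn = http.client.HTTPSConnection("host.example")  # placeholder host
print(conn.host)  # host.example

# Sending would look like this (needs a reachable server, so not executed here):
# conn.request("GET", "/path/to/something")
# resp = conn.getresponse()
```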

urllib2 python (Transfer-Encoding: chunked)

点点圈 submitted on 2019-12-01 06:34:00
I used the following Python code to download an HTML page:

response = urllib2.urlopen(current_URL)
msg = response.read()
print msg

For a page such as this one, it opens the URL without error but then prints only part of the HTML page! Below are the HTTP headers of the page. I think the problem is due to "Transfer-Encoding: chunked": it seems urllib2 returns only the first chunk, and I have difficulty reading the remaining chunks. How can I read the remaining chunks?

Server: nginx/1.0.5
Date: Wed, 27 Feb 2013 14:41:28 GMT
Content-Type: text/html;charset=UTF-8
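A single read() can legitimately return fewer bytes than the full body, so the robust pattern is to read in a loop until an empty read signals EOF. A sketch of that loop, demonstrated on an in-memory stream rather than a live response:

```python
import io

def read_all(fp, chunk_size=8192):
    """Read from a file-like object until EOF, collecting every chunk."""
    parts = []
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:  # an empty read means the stream is exhausted
            break
        parts.append(chunk)
    return b"".join(parts)

# With urllib2 this would be: msg = read_all(urllib2.urlopen(current_URL))
data = b"abc" * 10000
print(read_all(io.BytesIO(data), chunk_size=1024) == data)  # True
```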

Why urllib returns garbage from some wikipedia articles?

徘徊边缘 submitted on 2019-12-01 06:20:41
>>> import urllib2
>>> good_article = 'http://en.wikipedia.org/wiki/Wikipedia'
>>> bad_article = 'http://en.wikipedia.org/wiki/India'
>>> req1 = urllib2.Request(good_article)
>>> req2 = urllib2.Request(bad_article)
>>> req1.add_header('User-Agent', 'Mozilla/5.0')
>>> req2.add_header('User-Agent', 'Mozilla/5.0')
>>> result1 = urllib2.urlopen(req1)
>>> result2 = urllib2.urlopen(req2)
>>> result1.readline()
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
>>> result2.readline()
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03
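The bytes \x1f\x8b\x08 at the start of the second response are the gzip magic number: the "garbage" is a gzip-compressed body, which the server chose to send for that article. Decompressing it recovers the HTML. A sketch of the decompression step, exercised here on locally compressed data:

```python
import gzip
import io

def decompress_if_gzipped(data):
    """Return data decompressed if it starts with the gzip magic number."""
    if data[:2] == b"\x1f\x8b":
        return gzip.GzipFile(fileobj=io.BytesIO(data)).read()
    return data

# With urllib2: html = decompress_if_gzipped(urllib2.urlopen(req2).read())
sample = gzip.compress(b"<html>India</html>")
print(decompress_if_gzipped(sample))  # b'<html>India</html>'
```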