urllib2

python get headers only using urllib2

ⅰ亾dé卋堺 submitted on 2019-12-01 11:53:54
Question: I have to implement a function that gets headers only (without doing a GET or POST) using urllib2. Here is my function:

def getheadersonly(url, redirections=True):
    if not redirections:
        class MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
            def http_error_302(self, req, fp, code, msg, headers):
                return urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)
            http_error_301 = http_error_303 = http_error_307 = http_error_302
    cookieprocessor = urllib2.HTTPCookieProcessor()
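The excerpt above is cut off, but the usual way to fetch only headers is to issue an HTTP HEAD request. In Python 2 this meant overriding Request.get_method; in Python 3, where urllib2 became urllib.request, the Request constructor takes a method argument. A minimal sketch, assuming Python 3 and a placeholder URL:

```python
import urllib.request

# Build a request whose method is HEAD, so the server sends headers only.
req = urllib.request.Request("http://example.com/", method="HEAD")
print(req.get_method())  # HEAD

# Opening it would return a response whose .headers holds the header block
# and whose body is empty (not executed here to avoid a network call):
# with urllib.request.urlopen(req) as resp:
#     print(resp.headers)
```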

Python's urllib2 doesn't work on some sites

本小妞迷上赌 submitted on 2019-12-01 11:29:25
I found that you can't read from some sites using Python's urllib2 (or urllib). An example:

urllib2.urlopen("http://www.dafont.com/").read()  # Returns ''

These sites work when you visit them with a browser. I can even scrape them using PHP (I didn't try other languages). I have seen other sites with the same issue, but I can't remember the URLs at the moment. My questions are: What is the cause of this issue? Are there any workarounds?

Answer: I believe it gets blocked by the User-Agent. You can change the User-Agent using the following sample code:

USERAGENT = 'something'
HEADERS = {'User-Agent': USERAGENT}
req
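The answer's snippet is truncated; spelled out in Python 3 terms (urllib.request rather than urllib2), the workaround looks like this, with the User-Agent string illustrative:

```python
import urllib.request

# Some servers return an empty body to the default urllib User-Agent.
# Sending a browser-like User-Agent header is the usual workaround.
url = "http://www.dafont.com/"  # site from the question
headers = {"User-Agent": "Mozilla/5.0"}
req = urllib.request.Request(url, headers=headers)
print(req.get_header("User-agent"))  # Mozilla/5.0

# html = urllib.request.urlopen(req).read()  # network call, not executed here
```

Note that urllib stores header names capitalized ("User-agent"), which is why get_header is queried with that spelling.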

Python, gevent, urllib2.urlopen.read(), download accelerator

ぃ、小莉子 submitted on 2019-12-01 08:41:16
I am attempting to build a download accelerator for Linux. My program uses gevent, os, and urllib2. It receives a URL and attempts to download the file concurrently. All of my code is valid. My only problem is that urllib2.urlopen().read() blocks, which prevents me from running the .read() calls concurrently. This is the exception that gets thrown:

Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/gevent/greenlet.py", line 405, in run
    result = self._run(*self.args, **self.kwargs)
  File "gevent_concurrent_downloader.py", line 94, in childTasklet
    _tempRead = handle.read
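The standard fixes here are to monkey-patch the socket module (so urllib's reads yield to other greenlets) and to give each worker its own ranged request rather than sharing one response object. A hedged sketch in Python 3 terms, with the URL and chunk boundaries illustrative:

```python
# Under gevent, run `from gevent import monkey; monkey.patch_all()` before
# urllib is imported, so socket I/O becomes cooperative instead of blocking.
import urllib.request

def make_range_request(url, start, end):
    """Build a request for bytes [start, end] of the resource only."""
    req = urllib.request.Request(url)
    req.add_header("Range", "bytes=%d-%d" % (start, end))
    return req

# Each greenlet would open its own ranged request and write its slice to disk.
req = make_range_request("http://example.com/file.bin", 0, 1023)
print(req.get_header("Range"))  # bytes=0-1023
```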

(Repost) How do you download and install the urllib2 package for Python 3.x?

柔情痞子 submitted on 2019-12-01 08:28:36
Python 3.x does not need the urllib2 package installed: urllib and urllib2 were merged into a single package. So the question becomes: in Python 3.x, how do you use urllib2.urlopen()? Answer:

import urllib.request
resp = urllib.request.urlopen("http://www.baidu.com")

Source: https://www.cnblogs.com/zdlfb/p/6130724.html
Source: https://www.cnblogs.com/yourwit/p/11673096.html
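For reference, the commonly used urllib2 names map onto urllib.request and urllib.error in Python 3; a quick (non-exhaustive) sketch:

```python
import urllib.request
import urllib.error

# Python 2 -> Python 3 equivalents:
#   urllib2.urlopen    -> urllib.request.urlopen
#   urllib2.Request    -> urllib.request.Request
#   urllib2.HTTPError  -> urllib.error.HTTPError
#   urllib2.URLError   -> urllib.error.URLError
req = urllib.request.Request("http://www.baidu.com")
print(req.full_url)  # http://www.baidu.com
```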

Installing python modules through proxy

那年仲夏 submitted on 2019-12-01 08:09:54
I want to install a couple of Python packages that use easy_install. They use the urllib2 module in their setup scripts. I tried using the company proxy to let easy_install download the required packages. To test the proxy connection I tried the following code. I don't need to supply any credentials for the proxy in IE.

proxy = urllib2.ProxyHandler({"http": "http://mycompanyproxy-as-in-IE:8080"})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
site = urllib2.urlopen("http://google.com")

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27
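One point worth noting: install_opener only affects the current Python process, so an easy_install run launched separately will not see it. Exporting the proxy through the environment works for both, since urllib consults the <scheme>_proxy variables. A sketch with a hypothetical proxy host:

```python
import os
import urllib.request

# Point both urllib and child processes (easy_install/pip) at the proxy.
os.environ["http_proxy"] = "http://proxy.example.com:8080"   # hypothetical host
os.environ["https_proxy"] = "http://proxy.example.com:8080"

# urllib.request.getproxies() picks these environment variables up.
proxies = urllib.request.getproxies()
print(proxies["http"])  # http://proxy.example.com:8080
```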

urllib2/requests and HTTP relative path

断了今生、忘了曾经 submitted on 2019-12-01 06:37:33
Question: How can I force the urllib2/requests modules to use relative paths instead of full/absolute URLs? When I send a request using urllib2/requests, I see in my proxy that it resolves to:

GET https://xxxx/path/to/something HTTP/1.1

Unfortunately, the server I'm sending it to cannot understand that request and gives me a weird 302. I know the absolute form is in the RFC; it just doesn't work, and I'm trying to fix it in Python code. I don't have access to that server. A relative path works well:

GET /path/to/something
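urllib2 and requests switch to the absolute-URI request form when they talk through a proxy; to control the request line directly you can drop down to http.client (httplib in Python 2), which sends the origin-form line. A sketch, with the host a placeholder for the redacted server:

```python
import http.client

# http.client issues the origin-form request line ("GET /path/to/something"),
# unlike urllib2/requests behind a proxy, which send the absolute URI.
conn = http.client.HTTPSConnection("host.example")  # placeholder host
print(conn.host)  # host.example

# Sending would look like this (needs a reachable server, so not executed here):
# conn.request("GET", "/path/to/something")
# resp = conn.getresponse()
```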

urllib2 python (Transfer-Encoding: chunked)

点点圈 submitted on 2019-12-01 06:34:00
I used the following Python code to download an HTML page:

response = urllib2.urlopen(current_URL)
msg = response.read()
print msg

For a page such as this one, it opens the URL without error but then prints only part of the HTML page! Below are the HTTP headers of the page. I think the problem is due to "Transfer-Encoding: chunked": it seems urllib2 returns only the first chunk, and I have difficulty reading the remaining chunks. How can I read the remaining chunks?

Server: nginx/1.0.5
Date: Wed, 27 Feb 2013 14:41:28 GMT
Content-Type: text/html;charset=UTF-8
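A single read() can legitimately return fewer bytes than the full body, so the robust pattern is to read in a loop until an empty read signals EOF. A sketch of that loop, demonstrated on an in-memory stream rather than a live response:

```python
import io

def read_all(fp, chunk_size=8192):
    """Read from a file-like object until EOF, collecting every chunk."""
    parts = []
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:  # an empty read means the stream is exhausted
            break
        parts.append(chunk)
    return b"".join(parts)

# With urllib2 this would be: msg = read_all(urllib2.urlopen(current_URL))
data = b"abc" * 10000
print(read_all(io.BytesIO(data), chunk_size=1024) == data)  # True
```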

Why urllib returns garbage from some wikipedia articles?

徘徊边缘 submitted on 2019-12-01 06:20:41
>>> import urllib2
>>> good_article = 'http://en.wikipedia.org/wiki/Wikipedia'
>>> bad_article = 'http://en.wikipedia.org/wiki/India'
>>> req1 = urllib2.Request(good_article)
>>> req2 = urllib2.Request(bad_article)
>>> req1.add_header('User-Agent', 'Mozilla/5.0')
>>> req2.add_header('User-Agent', 'Mozilla/5.0')
>>> result1 = urllib2.urlopen(req1)
>>> result2 = urllib2.urlopen(req2)
>>> result1.readline()
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
>>> result2.readline()
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03
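The bytes \x1f\x8b\x08 at the start of the second response are the gzip magic number: the "garbage" is a gzip-compressed body, which the server chose to send for that article. Decompressing it recovers the HTML. A sketch of the decompression step, exercised here on locally compressed data:

```python
import gzip
import io

def decompress_if_gzipped(data):
    """Return data decompressed if it starts with the gzip magic number."""
    if data[:2] == b"\x1f\x8b":
        return gzip.GzipFile(fileobj=io.BytesIO(data)).read()
    return data

# With urllib2: html = decompress_if_gzipped(urllib2.urlopen(req2).read())
sample = gzip.compress(b"<html>India</html>")
print(decompress_if_gzipped(sample))  # b'<html>India</html>'
```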