Trying to access the Internet using urllib2 in Python

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-12 17:23:39

Question


I'm trying to write a program that will (among other things) get text or source code from a predetermined website. I'm learning Python to do this, and most sources have told me to use urllib2. Just as a test, I tried this code:

import urllib2
response = urllib2.urlopen('http://www.python.org')
html = response.read()

Instead of acting in any expected way, the shell just sits there, as if it's waiting for input. There isn't even a ">>>" or "..." prompt. The only way to exit this state is with [Ctrl]+C. When I do, I get a whole bunch of error messages, like:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/m/mls/pkg/ix86-Linux-RHEL5/lib/python2.5/urllib2.py", line 124, in urlopen
    return _opener.open(url, data)
  File "/m/mls/pkg/ix86-Linux-RHEL5/lib/python2.5/urllib2.py", line 381, in open
    response = self._open(req, data)

I'd appreciate any feedback. Is there a different tool than urllib2 that I should use, or can you give advice on how to fix this? I'm using a networked computer at work, and I'm not entirely sure how the shell is configured or how that might affect anything.


Answer 1:


With 99.999% probability, it's a proxy issue. Python is incredibly bad at detecting the right HTTP proxy to use, and when it cannot find the right one, it just hangs and eventually times out.

So first you have to find out which proxy should be used; check your browser's options (Tools -> Internet Options -> Connections -> LAN Setup... in IE, etc.). If it's using a script to autoconfigure, you'll have to fetch the script (which should be some sort of JavaScript) and find out where your request is supposed to go. If no script is specified and the "automatically determine" option is ticked, you might as well just ask an IT guy at your company.
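A quick way to check what Python itself will pick up from the environment (HTTP_PROXY and friends) is urllib.getproxies() in Python 2, which moved to urllib.request.getproxies() in Python 3. A minimal sketch, shown with the Python 3 name:

```python
import urllib.request  # in Python 2, getproxies() lives in the urllib module

# Returns a dict of proxy settings Python detected from the environment,
# e.g. {'http': 'http://www.someproxy.com:3128'}, or {} if none were found.
proxies = urllib.request.getproxies()
print(proxies)
```

If this prints an empty dict but your browser does use a proxy, that mismatch is exactly the hanging scenario described above.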

I assume you're using Python 2.x. From the Python docs on urllib:

import urllib

# Use http://www.someproxy.com:3128 for HTTP proxying
proxies = {'http': 'http://www.someproxy.com:3128'}
filehandle = urllib.urlopen(some_url, proxies=proxies)

Note that ProxyHandler's automatic detection of default proxy values is exactly what already happens when you call urlopen, so relying on it probably won't work in your case.

If you really want urllib2, you'll have to specify a ProxyHandler explicitly, like the example on that page. Authentication might or might not be required (usually it's not).
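A minimal sketch of that approach, shown with Python 3's urllib.request (which is what urllib2 became); the proxy address is a placeholder to be replaced with the one from your browser settings:

```python
import urllib.request  # this module was called urllib2 in Python 2

# Placeholder proxy address -- substitute your company's actual proxy.
proxy_handler = urllib.request.ProxyHandler(
    {'http': 'http://www.someproxy.com:3128'})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)

# After install_opener, every plain urlopen() call goes through the proxy:
# html = urllib.request.urlopen('http://www.python.org').read()
```

Alternatively, skip install_opener and call opener.open(url) directly if you only want the proxy for some requests.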




Answer 2:


This isn't a good answer to "how do I do this with urllib2", but let me suggest python-requests. The whole reason it exists is that the author found urllib2 to be an unwieldy mess. And he's probably right.
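As a rough sketch (requests is a third-party package, installed with `pip install requests`), the question's snippet would look something like this; the actual network calls are left commented out:

```python
import requests  # third-party: pip install requests

# The urllib2 snippet from the question, rewritten. With a timeout set,
# requests raises an exception on connection problems instead of hanging.
# response = requests.get('http://www.python.org', timeout=5)
# html = response.text

# Proxies are passed per call instead of via handler objects:
# response = requests.get('http://www.python.org',
#                         proxies={'http': 'http://www.someproxy.com:3128'},
#                         timeout=5)

# Building a request object does not touch the network:
prepared = requests.Request('GET', 'http://www.python.org').prepare()
print(prepared.method, prepared.url)
```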




Answer 3:


That is very weird; have you tried a different URL?
Otherwise there is httplib, although it is more complicated. Here's your example using httplib:

import httplib as h  # Python 2; renamed to http.client in Python 3

conn = h.HTTPConnection('www.python.org')
conn.connect()  # optional -- request() connects automatically
conn.request('GET', '/fish.html')
response = conn.getresponse()
if response.status == h.OK:  # h.OK == 200
    html = response.read()

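For reference, a sketch of the same request with Python 3's http.client, which is what httplib was renamed to. Constructing the connection object does not touch the network; only request() does, so that part is left commented out here:

```python
import http.client  # the Python 3 name for Python 2's httplib

conn = http.client.HTTPConnection('www.python.org', timeout=5)
# conn.request('GET', '/fish.html')
# response = conn.getresponse()
# if response.status == http.client.OK:
#     html = response.read()
print(conn.host, conn.timeout)
```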


Answer 4:


I get a 404 error almost immediately (no hanging):

>>> import urllib2
>>> response = urllib2.urlopen('http://www.python.org/fish.html')
Traceback (most recent call last):
  ...
urllib2.HTTPError: HTTP Error 404: Not Found

If I try to contact an address that doesn't have an HTTP server running, it hangs for quite a while until the timeout happens. You can shorten the wait by passing the timeout parameter to urlopen:

>>> response = urllib2.urlopen('http://cs.princeton.edu/fish.html', timeout=5)
Traceback (most recent call last):
  ...
urllib2.URLError: <urlopen error timed out>
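Besides the per-call timeout parameter, there is a process-wide default that also covers code which never passes timeout= itself; a sketch using socket.setdefaulttimeout, which exists in both Python 2 and 3:

```python
import socket

# Apply a 5-second timeout to every new socket in this process,
# including the ones opened internally by urlopen().
socket.setdefaulttimeout(5)
print(socket.getdefaulttimeout())  # 5.0
```

This is a blunt instrument (it affects every socket the process opens), so the explicit timeout= argument is usually preferable when you control the call site.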


Source: https://stackoverflow.com/questions/8761583/trying-to-access-the-internet-using-urllib2-in-python
