HTTPError when using urllib2 read()

Question

I'm trying to scrape a web page using urllib2 and BeautifulSoup. It was working fine, but after I added an input() call in a different part of my code to debug something, I got an HTTPError. When I ran my program again, I got an HTTPError when calling read(). The stack trace is below:

[2013-07-17 16:47:07,415: ERROR/MainProcess] Task program.tasks.testTask[460db7cf-ff58-4a51-9c0f-749affc66abb] raised exception: IOError()
16:47:07 celeryd.1 | Traceback (most recent call last):
16:47:07 celeryd.1 |   File "/Users/username/folder/server2/venv/lib/python2.7/site-packages/celery/execute/trace.py", line 181, in trace_task
16:47:07 celeryd.1 |     R = retval = fun(*args, **kwargs)
16:47:07 celeryd.1 |   File "/Users/username/folder/server2/program/tasks.py", line 193, in run
16:47:07 celeryd.1 |     self.get_top_itunes_game_by_genre(genre)
16:47:07 celeryd.1 |   File "/Users/username/folder/server2/program/tasks.py", line 244, in get_top_itunes_game_by_genre
16:47:07 celeryd.1 |     game_page = BeautifulSoup(urllib2.urlopen(game_url).read())
16:47:07 celeryd.1 |   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
16:47:07 celeryd.1 |     return _opener.open(url, data, timeout)
16:47:07 celeryd.1 |   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
16:47:07 celeryd.1 |     response = meth(req, response)
16:47:07 celeryd.1 |   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
16:47:07 celeryd.1 |     'http', request, response, code, msg, hdrs)
16:47:07 celeryd.1 |   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
16:47:07 celeryd.1 |     return self._call_chain(*args)
16:47:07 celeryd.1 |   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
16:47:07 celeryd.1 |     result = func(*args)
16:47:07 celeryd.1 |   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
16:47:07 celeryd.1 |     raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
16:47:07 celeryd.1 | HTTPError

Here's the code:

for game_url in urls:
    game_page = BeautifulSoup(urllib2.urlopen(game_url).read())
    # code to process page

Does anyone know why I started getting this error? Thanks!

Answer 1:

Changing my comment into an answer:

The page that you're scraping responded with an error status (most likely a 4xx), and urllib2 raises an HTTPError for that, as it says it does in the docs. It is your job to catch that exception and do something with it: log it, skip the page, or what have you. Your traceback doesn't display the code or reason for the HTTPError, but they are there. Look at the 'code' and 'reason' attributes of the error.
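As a rough sketch of what catching the exception could look like (not necessarily how you'd want to structure it): the helper name fetch_html and its opener parameter are made up for illustration, and the import fallback is only there so the same snippet also runs on Python 3, where these classes live in urllib.request and urllib.error.

```python
try:
    import urllib2  # Python 2, as in the question
    from urllib2 import HTTPError
except ImportError:
    # Python 3 fallback: same classes, different module names
    import urllib.request as urllib2
    from urllib.error import HTTPError


def fetch_html(game_url, opener=urllib2.urlopen):
    """Return the raw page body, or None if the server answered with an HTTP error.

    fetch_html and the opener parameter are hypothetical names for this sketch.
    """
    try:
        return opener(game_url).read()
    except HTTPError as err:
        # err.code is the numeric status (for example 403 or 404).
        # Fall back to err.msg in case this Python build has no 'reason' attribute.
        reason = getattr(err, "reason", err.msg)
        print("Skipping %s: HTTP %s %s" % (game_url, err.code, reason))
        return None
```

In the loop from the question you would then call fetch_html(game_url) and only hand the result to BeautifulSoup when it is not None.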

Editorial: It is possible that the website you were scraping figured out that you're a robot. You might want to take a moment to rewrite your scraper with a more server-friendly library that also has a vastly better API. urllib2 is fine for one-off tasks, but it has numerous shortcomings that I won't get into here. Possibly superior libraries to look at are requests, mechanize, and maybe httplib2. All have upsides and downsides, so I can't tell you which one is right for your needs.

You may also want to look at the User-Agent header you're sending with your requests, since if you self-identify as a robot, the server may decide to block you.
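For reference, urllib2 identifies itself as "Python-urllib/2.7" unless told otherwise. A minimal sketch of sending your own header follows; build_request and the USER_AGENT string are placeholders I made up, and whether a different header actually changes how this particular site responds is an assumption you would have to test.

```python
try:
    import urllib2  # Python 2, as in the question
except ImportError:
    import urllib.request as urllib2  # Python 3 fallback

# Hypothetical value: replace it with something that describes your client honestly.
USER_AGENT = "example-scraper/0.1 (contact: you@example.com)"


def build_request(game_url, user_agent=USER_AGENT):
    """Build a Request carrying an explicit User-Agent instead of the default one."""
    return urllib2.Request(game_url, headers={"User-Agent": user_agent})
```

You would then pass the result to urlopen in place of the bare URL, for example urllib2.urlopen(build_request(game_url)).read().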

Source: https://stackoverflow.com/questions/17712282/httperror-when-using-urllib2-read

Tags

python

urllib2

http-error