Question
I want to crawl a webpage using the urllib2 library and extract some information according to my needs. I can freely navigate the site in a browser (going from one link to another and so on), but when I try to parse it with urllib2 I get an error:
HTTP Error 503: Service Temporarily Unavailable
I searched the net and found that this error occurs when the "website's server is not available at that time".
This confuses me: if the website's server is down, how come the site is up and running (since I can navigate its pages)? And if the server is not down, why am I getting this 503 error?
Is it possible that the server has done something to prevent the page from being parsed?
Thanks in advance.
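For reference, my code is roughly the following (the URL here is a placeholder for the actual site):

    import urllib2

    url = "http://example.com/some-page"  # placeholder for the real site
    response = urllib2.urlopen(url)       # raises urllib2.HTTPError: HTTP Error 503 here
    html = response.read()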
Answer 1:
Most probably your user-agent is banned by the server, precisely so as to keep out, well, web crawlers. Some websites, including Wikipedia, return a 50x error when an unwanted user-agent (such as wget, curl, urllib, …) is used.
However, simply changing the user-agent might be enough. At least, that's the case for Wikipedia, which works just fine with a Firefox user-agent. (The "ban" most probably relies only on the user-agent.)
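With urllib2 you can set the header yourself; a minimal sketch (the URL and the user-agent string here are placeholders, any common browser string should do):

    import urllib2

    url = "http://example.com/some-page"  # placeholder URL
    # Pretend to be a regular browser instead of urllib2's default user-agent.
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0"}
    request = urllib2.Request(url, headers=headers)
    html = urllib2.urlopen(request).read()

The PS below gives the exact Firefox string I use for Wikipedia.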
Finally, there is usually a reason those websites ban web crawlers. Depending on what you're working on, you might want to use another solution. For example, Wikipedia provides database dumps, which can be convenient if you intend to make intensive use of its content.
PS. Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.11) Gecko/20101012 Firefox/3.6.11
is the user-agent I use for Wikipedia in a project of mine.
Source: https://stackoverflow.com/questions/17386969/website-is-up-and-running-but-parsing-it-results-in-http-error-503