Website is up and running but parsing it results in HTTP Error 503

Submitted by 狂风中的少年 on 2019-12-11 05:38:06

Question


I want to crawl a webpage using the urllib2 library and extract some information according to my needs. I am able to freely navigate the site (going from one link to another and so on), but when I try to parse it I get an error:

HTTP Error 503: Service Temporarily Unavailable

I searched for it on the net and found out that this error occurs when the "website's server is not available at that time".

I am confused after reading this: if the website's server is down, how come the site is up and running (since I am able to navigate the webpage), and if the server is not down, why am I getting this 503 error?

Is there a possibility that the server has done something to prevent the parsing of the web page?
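For reference, a minimal sketch (Python 2 / urllib2, with a hypothetical URL) of catching the error and inspecting what the server actually returns, which can help tell a blocked request apart from a genuinely unavailable server:

import urllib2

try:
    # hypothetical URL, not from the original question
    response = urllib2.urlopen("http://example.com/some-page")
    html = response.read()
except urllib2.HTTPError as e:
    # The server did respond, but with an error status; a 503 here often means
    # this particular request was rejected, not that the whole site is down.
    print e.code      # e.g. 503
    print e.headers   # response headers may hint at rate limiting or blocking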

Thanks in advance.


Answer 1:


Most probably your user-agent is banned by the server, so as to avoid, well, web crawlers. That is why some websites, including Wikipedia, return a 50x error when an unwanted user-agent (such as wget, curl, urllib, …) is used.

However, changing the user-agent might be enough. At least, that's the case for Wikipedia, which works just fine with a Firefox user-agent. (The "ban" most probably relies only on the user-agent.)
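A minimal sketch of how that might look with urllib2 (the URL is a placeholder; the user-agent string is the one mentioned in the PS below):

import urllib2

url = "https://en.wikipedia.org/wiki/Web_crawler"  # example URL, substitute your own
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; "
                  "rv:1.9.2.11) Gecko/20101012 Firefox/3.6.11"
}

# Send the request with a browser-like User-Agent instead of urllib2's default
request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)
html = response.read()
print html[:200]  # first 200 bytes of the page, just to confirm it worked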

Finally, there must be a reason for those websites to ban web crawlers. Depending on what you're working on, you might want to use another solution. For example, Wikipedia provides database dumps, which can be convenient if you intend to make intensive use of it.

PS. Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.11) Gecko/20101012 Firefox/3.6.11 is the user-agent I use for Wikipedia in a project of mine.



Source: https://stackoverflow.com/questions/17386969/website-is-up-and-running-but-parsing-it-results-in-http-error-503
