Screen scraping: getting around “HTTP Error 403: request disallowed by robots.txt”

萝らか妹 提交于 2019-12-17 21:44:11

问题


Is there a way to get around the following?

httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

Is the only way around this to contact the site-owner (barnesandnoble.com).. i'm building a site that would bring them more sales, not sure why they would deny access at a certain depth.

I'm using mechanize and BeautifulSoup on Python2.6.

hoping for a work-around


回答1:


You can try lying about your user agent (e.g., by trying to make believe you're a human being and not a robot) if you want to get in possible legal trouble with Barnes & Noble. Why not instead get in touch with their business development department and convince them to authorize you specifically? They're no doubt just trying to avoid getting their site scraped by some classes of robots such as price comparison engines, and if you can convince them that you're not one, sign a contract, etc, they may well be willing to make an exception for you.

A "technical" workaround that just breaks their policies as encoded in robots.txt is a high-legal-risk approach that I would never recommend. BTW, how does their robots.txt read?




回答2:


oh you need to ignore the robots.txt

br = mechanize.Browser()
br.set_handle_robots(False)



回答3:


Mechanize automatically follows robots.txt, but it can be disabled assuming you have permission, or you have thought the ethics through ..

Set a flag in your browser:

browser.set_handle_equiv(False) 

This ignores robots.txt.

Also, make sure you throttle your requests, so you don't put too much load on their site. (Note, this also makes it less likely that they will detect and ban you).




回答4:


The code to make a correct request:

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
resp = br.open(url)
print resp.info()  # headers
print resp.read()  # content



回答5:


The error you're receiving is not related to the user agent. mechanize by default checks robots.txt directives automatically when you use it to navigate to a site. Use the .set_handle_robots(false) method of mechanize.browser to disable this behavior.




回答6:


Set your User-Agent header to match some real IE/FF User-Agent.

Here's my IE8 useragent string:

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; AskTB5.6)



回答7:


Without debating the ethics of this you could modify the headers to look like the googlebot for example, or is the googlebot blocked as well?




回答8:


As it seems, you have to do less work to bypass robots.txt, at least says this article. So you might have to remove some code to ignore the filter.



来源:https://stackoverflow.com/questions/2846105/screen-scraping-getting-around-http-error-403-request-disallowed-by-robots-tx

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!