Screen scraping: getting around “HTTP Error 403: request disallowed by robots.txt”

Asked by 借酒劲吻你 on 2020-12-12 17:15

Is there a way to get around the following?

httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

Is the only way around this to contact the site owner?
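
For context, this 403 is raised client-side by mechanize itself, which fetches and honours robots.txt by default. A minimal reproduction, assuming a placeholder URL whose path the site's robots.txt disallows:

    import mechanize

    br = mechanize.Browser()  # robots.txt handling is enabled by default
    br.open('http://example.com/some-disallowed-page')  # hypothetical URL
    # raises: httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt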

8 Answers
  • 2020-12-12 17:51

    Without debating the ethics of this: you could modify the request headers to look like Googlebot's, for example (or is Googlebot blocked as well?). A sketch of that idea follows below.

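    A minimal sketch of that suggestion, assuming the question's mechanize setup; the URL and the Googlebot User-Agent string are illustrative:

    import mechanize

    br = mechanize.Browser()
    # Present Googlebot's User-Agent; mechanize then checks robots.txt
    # against the rules for Googlebot rather than its default identity.
    br.addheaders = [('User-agent',
                      'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)')]
    resp = br.open('http://example.com/')  # placeholder URL
    print(resp.read())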
  • 2020-12-12 17:53

    The code to make a correct request:

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)  # don't fetch or honour robots.txt at all
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    resp = br.open(url)
    print(resp.info())  # response headers
    print(resp.read())  # page content
    