Screen scraping: getting around “HTTP Error 403: request disallowed by robots.txt”

借酒劲吻你 2020-12-12 17:15

Is there a way to get around the following?

httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

Is the only way around

8 answers
  •  不知归路
    2020-12-12 17:53

    To make the request go through, turn off mechanize's robots.txt handling and send a browser-like User-Agent header:

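    import mechanize

    url = "http://example.com/"  # placeholder (assumption): the page you want to fetch

    br = mechanize.Browser()
    br.set_handle_robots(False)  # do not fetch or obey the site's robots.txt
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    resp = br.open(url)
    print(resp.info())  # headers
    print(resp.read())  # content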
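
As far as I know, this 403 is raised by mechanize itself rather than by the server: by default it reads the site's robots.txt and refuses any URL the file disallows. If you would rather check the rules than switch the check off, here is a minimal sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `my-scraper` agent name, the example.com URLs and the `is_allowed` helper are all made up for illustration; they are not from the question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no HTTP request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "my-scraper" and the example.com URLs are placeholders.
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "http://example.com/index.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "http://example.com/private/data"))
```

With the sample rules above, the first call prints True and the second prints False. Against a live site you would instead call `set_url()` with the site's robots.txt address and then `read()` before `can_fetch()`.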
    
