Web Crawler - Ignore Robots.txt file?

借酒劲吻你 2020-12-31 07:34

Some servers have a robots.txt file that asks web crawlers not to crawl some or all of their websites. Is there a way to make a web crawler ignore the robots.txt file? I am using Mechanize in Python.

2 Answers
  • 2020-12-31 07:55

    The documentation for mechanize has this sample code:

    import mechanize

    br = mechanize.Browser()
    # Ignore robots.txt.  Do not do this without thought and consideration.
    br.set_handle_robots(False)
    

    That does exactly what you want.
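
    For a fuller picture, here is a minimal end-to-end sketch; the target URL
    and the User-Agent string are placeholders, not part of the original answer:

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)  # skip the robots.txt check entirely
    # Many sites also reject mechanize's default User-Agent, so set a common one.
    br.addheaders = [("User-agent", "Mozilla/5.0")]
    response = br.open("https://example.com")  # placeholder URL
    print(response.read()[:200])  # first 200 bytes of the fetched page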

  • 2020-12-31 08:05

    This looks like what you need:

    from mechanize import Browser
    br = Browser()
    
    # Ignore robots.txt
    br.set_handle_robots(False)
    

    assuming you know what you're doing…
