How to prevent unauthorized spidering

刺人心 2021-02-06 04:59

I want to prevent automated HTML scraping on one of our sites while not affecting legitimate spidering (Googlebot, etc.). Is there something that already exists to accomplish this?

6 Answers
  •  悲哀的现实
    2021-02-06 05:38

    One approach is to set up an HTTP tar pit: embed a link that only automated crawlers will follow (for example, hidden with CSS so human visitors never see it). The link should lead to a page stuffed with random text and links back into the pit under ever-changing URLs (/tarpit/foo.html, /tarpit/bar.html, /tarpit/baz.html), with a single script at /tarpit/ answering every one of those requests with a 200 result.
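    A minimal sketch of such a handler, assuming a Flask app; the /tarpit/ route, the filler word list, and the link count are hypothetical choices for illustration, not a hardened implementation.

        import random

        from flask import Flask

        app = Flask(__name__)

        WORDS = ["lorem", "ipsum", "dolor", "sit", "amet"]  # hypothetical filler vocabulary

        @app.route("/tarpit/", defaults={"page": "index.html"})
        @app.route("/tarpit/<path:page>")
        def tarpit(page):
            # Every URL under /tarpit/ returns 200 with random text plus
            # fresh links back into the pit, so a naive scraper never runs dry.
            text = " ".join(random.choices(WORDS, k=200))
            links = " ".join(
                '<a href="/tarpit/{}.html">more</a>'.format(random.randint(0, 10**9))
                for _ in range(10)
            )
            return "<html><body><p>{}</p><p>{}</p></body></html>".format(text, links)

    The luring link itself can be as simple as <a href="/tarpit/foo.html" style="display:none">archive</a> on a normal page.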

    To keep the good guys out of the pit, generate a 302 redirect to your home page if the user agent belongs to Google or Yahoo, as sketched below.
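    A hedged sketch of that check, meant to run at the top of the tar-pit handler above; the User-Agent substrings are examples only, and since User-Agent headers are trivially spoofed this is a courtesy for well-behaved bots, not a security control.

        from flask import redirect, request

        KNOWN_GOOD = ("googlebot", "yahoo! slurp")  # example substrings only

        def escape_hatch():
            # 302 known-good crawlers back to the home page before
            # they wander any deeper into the pit.
            ua = request.headers.get("User-Agent", "").lower()
            if any(bot in ua for bot in KNOWN_GOOD):
                return redirect("/", code=302)
            return None

    Call escape_hatch() first inside tarpit() and return its result whenever it is not None.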

    It isn't perfect, but it will at least slow down the naive ones.

    EDIT: As suggested by Constantin, you could mark the tar pit as off-limits in robots.txt. The good guys use web spiders that honor this protocol, so they will stay out of the tar pit. This would probably remove the need to generate redirects for known-good crawlers.
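    The corresponding robots.txt entry would look something like this (the /tarpit/ prefix matches the hypothetical route above):

        User-agent: *
        Disallow: /tarpit/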
