Detecting 'stealth' web-crawlers

小鲜肉 2020-11-28 00:15

What options are there to detect web-crawlers that do not want to be detected?

(I know that listing detection techniques will allow the smart stealth-crawler programmer to make a better spider, but I do not think we will ever be able to block smart stealth-crawlers anyway, only the ones that make mistakes.)

11 Answers
  •  再見小時候
    2020-11-28 01:11

    People keep addressing broad crawlers but not crawlers that are specialized for your website.

    I write stealth crawlers, and if they are built individually for a site, no amount of honeypots or hidden links will have any effect whatsoever; the only real way to detect specialised crawlers is by inspecting connection patterns.

    The best systems (e.g. LinkedIn's) use AI to address this.
    The easiest solution is to write log parsers that analyze IP connection patterns and simply blacklist offending IPs or serve them a CAPTCHA, at least temporarily.

    e.g.
    if IP X is seen connecting to foo.com/cars/*.html every 2 seconds but to no other pages, it is most likely a bot or a hungry power user.
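
    A minimal sketch of such a log parser in Python: it assumes the common nginx/Apache "combined" access-log format, and the thresholds, the "one section of the site" heuristic and the access.log path are illustrative, not anything I run in production. It flags IPs that hit a single section at a steady, high rate.

        import re
        from collections import Counter, defaultdict
        from datetime import datetime

        # Assumes the nginx/Apache "combined" log format; capture only the
        # fields we need (client IP, timestamp, request path).
        LOG_LINE = re.compile(
            r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)'
        )

        def parse(line):
            m = LOG_LINE.match(line)
            if not m:
                return None
            ts = datetime.strptime(m.group("time").split()[0], "%d/%b/%Y:%H:%M:%S")
            return m.group("ip"), ts, m.group("path")

        def top_section(path):
            # "/cars/123.html" -> "cars"
            parts = path.split("/")
            return parts[1] if len(parts) > 1 else ""

        def suspicious_ips(lines, min_hits=50, max_mean_interval=3.0, min_focus=0.9):
            """Flag IPs that hammer one section of the site at a steady rate."""
            times = defaultdict(list)        # ip -> [timestamps]
            sections = defaultdict(Counter)  # ip -> hits per first path segment
            for line in lines:
                parsed = parse(line)
                if parsed is None:
                    continue
                ip, ts, path = parsed
                times[ip].append(ts)
                sections[ip][top_section(path)] += 1

            flagged = []
            for ip, stamps in times.items():
                if len(stamps) < min_hits:
                    continue
                stamps.sort()
                mean_interval = (stamps[-1] - stamps[0]).total_seconds() / (len(stamps) - 1)
                focus = sections[ip].most_common(1)[0][1] / len(stamps)
                # e.g. a request every ~2 s, >90% of them under /cars/ -> likely a bot
                if mean_interval <= max_mean_interval and focus >= min_focus:
                    flagged.append(ip)
            return flagged

        if __name__ == "__main__":
            with open("access.log") as f:    # path is illustrative
                for ip in suspicious_ips(f):
                    print("candidate for blacklist / CAPTCHA:", ip)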

    Alternatively, there are various JavaScript challenges that act as protection (e.g. Cloudflare's anti-bot system), but those are easily solvable; you can also write something custom, which may be enough of a deterrent to make crawling your site not worth the effort.
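
    For illustration only, a rough sketch of the server side of such a homegrown challenge, written here against Flask; the cookie names, signing scheme and inline script are assumptions of the sketch, not Cloudflare's mechanism, and as said above a determined crawler can bypass it by parsing the token out of the HTML.

        import hashlib
        import hmac
        import os
        import time

        from flask import Flask, make_response, request

        app = Flask(__name__)
        SECRET = os.urandom(32)   # per-process secret used to sign challenge nonces

        def sign(nonce: str) -> str:
            return hmac.new(SECRET, nonce.encode(), hashlib.sha256).hexdigest()

        # The inline script echoes the signed token back as a cookie and reloads;
        # clients that never execute (or at least emulate) JavaScript stop here.
        CHALLENGE_PAGE = """<script>
          document.cookie = "js_token={token}; path=/";
          location.reload();
        </script>"""

        @app.before_request
        def require_js_token():
            nonce = request.cookies.get("js_nonce")
            token = request.cookies.get("js_token")
            if nonce and token and hmac.compare_digest(token, sign(nonce)):
                return None   # challenge already passed, serve the real page
            new_nonce = f"{int(time.time())}-{os.urandom(8).hex()}"
            resp = make_response(CHALLENGE_PAGE.format(token=sign(new_nonce)))
            resp.set_cookie("js_nonce", new_nonce)
            return resp

        @app.route("/cars/<page>")
        def cars(page):
            return f"car listing {page}"   # stand-in for the real content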

    However, you must ask yourself whether you are willing to false-positive legitimate users and inconvenience them in order to block bot traffic. Protecting public data is an impossible paradox.
