Protection from Web Scraping

Submitted by 元气小坏坏 on 2019-12-03 03:36:58

The main strategies for preventing this are:

  • require registration, so you can limit the requests per user
  • captchas for registration and non-registered users
  • rate limiting for IPs
  • require JavaScript - writing a scraper that can read JS is harder
  • robots blocking, and bot detection (e.g. request rates, hidden link traps)
  • data poisoning. Insert decoy books and links that no real user would want, so that bots which blindly collect everything stall on them.
  • mutation. Frequently change your templates, so that scrapers fail to find the desired content.
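To illustrate the rate-limiting point above, here is a minimal sketch of a per-IP fixed-window counter. The names (`allow_request`, the window and limit constants) are hypothetical, and a production system would persist the counters rather than keep them in memory:

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 60   # length of each rate-limit window
MAX_REQUESTS = 30     # requests allowed per IP per window

# ip -> (window_start_timestamp, request_count)
_counters = defaultdict(lambda: (0.0, 0))

def allow_request(ip, now=None):
    """Return True if this IP is still under its limit for the current window."""
    now = time.time() if now is None else now
    window_start, count = _counters[ip]
    if now - window_start >= WINDOW_SECONDS:
        # Window expired: start a fresh one for this IP.
        _counters[ip] = (now, 1)
        return True
    if count < MAX_REQUESTS:
        _counters[ip] = (window_start, count + 1)
        return True
    return False
```

A fixed window is the simplest variant; a sliding window or token bucket smooths out the burst that a fixed window allows at each window boundary.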

Note that you can use captchas very flexibly.

For example: the first book for each IP each day is served without a captcha, but accessing a second book requires solving one.
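That "first book free, captcha afterwards" policy can be sketched as a small per-IP daily counter. This is an in-memory illustration with hypothetical names; a real deployment would back it with a database or cache:

```python
import datetime
from collections import defaultdict

# ip -> (date, books_served_that_day)
_daily_books = defaultdict(lambda: (None, 0))

def needs_captcha(ip, today=None):
    """First book per IP per day is captcha-free; later books require a captcha."""
    today = today or datetime.date.today()
    day, served = _daily_books[ip]
    if day != today:
        # New day: reset this IP's counter.
        day, served = today, 0
    _daily_books[ip] = (day, served + 1)
    return served >= 1  # True from the second book onward
```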

Since you found that many of the items listed by Anony-Mousse don't solve your problem, I wanted to come in and suggest an alternative. Have you explored third-party platforms that offer web scraping protection as a service? I'm going to list some of the solutions available on the market and try to group them. For full disclosure, I am one of the co-founders of Distil Networks, one of the companies I am listing.

Web Scraping protection as a core competency:

  • Distil Networks
  • Sentor Assassin

Web Scraping protection as a feature in a larger product suite:

My opinion is that companies which try to solve the bot problem as a feature don't do it effectively. It's just not their core competency, and many loopholes exist.

  • Akamai Kona
  • F5 ASM module to the BigIP loadbalancer
  • Imperva Web Application Firewall appliance
  • Incapsula, Imperva's cloud Web Application Firewall

It might also be helpful to talk about some of the pitfalls of the points mentioned:

  • captchas for registration and non-registered users: captchas have been proven to be ineffective thanks to OCR software and captcha farms
  • rate limiting for IPs: this can have a very high false-positive rate, since it lumps together users behind a shared IP, and it can miss many bots that simply rotate or anonymize the IPs they use
  • require JavaScript: Selenium, PhantomJS, and dozens of other scraping tools render JavaScript
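One countermeasure from the first answer that still holds up reasonably well is the "hidden link trap": a link that is invisible to human visitors (e.g. hidden via CSS) but followed by bots that crawl every URL. A minimal server-side sketch, with hypothetical names, might be:

```python
# Hypothetical trap URL: linked from every page but hidden with CSS,
# so ordinary users never request it. Anything that does is flagged.
TRAP_PATH = "/trap-do-not-follow"

_flagged_ips = set()

def check_honeypot(ip, path):
    """Flag an IP as a bot if it ever requests the hidden trap link."""
    if path == TRAP_PATH:
        _flagged_ips.add(ip)
    return ip in _flagged_ips
```

Remember to exclude the trap URL in robots.txt so that well-behaved crawlers like Googlebot are not flagged.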