Prevent automated tools from accessing the website

Submitted by 孤街浪徒 on 2019-12-24 03:36:20

Question


The data on our website can easily be scraped. How can we detect whether a human is viewing the site or a tool?

One idea is to measure how long a user stays on a page, but I do not know how to implement that. Can anyone help me detect and prevent automated tools from scraping data from my website?

I use a security image on the login page, but even then a human can log in and hand the session over to an automated tool. And when a reCAPTCHA appears after some interval, the user can solve it and then let the automated tool continue scraping data.

I once developed a tool to scrape another site myself, so I just want to prevent the same thing from happening to mine!


Answer 1:


DON'T do it.

It's the web; you will not be able to stop someone from scraping your data if they really want it. I've done it many, many times and gotten around every restriction put in place. In fact, having a restriction in place only motivates me further to get at the data.

The more you lock down your system, the worse you make the experience for legitimate users. It's just a bad idea.




Answer 2:


It's the web. You need to assume that anything you put out there can be read by human or machine. Even if you can prevent it today, someone will figure out how to bypass it tomorrow. Captchas have been broken for some time now, and sooner or later, so will the alternatives.

However, here are some ideas for the time being.

And here are a few more.

And my favorite: one clever site I've run across asks a question like "On our 'about us' page, what is the street name of our support office?" It takes a human to find the "About Us" page (the link doesn't actually say "about us"; it says something close enough that a person would figure it out), and then to find the support office address (which is different from the main corporate office and several others listed on the page) you have to read through several candidates. Current computer technology can't work that out any more than it can manage true speech recognition or cognition.

A Google search for "captcha alternatives" turns up quite a bit.




Answer 3:


This can't be done without risking false positives (and annoying users).

How can we detect whether a human is viewing the site or a tool?

You can't. How would you handle tools that parse a page on behalf of a human, like screen readers and other accessibility tools?

For example, one way is by measuring how long a user stays on a page, from which we can detect whether a human is involved. I do not know how to implement that but am just thinking about this method. Can anyone help me detect and prevent automated tools from scraping data from my website?

You won't detect automated tools, only unusual behavior, and before you can define unusual behavior you have to know what's usual. People view pages in different orders, browser tabs let them work on several things in parallel, etc.




Answer 4:


I guess the only good solution is to limit the rate at which the data can be accessed. It may not completely prevent scraping, but at least you can limit how fast automated tools can work, hopefully to below a level that makes scraping the data worthwhile.
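
A minimal sketch of per-IP rate limiting, assuming an Express server; the window size, request limit, and in-memory store are all illustrative choices, not part of the original answer:

```typescript
import express from "express";

const app = express();

// Illustrative limits: at most 30 requests per IP per minute.
const WINDOW_MS = 60_000;
const MAX_REQUESTS = 30;

// Naive in-memory counters; a real deployment would use Redis or similar.
const hits = new Map<string, { count: number; windowStart: number }>();

app.use((req, res, next) => {
  const ip = req.ip ?? "unknown";
  const now = Date.now();
  const entry = hits.get(ip);

  if (!entry || now - entry.windowStart > WINDOW_MS) {
    // First request from this client, or its previous window has expired.
    hits.set(ip, { count: 1, windowStart: now });
    return next();
  }

  entry.count += 1;
  if (entry.count > MAX_REQUESTS) {
    // Over the limit: refuse with 429 until the window rolls over.
    return res.status(429).send("Too many requests; slow down.");
  }
  next();
});

app.get("/", (_req, res) => res.send("Hello"));
app.listen(3000);
```

Note that per-IP counting is easily diluted by scrapers rotating through proxies, which is why this discourages rather than prevents.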




Answer 5:


I should note up front that where there's a will, there's a way.

That being said, I thought about what you asked and here are some simple things I came up with:

  1. Simple, naive checks such as user-agent filtering (see the sketch after this list). You can find a list of common crawler user agents here: http://www.useragentstring.com/pages/Crawlerlist/

  2. You can always display your data in Flash, though I do not recommend it.

  3. Use a captcha.
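
A minimal sketch of the user-agent filtering from item 1, assuming an Express server; the denylist here is a tiny illustrative sample, not the full list from useragentstring.com:

```typescript
import express from "express";

const app = express();

// Illustrative denylist of substrings that appear in common crawler
// user agents. The User-Agent header is client-supplied, so this only
// stops honest bots; anyone can send a browser-like string instead.
const BLOCKED_AGENTS = ["bot", "crawler", "spider", "curl", "wget"];

app.use((req, res, next) => {
  const ua = (req.headers["user-agent"] ?? "").toLowerCase();
  if (BLOCKED_AGENTS.some((token) => ua.includes(token))) {
    return res.status(403).send("Automated clients are not allowed.");
  }
  next();
});
```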

Other than that, I'm not really sure if there's anything else you can do but I would be interested in seeing the answers as well.

EDIT:

Google does something interesting: if you search for SSNs, after the 50th page or so they serve a captcha. That raises the question of whether you can intelligently track how long a user spends on your site, or, if you bring pagination into the equation, how long a user spends on a single page.

Using that information, you can enforce a minimum delay before the next HTTP request is accepted. At that point it might also help to present a captcha "randomly": one HTTP request goes through fine, but the next one requires a captcha. You can mix those up as you please.
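
A minimal sketch of that idea, assuming an Express app with express-session; the thresholds, the 10% probability, and the /captcha route are all illustrative placeholders:

```typescript
import express from "express";
import session from "express-session";

const app = express();
app.use(session({ secret: "change-me", resave: false, saveUninitialized: true }));

// Illustrative thresholds: always challenge after 50 requests in a session,
// and challenge "randomly" (10% of the time) once a session passes 20.
const HARD_LIMIT = 50;
const SOFT_LIMIT = 20;

app.use((req, res, next) => {
  // Per-session request counter (the field name is made up for this sketch).
  const sess = req.session as typeof req.session & { requests?: number };
  sess.requests = (sess.requests ?? 0) + 1;

  const challenge =
    sess.requests > HARD_LIMIT ||
    (sess.requests > SOFT_LIMIT && Math.random() < 0.1);

  if (challenge) {
    // "/captcha" is a placeholder for whatever captcha flow you use; on
    // success it should reset sess.requests so the session starts fresh.
    return res.redirect("/captcha?return=" + encodeURIComponent(req.originalUrl));
  }
  next();
});
```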




Answer 6:


Scrapers steal the data from your website by requesting URLs and parsing the source code of your pages. The following steps can at least make scraping a bit more difficult, if not impossible.

Load the data with Ajax requests. This makes the page source harder to parse and forces the scraper to do extra work to discover the URLs that actually serve the data.
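
A minimal sketch of this, assuming an Express server; the /api/listings endpoint and the sample data are invented for illustration:

```typescript
import express from "express";

const app = express();

// The HTML shell contains no data, only a script that fetches it after load.
app.get("/", (_req, res) => {
  res.send(`<div id="list"></div>
    <script>
      fetch("/api/listings")
        .then((r) => r.json())
        .then((items) => {
          document.getElementById("list").textContent =
            items.map((i) => i.title).join(", ");
        });
    </script>`);
});

app.get("/api/listings", (_req, res) => {
  // A scraper must now discover and replay this request instead of
  // simply parsing the original page source.
  res.json([{ id: 1, title: "Example item" }]);
});

app.listen(3000);
```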

Use cookies even for normal pages that don't require authentication: set a cookie when the user visits the home page and require it on all inner pages. This makes scraping a bit more difficult.
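
A minimal sketch of that cookie gate, assuming Express with cookie-parser; the cookie name and routes are illustrative:

```typescript
import express from "express";
import cookieParser from "cookie-parser";

const app = express();
app.use(cookieParser());

// Issue a token when the visitor first hits the home page.
app.get("/", (_req, res) => {
  res.cookie("visited", "1", { httpOnly: true });
  res.send("Welcome");
});

// Inner pages refuse clients that never passed through the home page.
app.use("/inner", (req, res, next) => {
  if (req.cookies.visited !== "1") {
    return res.redirect("/");
  }
  next();
});

app.get("/inner/data", (_req, res) => res.send("Protected-ish content"));
app.listen(3000);
```

A scraper can of course record and replay the cookie, so this only filters out the most naive tools.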

Serve the content in encrypted (really, encoded) form and decrypt it at load time with JavaScript. I have seen this on a couple of websites.
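
A minimal browser-side sketch of that idea, with Base64 standing in for whatever encoding scheme such a site uses; the data-encoded attribute name is invented for this example:

```typescript
// Runs in the browser. The server emits, e.g.,
//   <span data-encoded="SGVsbG8sIHdvcmxkIQ=="></span>
// instead of plain text, and this script fills the text in after load.
// Since the decoder ships with the page, this is obfuscation rather than
// real encryption: it only raises the effort bar for a scraper.
window.addEventListener("DOMContentLoaded", () => {
  document.querySelectorAll<HTMLElement>("[data-encoded]").forEach((el) => {
    el.textContent = atob(el.dataset.encoded ?? ""); // ASCII-only for brevity
  });
});
```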



Source: https://stackoverflow.com/questions/3518914/prevent-automated-tools-from-accessing-the-website
