How to stop search engines from crawling the whole website?

Question

I want to stop search engines from crawling my whole website.

I have a web application for members of a company to use. This is hosted on a web server so that the employees of the company can access it. No one else (the public) would need it or find it useful.

So I want to add another layer of security (in theory) to help prevent unauthorized access by keeping all search engine bots/crawlers away from it. Having Google index our site to make it searchable is pointless from a business perspective, and it just gives a hacker another way to find the website in the first place and try to hack it.

I know in the robots.txt you can tell search engines not to crawl certain directories.

Is it possible to tell bots not to crawl the whole site without having to list all the directories not to crawl?

Is this best done with robots.txt or is it better done by .htaccess or other?

Answer 1:

This is best handled with a robots.txt file, but it only works for bots that respect the file.

To block the whole site add this to robots.txt in the root directory of your site:

User-agent: *
Disallow: /
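
If you want to sanity-check those two lines before deploying them, here is a minimal sketch using Python's standard urllib.robotparser module. It parses the rules locally without touching the network. The is_allowed helper and the bot names are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt lines from the answer above.
BLOCK_ALL_RULES = ["User-agent: *", "Disallow: /"]


def is_allowed(rules, user_agent, path):
    """Return True if a well-behaved bot may fetch `path` under `rules`."""
    parser = RobotFileParser()
    parser.parse(rules)  # parse in-memory lines, no HTTP request is made
    return parser.can_fetch(user_agent, path)


# Bot names and paths here are arbitrary examples.
print(is_allowed(BLOCK_ALL_RULES, "Googlebot", "/"))                # False
print(is_allowed(BLOCK_ALL_RULES, "SomeOtherBot", "/members/home"))  # False
```

This only tells you what a compliant crawler would do. A bot that ignores robots.txt is not affected by it at all.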

To limit access to your site for everyone else, .htaccess is better, but you would need to define access rules, by IP address for example.

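Below are .htaccess rules that block everyone except requests coming from your company's IP address:

# Apache 2.2 syntax; on Apache 2.4 use "Require ip <address>" instead (or enable mod_access_compat)
# Deny rules are evaluated first, then Allow rules override them
Order deny,allow
Deny from all
# Replace this placeholder with your company's IP address
Allow from 255.1.1.1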
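
The idea behind an IP rule like this is "deny by default, allow a known address or range". As a rough illustration of that logic only (not of how Apache implements it), here is a sketch using Python's standard ipaddress module. The ALLOWED_NETWORKS list and the is_allowed_client name are hypothetical, and 203.0.113.0/24 is a range reserved for documentation, so substitute your own.

```python
import ipaddress

# Hypothetical allow-list; 203.0.113.0/24 is a documentation-only range.
ALLOWED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def is_allowed_client(client_ip):
    """Deny by default; allow only addresses inside an allow-listed network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


print(is_allowed_client("203.0.113.10"))  # True
print(is_allowed_client("198.51.100.7"))  # False
```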

Answer 2:

Using robots.txt to keep a site out of search engine indexes has one minor and little-known problem: if anyone ever links to your site from any page indexed by Google (which would have to happen for Google to find your site anyway, robots.txt or not), Google may still index the link and show it as part of their search results, even if you don't allow them to fetch the page the link points to.

If this might be a problem for you, the solution is to not use robots.txt, but instead to include a robots meta tag with the value noindex,nofollow on every page on your site. You can even do this in a .htaccess file using mod_headers and the X-Robots-Tag HTTP header:

Header set X-Robots-Tag noindex,nofollow

This directive will add the header X-Robots-Tag: noindex,nofollow to every page it applies to, including non-HTML pages like images. Of course, you may want to include the corresponding HTML meta tag too, just in case (it's an older standard, and so presumably more widely supported):

<meta name="robots" content="noindex,nofollow" />

Note that if you do this, Googlebot will still try to crawl any links it finds to your site, since it needs to fetch the page before it sees the header / meta tag. Of course, some might well consider this a feature instead of a bug, since it lets you look in your access logs to see if Google has found any links to your site.
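
To confirm that a page really carries the directive, you could inspect both the response headers and the HTML. Below is a minimal sketch using only the standard library. It assumes you already have the headers as a dict and the body as a string (how you fetch them is up to you), and the robots_directives name is made up for this example.

```python
from html.parser import HTMLParser


def _split_directives(value):
    """Split a value like 'noindex, nofollow' into a set of lowercase tokens."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class _RobotsMetaParser(HTMLParser):
    """Collects the content of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            self.directives |= _split_directives(attributes.get("content") or "")


def robots_directives(headers, html):
    """Return the robots directives found in the X-Robots-Tag header and meta tags."""
    found = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            found |= _split_directives(value)
    parser = _RobotsMetaParser()
    parser.feed(html)
    return found | parser.directives


page = '<html><head><meta name="robots" content="noindex,nofollow" /></head></html>'
print("noindex" in robots_directives({"X-Robots-Tag": "noindex,nofollow"}, page))  # True
```

Checking both places is useful because the header also covers non-HTML responses such as images, where a meta tag is not possible.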

In any case, whatever you do, keep in mind that it's hard to keep a "secret" site secret very long. As time passes, the probability that one of your users will accidentally leak a link to the site approaches 100%, and if there's any reason to assume that someone would be interested in finding the site, you should assume that they will. Thus, make sure you also put proper access controls on your site, keep the software up to date and run regular security checks on it.

Answer 3:

If security is your concern, and locking down to IP addresses isn't viable, you should look into requiring your users to authenticate in some way to access your site.

That would mean that anyone (Google, a bot, a person who stumbled upon a link) who isn't authenticated wouldn't be able to access your pages.

You could bake it into your website itself, or use HTTP Basic Authentication.

https://www.httpwatch.com/httpgallery/authentication/
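
For a feel of what HTTP Basic Authentication involves on the server side, here is a minimal sketch of validating an Authorization header in Python. The is_authorized function and the hard-coded credentials are hypothetical and for illustration only. In practice you would normally let the web server or framework handle this, store only password hashes, and serve the site over HTTPS, since Basic credentials are merely base64-encoded, not encrypted.

```python
import base64
import binascii
import hmac

# Hypothetical credentials for illustration; never hard-code real ones.
EXPECTED_USER = "employee"
EXPECTED_PASSWORD = "example-password"


def is_authorized(authorization_header):
    """Return True only for a valid 'Basic <base64(user:password)>' header."""
    if not authorization_header:
        return False
    scheme, _, encoded = authorization_header.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return False
    user, separator, password = decoded.partition(":")
    if not separator:
        return False
    # compare_digest avoids leaking information through comparison timing.
    user_ok = hmac.compare_digest(user.encode(), EXPECTED_USER.encode())
    password_ok = hmac.compare_digest(password.encode(), EXPECTED_PASSWORD.encode())
    return user_ok and password_ok


token = base64.b64encode(b"employee:example-password").decode("ascii")
print(is_authorized("Basic " + token))  # True
print(is_authorized(None))              # False
```

A request that fails this check would get a 401 response with a WWW-Authenticate: Basic header, which is what makes the browser prompt for a username and password.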

Source: https://stackoverflow.com/questions/9102914/how-to-stop-search-engines-from-crawling-the-whole-website

Tags

security

.htaccess

robots.txt