Disallow directory contents, but Allow directory page in robots.txt

Submitted by 允我心安 on 2019-12-02 04:19:06

Question


Will this work to disallow pages under a directory while still allowing the page at that directory URL itself?

Allow: /special-offers/$
Disallow: /special-offers/

to allow:

www.mysite.com/special-offers/

but block:

www.mysite.com/special-offers/page1

www.mysite.com/special-offers/page2.html

etc.


Answer 1:


Looking at Google's own robots.txt file, they are doing exactly what I was asking about.

At lines 136-137 they have:

Disallow: /places/
Allow: /places/$

So they are blocking anything under /places/ but allowing the root /places/ URL itself. The only difference from my syntax is the order, with the Disallow rule coming first.
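For Google's parser the order of the Allow and Disallow lines doesn't actually matter; the most specific (longest) matching rule wins, so both orderings behave the same. A complete robots.txt for the case in the question might look like this (a sketch; the User-agent line is an assumption, adjust it as needed):

User-agent: *
Allow: /special-offers/$
Disallow: /special-offers/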




Answer 2:


Standards

According to the HTML 4.01 specification, Appendix B.4.1, the only values allowed (no pun intended) in Disallow are partial URIs, representing partial or full paths:

The "Disallow" field specifies a partial URI that is not to be visited. This can be a full path, or a partial path; any URI that starts with this value will not be retrieved. For example,

Disallow: /help disallows both /help.html and /help/index.html, whereas

Disallow: /help/ would disallow /help/index.html but allow /help.html.
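In other words, under the original rules a Disallow value is just a path prefix. Here is a minimal Python sketch of that prefix-matching behaviour (the function name and rule list are made up for illustration):

def is_disallowed(path, disallow_rules):
    # Original robots.txt semantics: a path is blocked if it starts
    # with any Disallow value (plain prefix match, no wildcards).
    return any(path.startswith(rule) for rule in disallow_rules if rule)

print(is_disallowed("/help.html", ["/help"]))         # True
print(is_disallowed("/help/index.html", ["/help"]))   # True
print(is_disallowed("/help.html", ["/help/"]))        # False
print(is_disallowed("/help/index.html", ["/help/"]))  # True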

I don't think anything has changed since then, since current HTML5 Specification Drafts don't mention robots.txt at all.

Extensions

However, in practice, many Robot Engines (such as Googlebot) are more flexible in what they accept. If you use, for instance:

Disallow: /*.gif$

then Googlebot will skip any file with the .gif extension. I think you could do something like this to disallow all files under a folder, but I'm not 100% sure (you could test it with Google Webmaster Tools):

Disallow: /special-offers/*.*$
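In these extended rules, * matches any sequence of characters and a trailing $ anchors the rule to the end of the URL path. A rough Python sketch of that matching (not Google's actual implementation, just an illustration of the semantics):

import re

def google_style_match(pattern, path):
    # Translate the robots.txt pattern into a regex: '*' becomes '.*'
    # and a trailing '$' anchors the match at the end of the path.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.match(regex, path) is not None

print(google_style_match("/*.gif$", "/images/photo.gif"))                # True
print(google_style_match("/special-offers/$", "/special-offers/"))       # True
print(google_style_match("/special-offers/$", "/special-offers/page1"))  # False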

Other options

Anyway, you shouldn't rely on this too much (since each search engine might behave differently), so if possible it would be preferable to use meta tags or HTTP headers instead. For instance, you could configure your web server to include this header in all responses that should not be indexed (or followed):

X-Robots-Tag: noindex, nofollow

Look up the best way of doing this for your particular web server. Here's an example for Apache, combining mod_rewrite with mod_headers to conditionally set the header depending on the URL pattern. Disclaimer: I haven't tested it myself, so I can't say how well it works.

# all /special-offers/ sub-URLs set the env var ROBOTS=none
RewriteRule ^/special-offers/.+$ - [E=ROBOTS:none]

# if the env var ROBOTS is set, add the response header X-Robots-Tag with its value
Header set X-Robots-Tag %{ROBOTS}e env=ROBOTS

(Note: none is equivalent to noindex, nofollow)
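If you set this up, you could verify it by requesting one of the blocked URLs and inspecting the response headers, for example with a small Python check (the URL is the hypothetical one from the question):

import urllib.request

resp = urllib.request.urlopen("http://www.mysite.com/special-offers/page1")
print(resp.headers.get("X-Robots-Tag"))  # should print: none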



Source: https://stackoverflow.com/questions/14190893/disallow-directory-contents-but-allow-directory-page-in-robots-txt
