Robots.txt - What is the proper format for a Crawl Delay for multiple user agents?


Question


Below is a sample robots.txt file that allows multiple user agents, with a separate crawl delay for each user agent. The Crawl-delay values are for illustration only and will be different in a real robots.txt file.

I have searched all over the web for proper answers but could not find one. There are too many mixed suggestions and I do not know which is the correct / proper method.

Questions:

(1) Can each user agent have its own crawl-delay? (I assume yes)

(2) Where do you put the crawl-delay line for each user agent, before or after the Allow / Disallow line?

(3) Does there have to be a blank line between each user agent group?

References:

http://www.seopt.com/2013/01/robots-text-file/

http://help.yandex.com/webmaster/?id=1113851#1113858

Essentially, I am looking to find out how the final robots.txt file should look using the values in the sample below.

Thanks in advance.

# Allow only major search spiders    
User-agent: Mediapartners-Google
Disallow:
Crawl-delay: 11

User-agent: Googlebot
Disallow:
Crawl-delay: 12

User-agent: Adsbot-Google
Disallow:
Crawl-delay: 13

User-agent: Googlebot-Image
Disallow:
Crawl-delay: 14

User-agent: Googlebot-Mobile
Disallow:
Crawl-delay: 15

User-agent: MSNBot
Disallow:
Crawl-delay: 16

User-agent: bingbot
Disallow:
Crawl-delay: 17

User-agent: Slurp
Disallow:
Crawl-delay: 18

User-agent: Yahoo! Slurp
Disallow:
Crawl-delay: 19

# Block all other spiders
User-agent: *
Disallow: /

# Block Directories for all spiders
User-agent: *
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

(4) If I want to set all of the user agents to have a crawl delay of 10 seconds, would the following be correct?

# Allow only major search spiders
User-agent: *
Crawl-delay: 10

User-agent: Mediapartners-Google
Disallow:

User-agent: Googlebot
Disallow:

User-agent: Adsbot-Google
Disallow:

User-agent: Googlebot-Image
Disallow:

User-agent: Googlebot-Mobile
Disallow:

User-agent: MSNBot
Disallow:

User-agent: bingbot
Disallow:

User-agent: Slurp
Disallow:

User-agent: Yahoo! Slurp
Disallow:

# Block all other spiders
User-agent: *
Disallow: /

# Block Directories for all spiders
User-agent: *
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

Answer 1:


(1) Can each user agent have its own crawl-delay?

Yes. Each record, started by one or more User-agent lines, can have a Crawl-delay line. Note that Crawl-delay is not part of the original robots.txt specification, but it's no problem to include it for parsers that understand it, as the spec says:

Unrecognised headers are ignored.

So older robots.txt parsers will simply ignore your Crawl-delay lines.


(2) Where do you put the crawl-delay line for each user agent, before or after the Allow / Disallow line?

Doesn’t matter.
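
To illustrate with two of the bots from your sample: parsers that understand Crawl-delay treat both of the following records the same way, regardless of where the Crawl-delay line sits within the record.

User-agent: Googlebot
Crawl-delay: 12
Disallow:

User-agent: bingbot
Disallow:
Crawl-delay: 17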


(3) Does there have to be a blank line between each user agent group?

Yes. Records have to be separated by one or more blank lines. See the original spec:

The file consists of one or more records separated by one or more blank lines (terminated by CR,CR/NL, or NL).


(4) If I want to set all of the user agents to have a crawl delay of 10 seconds, would the following be correct?

No. Bots look for the record that matches their user agent. Only if they don't find one will they fall back to the User-agent: * record. So in your example, all of the listed bots (Googlebot, MSNBot, Yahoo! Slurp, etc.) will have no Crawl-delay at all.
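
If the goal is simply a 10-second delay for every listed bot, the Crawl-delay: 10 line has to be repeated inside each named record, roughly like this sketch (the directory blocking is dealt with separately below):

User-agent: Mediapartners-Google
Crawl-delay: 10
Disallow:

User-agent: Googlebot
Crawl-delay: 10
Disallow:

# ... and likewise for each of the other named bots in your list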


Also note that you can’t have several records with User-agent: *:

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

So parsers might (if no other record matched) use the first record with User-agent: * and ignore the following ones. For your first example, that would mean the URLs beginning with /ads/, /cgi-bin/ and /scripts/ are not blocked.

And even if you have only one record with User-agent: *, those Disallow lines apply only to bots that have no other matching record! As your comment # Block Directories for all spiders suggests, you want these URL paths to be blocked for all spiders, so you'd have to repeat the Disallow lines in every record, as in the sketch below.
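
Putting all of this together, one possible final robots.txt for your sample could look roughly like the sketch below. Only the first few records are written out; the same pattern repeats for Googlebot-Image, Googlebot-Mobile, MSNBot, bingbot, Slurp and Yahoo! Slurp. Keep in mind that honoring Crawl-delay is up to each individual crawler.

# Allow the major search spiders, with a 10-second crawl delay,
# but block the /ads/, /cgi-bin/ and /scripts/ directories for them too
User-agent: Mediapartners-Google
Crawl-delay: 10
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

User-agent: Googlebot
Crawl-delay: 10
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

User-agent: Adsbot-Google
Crawl-delay: 10
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

# (repeat the same record for each remaining bot in your list)

# Block all other spiders
User-agent: *
Disallow: /

The single User-agent: * record at the end covers every bot that did not match one of the named records; it does not need the directory lines, since Disallow: / already blocks everything for those bots.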



Source: https://stackoverflow.com/questions/17377835/robots-txt-what-is-the-proper-format-for-a-crawl-delay-for-multiple-user-agent
