Is the User-Agent line in robots.txt an exact match or a substring match?

你的背包 2020-12-21 04:30

When a crawler reads the User-Agent line of a robots.txt file, does it attempt to match it exactly against its own User-Agent, or does it attempt to match it as a substring of its own User-Agent string?

2 Answers
  • 2020-12-21 04:50

    Every robot does this a little differently. There is really no single reliable way to map the user-agent in robots.txt to the user-agent sent in the request headers. The safest thing to do is to treat them as two separate, arbitrary strings. The only 100% reliable way to find the robots.txt user-agent is to read the official documentation for the given robot.

    Edit:

    Your best bet is generally to read the official documentation for the given robot, but even that is not 100% accurate. As Michael Marr points out, Google has a robots.txt testing tool that can be used to verify which user-agent token a given Google crawler will honor. This tool reveals that their documentation is inaccurate. Specifically, the page https://developers.google.com/webmasters/control-crawl-index/docs/ claims that their media-partner bots respond to the 'Googlebot' UA, but the tool shows that they don't.
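
    As one concrete data point (not from the original answer): Python's standard-library robots.txt parser, urllib.robotparser, does a case-insensitive substring check of the robots.txt token against the request User-Agent after stripping version information, at least in the CPython versions I have looked at. A minimal sketch:

    ```python
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.parse([
        "User-agent: Googlebot",
        "Disallow: /private/",
    ])

    # CPython's parser lower-cases both sides, strips the version at "/",
    # and checks whether the robots.txt token is a substring of the
    # request UA, so both of these fall under the "Googlebot" group:
    print(rp.can_fetch("Googlebot/2.1", "https://example.com/private/page"))   # False
    print(rp.can_fetch("googlebot-news", "https://example.com/private/page"))  # False
    ```

    Other parsers and crawlers may behave differently, which is exactly why reading each robot's documentation (or using its testing tool) matters.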

  • 2020-12-21 04:56

    In the original robots.txt specification (from 1994), it says:

    User-agent

    […]

    The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

    […]

    Whether bots and parsers actually comply with this, and which ones do, is another question and can't be answered in general.
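
    To make the recommended rule concrete, here is a minimal sketch of that "case insensitive substring match of the name without version information" (the function and bot names are illustrative, not taken from any spec):

    ```python
    def group_applies(group_token: str, request_ua: str) -> bool:
        """Does a robots.txt 'User-agent:' token select this crawler?"""
        if group_token == "*":
            return True  # catch-all group
        # Strip version information, e.g. "ExampleBot/2.1" -> "examplebot"
        name = request_ua.split("/", 1)[0].strip().lower()
        return group_token.lower() in name

    print(group_applies("ExampleBot", "ExampleBot/2.1"))       # True (exact name)
    print(group_applies("examplebot", "ExampleBot-News/1.0"))  # True (case-insensitive substring)
    print(group_applies("OtherBot", "ExampleBot/2.1"))         # False
    print(group_applies("*", "AnythingAtAll/0.1"))             # True (catch-all)
    ```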
