How to identify Google/Yahoo/MSN web crawlers in PHP?

清酒与你 2020-12-29 17:51

As far as I know, $_SERVER['REMOTE_HOST'] should end with "google.com" or "yahoo.com".

But is that the most reliable method?

Is there any other way?

8 Answers
  •  情歌与酒
    2020-12-29 18:02

    First of all, I hope you're not trying to do this in order to serve search engine bots different content than your site shows to normal users. If they discover you doing this, your site will get removed from their listings entirely. As long as you understand that risk, you can usually find information about the unique user-agent each crawler uses:

    • Verifying Googlebot (use user-agent, reverse DNS if you want to be sure)
    • Yahoo's user agent will contain "Slurp"
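As a first, easily spoofed filter, the user-agent check might be sketched like this. The function name `guessCrawlerFromUserAgent` and the token table are hypothetical; only "Slurp" comes from the list above, "Googlebot" is the token Google's crawler sends, and "msnbot" is an assumption you should confirm against the crawler's own documentation.

```php
<?php
// Hypothetical helper: reports which crawler a User-Agent string CLAIMS to be.
// This is only a claim; anyone can send these strings.
function guessCrawlerFromUserAgent(string $userAgent): ?string
{
    // Token => label. The token list is illustrative, not exhaustive.
    $tokens = [
        'Googlebot' => 'google',
        'Slurp'     => 'yahoo',  // from the answer above
        'msnbot'    => 'msn',    // assumed token, verify before relying on it
    ];

    foreach ($tokens as $token => $label) {
        // Case-insensitive substring match
        if (stripos($userAgent, $token) !== false) {
            return $label;
        }
    }

    return null; // no known crawler token found
}

// The header may be missing entirely, so fall back to an empty string.
$claimed = guessCrawlerFromUserAgent($_SERVER['HTTP_USER_AGENT'] ?? '');
```

A non-null result only means the client says it is a crawler, so it is best treated as a cheap pre-check before any DNS lookup.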

    However, some people writing (usually poorly-behaved) web scrapers will set their User Agent strings to be the same as "legitimate" crawlers such as Google's. You can catch these by doing lookups on the bot's IP address/hostname to ensure that they actually are coming from Google/Yahoo/etc. Some more info about what to look for in hostname lookups (from this article):

    • Google crawlers will end with googlebot.com like in crawl-66-249-70-244.googlebot.com.
    • Yahoo crawlers will end with crawl.yahoo.net like in llf520064.crawl.yahoo.net.
    • Live Search crawlers will end with search.msn.com like in msnbot-65-55-104-161.search.msn.com.
    • Ask crawlers will end with ask.com like in crawler4037.ask.com.
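The hostname check above could be sketched roughly as follows: reverse-resolve the IP, compare the hostname against the suffixes listed, then resolve that hostname forward and confirm it maps back to the same IP. The names `hostMatchesCrawlerSuffix` and `isVerifiedCrawlerIp` are hypothetical, and the injectable resolver parameters are an assumption added so the logic can be exercised without network access. Note that `gethostbynamel()` returns IPv4 addresses only, so this sketch does not confirm IPv6 clients.

```php
<?php
// Hostname suffixes taken from the list above.
const CRAWLER_HOST_SUFFIXES = [
    'googlebot.com',
    'crawl.yahoo.net',
    'search.msn.com',
    'ask.com',
];

// Hypothetical helper: true if $host is a suffix itself or a subdomain of one.
// Matching on ".suffix" prevents "notgooglebot.com" from passing.
function hostMatchesCrawlerSuffix(string $host): bool
{
    $host = strtolower(rtrim($host, '.'));
    foreach (CRAWLER_HOST_SUFFIXES as $suffix) {
        if ($host === $suffix
            || substr($host, -strlen($suffix) - 1) === '.' . $suffix) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: reverse lookup, suffix check, forward confirmation.
// $reverse and $forward default to PHP's own resolvers.
function isVerifiedCrawlerIp(
    string $ip,
    ?callable $reverse = null,
    ?callable $forward = null
): bool {
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    $host = $reverse($ip);
    // gethostbyaddr() returns the IP unchanged when no PTR record exists,
    // and false on malformed input.
    if (!is_string($host) || $host === $ip || !hostMatchesCrawlerSuffix($host)) {
        return false;
    }

    // The forward lookup must include the original IP; otherwise the PTR
    // record could have been set by whoever controls that address.
    $addresses = $forward($host);
    return is_array($addresses) && in_array($ip, $addresses, true);
}
```

In a request handler this would be called as `isVerifiedCrawlerIp($_SERVER['REMOTE_ADDR'])`. DNS lookups are slow, so caching the result per IP is usually worthwhile.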
