How to identify the web crawlers of Google/Yahoo/MSN with PHP?

清酒与你 2020-12-29 17:51

AFAIK, $_SERVER['REMOTE_HOST'] should end with "google.com" or "yahoo.com".

But is that the most reliable method?

Is there any other way?

8 answers

情歌与酒 (original poster)

2020-12-29 18:02
First of all, I hope you're not doing this in order to serve search engine bots different content from what your site shows normal users. If they discover you doing this, your site will get removed from their listings entirely. So long as you understand that risk, you can usually find information about the unique user-agent each crawler uses:
- Verifying Googlebot (use user-agent, reverse DNS if you want to be sure)
- Yahoo's user agent will contain "Slurp"
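A first-pass check on the user-agent string could look like the sketch below. The function name guess_crawler_from_ua and the token table are hypothetical; "Slurp" comes from the list above, while "Googlebot" and "msnbot" are assumed from the crawler names. Treat the result as a hint only, since the header is supplied by the client.

```php
<?php
// Substrings to look for in the User-Agent header (case-insensitive).
// "Slurp" is from the list above; "Googlebot" and "msnbot" are assumptions
// based on the crawler names and may need updating.
const CRAWLER_UA_TOKENS = [
    'google' => 'Googlebot',
    'yahoo'  => 'Slurp',
    'msn'    => 'msnbot',
];

/**
 * Returns a crawler name if the user-agent contains a known token, else null.
 * This only reports what the client claims to be; it proves nothing.
 */
function guess_crawler_from_ua(string $userAgent): ?string
{
    foreach (CRAWLER_UA_TOKENS as $name => $token) {
        if (stripos($userAgent, $token) !== false) {
            return $name;
        }
    }
    return null;
}
```

In a request handler this would typically be called as guess_crawler_from_ua($_SERVER['HTTP_USER_AGENT'] ?? '').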
However, some people writing (usually poorly-behaved) web scrapers will set their User Agent strings to be the same as "legitimate" crawlers such as Google's. You can catch these by doing lookups on the bot's IP address/hostname to ensure that they actually are coming from Google/Yahoo/etc. Some more info about what to look for in hostname lookups (from this article):
- Google crawlers will end with googlebot.com like in crawl-66-249-70-244.googlebot.com.
- Yahoo crawlers will end with crawl.yahoo.net like in llf520064.crawl.yahoo.net.
- Live Search crawlers will end with search.msn.com like in msnbot-65-55-104-161.search.msn.com.
- Ask crawlers will end with ask.com like in crawler4037.ask.com.
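The hostname check above can be sketched roughly as follows: reverse-resolve the client IP, compare the hostname against the suffixes listed, then forward-resolve that hostname and confirm it maps back to the same IP. The function name identify_crawler, the suffix table keys, and the two optional resolver parameters (there only so the logic can be exercised without network access) are hypothetical. The suffixes are the ones from the list and may have changed since; gethostbynamel() only returns IPv4 addresses, so this sketch does not cover IPv6 clients.

```php
<?php
// Hostname suffixes taken from the list above. The leading dot prevents
// a host such as "fakegooglebot.com" from matching.
const CRAWLER_HOST_SUFFIXES = [
    'google' => '.googlebot.com',
    'yahoo'  => '.crawl.yahoo.net',
    'msn'    => '.search.msn.com',
    'ask'    => '.ask.com',
];

/**
 * Returns the crawler name if $ip reverse-resolves to a known crawler host
 * AND that host forward-resolves back to $ip; otherwise null.
 * $reverse and $forward default to PHP's built-in DNS functions.
 */
function identify_crawler(string $ip, ?callable $reverse = null, ?callable $forward = null): ?string
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    $host = $reverse($ip);
    // gethostbyaddr() returns the IP unchanged on lookup failure,
    // and false on malformed input.
    if (!is_string($host) || $host === $ip) {
        return null;
    }
    $host = strtolower(rtrim($host, '.'));

    foreach (CRAWLER_HOST_SUFFIXES as $name => $suffix) {
        if (substr($host, -strlen($suffix)) === $suffix) {
            // Forward confirmation: whoever controls the IP's reverse zone
            // can put any name in a PTR record, so the name must map back.
            $ips = $forward($host);
            return (is_array($ips) && in_array($ip, $ips, true)) ? $name : null;
        }
    }
    return null;
}
```

In a request handler the call would be identify_crawler($_SERVER['REMOTE_ADDR']). DNS lookups are slow, so caching the result per IP is probably worthwhile.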