Detect Search Crawlers via JavaScript

一世执手 提交于 2019-11-27 20:01:45

问题


I am wondering how would I go abouts in detecting search crawlers? The reason I ask is because I want to suppress certain JavaScript calls if the user agent is a bot.

I have found an example of how to to detect a certain browser, but am unable to find examples of how to detect a search crawler:

/MSIE (\d+\.\d+);/.test(navigator.userAgent); //test for MSIE x.x

Example of search crawlers I want to block:

Google 
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 
Googlebot/2.1 (+http://www.googlebot.com/bot.html) 
Googlebot/2.1 (+http://www.google.com/bot.html) 

Baidu 
Baiduspider+(+http://www.baidu.com/search/spider_jp.html) 
Baiduspider+(+http://www.baidu.com/search/spider.htm) 
BaiDuSpider 

回答1:


This is the regex the ruby UA agent_orange library uses to test if a userAgent looks to be a bot. You can narrow it down for specific bots by referencing the bot userAgent list here:

/bot|crawler|spider|crawling/i

For example you have some object, util.browser, you can store what type of device a user is on:

util.browser = {
   bot: /bot|googlebot|crawler|spider|robot|crawling/i.test(navigator.userAgent),
   mobile: ...,
   desktop: ...
}



回答2:


The following regex will match the biggest search engines according to this post.

/bot|google|baidu|bing|msn|duckduckbot|teoma|slurp|yandex/i
    .test(navigator.userAgent)

The matches search engines are:

  • Baidu
  • Bingbot/MSN
  • DuckDuckGo
  • Google
  • Teoma
  • Yahoo!
  • Yandex

Additionally, I've added bot as a catchall for smaller crawlers/bots.




回答3:


Try this. It's based on the crawlers list on available on https://github.com/monperrus/crawler-user-agents

var botPattern = "(googlebot\/|Googlebot-Mobile|Googlebot-Image|Google favicon|Mediapartners-Google|bingbot|slurp|java|wget|curl|Commons-HttpClient|Python-urllib|libwww|httpunit|nutch|phpcrawl|msnbot|jyxobot|FAST-WebCrawler|FAST Enterprise Crawler|biglotron|teoma|convera|seekbot|gigablast|exabot|ngbot|ia_archiver|GingerCrawler|webmon |httrack|webcrawler|grub.org|UsineNouvelleCrawler|antibot|netresearchserver|speedy|fluffy|bibnum.bnf|findlink|msrbot|panscient|yacybot|AISearchBot|IOI|ips-agent|tagoobot|MJ12bot|dotbot|woriobot|yanga|buzzbot|mlbot|yandexbot|purebot|Linguee Bot|Voyager|CyberPatrol|voilabot|baiduspider|citeseerxbot|spbot|twengabot|postrank|turnitinbot|scribdbot|page2rss|sitebot|linkdex|Adidxbot|blekkobot|ezooms|dotbot|Mail.RU_Bot|discobot|heritrix|findthatfile|europarchive.org|NerdByNature.Bot|sistrix crawler|ahrefsbot|Aboundex|domaincrawler|wbsearchbot|summify|ccbot|edisterbot|seznambot|ec2linkfinder|gslfbot|aihitbot|intelium_bot|facebookexternalhit|yeti|RetrevoPageAnalyzer|lb-spider|sogou|lssbot|careerbot|wotbox|wocbot|ichiro|DuckDuckBot|lssrocketcrawler|drupact|webcompanycrawler|acoonbot|openindexspider|gnam gnam spider|web-archive-net.com.bot|backlinkcrawler|coccoc|integromedb|content crawler spider|toplistbot|seokicks-robot|it2media-domain-crawler|ip-web-crawler.com|siteexplorer.info|elisabot|proximic|changedetection|blexbot|arabot|WeSEE:Search|niki-bot|CrystalSemanticsBot|rogerbot|360Spider|psbot|InterfaxScanBot|Lipperhey SEO Service|CC Metadata Scaper|g00g1e.net|GrapeshotCrawler|urlappendbot|brainobot|fr-crawler|binlar|SimpleCrawler|Livelapbot|Twitterbot|cXensebot|smtbot|bnf.fr_bot|A6-Indexer|ADmantX|Facebot|Twitterbot|OrangeBot|memorybot|AdvBot|MegaIndex|SemanticScholarBot|ltx71|nerdybot|xovibot|BUbiNG|Qwantify|archive.org_bot|Applebot|TweetmemeBot|crawler4j|findxbot|SemrushBot|yoozBot|lipperhey|y!j-asr|Domain Re-Animator Bot|AddThis)";
var re = new RegExp(botPattern, 'i');
var userAgent = 'Googlebot/2.1 (+http://www.googlebot.com/bot.html)';
if (re.test(userAgent)) {
    console.log('the user agent is a crawler!');
}



回答4:


The "test for MSIE x.x" example is just code for testing the userAgent against a Regular Expression. In your example the Regexp is the

/MSIE (\d+\.\d+);/

part. Just replace it with your own Regexp you want to test the user agent against. It would be something like

/Google|Baidu|Baiduspider/.test(navigator.userAgent)

where the vertical bar is the "or" operator to match the user agent against all of your mentioned robots. For more information about Regular Expression you can refer to this site since javascript uses perl-style RegExp.




回答5:


You have already disclosed what you will permit as user agents... But usually crawlers use various ACCEPTED user agents, as far as my experience goes you cannot limit any crawlers without affecting real users.



来源:https://stackoverflow.com/questions/20084513/detect-search-crawlers-via-javascript

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!