Detect Search Crawlers via JavaScript

How would I go about detecting search crawlers? I ask because I want to suppress certain JavaScript calls if the user agent is a bot.

5 Answers

清歌不尽

2020-12-13 00:11

Try this. It's based on the crawler list available at https://github.com/monperrus/crawler-user-agents

```
var botPattern = "(googlebot\/|bot|Googlebot-Mobile|Googlebot-Image|Google favicon|Mediapartners-Google|bingbot|slurp|java|wget|curl|Commons-HttpClient|Python-urllib|libwww|httpunit|nutch|phpcrawl|msnbot|jyxobot|FAST-WebCrawler|FAST Enterprise Crawler|biglotron|teoma|convera|seekbot|gigablast|exabot|ngbot|ia_archiver|GingerCrawler|webmon |httrack|webcrawler|grub.org|UsineNouvelleCrawler|antibot|netresearchserver|speedy|fluffy|bibnum.bnf|findlink|msrbot|panscient|yacybot|AISearchBot|IOI|ips-agent|tagoobot|MJ12bot|dotbot|woriobot|yanga|buzzbot|mlbot|yandexbot|purebot|Linguee Bot|Voyager|CyberPatrol|voilabot|baiduspider|citeseerxbot|spbot|twengabot|postrank|turnitinbot|scribdbot|page2rss|sitebot|linkdex|Adidxbot|blekkobot|ezooms|dotbot|Mail.RU_Bot|discobot|heritrix|findthatfile|europarchive.org|NerdByNature.Bot|sistrix crawler|ahrefsbot|Aboundex|domaincrawler|wbsearchbot|summify|ccbot|edisterbot|seznambot|ec2linkfinder|gslfbot|aihitbot|intelium_bot|facebookexternalhit|yeti|RetrevoPageAnalyzer|lb-spider|sogou|lssbot|careerbot|wotbox|wocbot|ichiro|DuckDuckBot|lssrocketcrawler|drupact|webcompanycrawler|acoonbot|openindexspider|gnam gnam spider|web-archive-net.com.bot|backlinkcrawler|coccoc|integromedb|content crawler spider|toplistbot|seokicks-robot|it2media-domain-crawler|ip-web-crawler.com|siteexplorer.info|elisabot|proximic|changedetection|blexbot|arabot|WeSEE:Search|niki-bot|CrystalSemanticsBot|rogerbot|360Spider|psbot|InterfaxScanBot|Lipperhey SEO Service|CC Metadata Scaper|g00g1e.net|GrapeshotCrawler|urlappendbot|brainobot|fr-crawler|binlar|SimpleCrawler|Livelapbot|Twitterbot|cXensebot|smtbot|bnf.fr_bot|A6-Indexer|ADmantX|Facebot|Twitterbot|OrangeBot|memorybot|AdvBot|MegaIndex|SemanticScholarBot|ltx71|nerdybot|xovibot|BUbiNG|Qwantify|archive.org_bot|Applebot|TweetmemeBot|crawler4j|findxbot|SemrushBot|yoozBot|lipperhey|y!j-asr|Domain Re-Animator Bot|AddThis)";
var re = new RegExp(botPattern, 'i');
var userAgent = navigator.userAgent;
if (re.test(userAgent)) {
    console.log('the user agent is a crawler!');
}
```
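As a rough sketch, the same check can be wrapped in a function that takes the user-agent string as an argument, which makes it easier to test outside a browser. The names CRAWLER_RE, isCrawler and currentUserAgent are my own, and the pattern is only a shortened, illustrative subset of the tokens in the list above.

```javascript
// Illustrative subset of the tokens in the full list above; extend it as needed.
const CRAWLER_RE = /googlebot\/|bingbot|slurp|baiduspider|yandexbot|duckduckbot/i;

// Returns true when the given user-agent string contains a known crawler token.
function isCrawler(userAgent) {
  return typeof userAgent === 'string' && CRAWLER_RE.test(userAgent);
}

// Reads the user agent defensively so the script also loads where
// `navigator` is missing (for example during server-side rendering).
function currentUserAgent() {
  return typeof navigator !== 'undefined' && typeof navigator.userAgent === 'string'
    ? navigator.userAgent
    : '';
}

if (isCrawler(currentUserAgent())) {
  console.log('the user agent is a crawler!');
}
```

Keeping the regex in one constant means it is compiled once, and leaving out the g flag keeps test() free of lastIndex state between calls.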

-上瘾入骨i

2020-12-13 00:26
The following regex matches the biggest search engines, according to this post.
```
/bot|google|baidu|bing|msn|teoma|slurp|yandex/i
    .test(navigator.userAgent)
```
The matched search engines are:
- Baidu
- Bingbot/MSN
- DuckDuckGo (duckduckbot)
- Google
- Teoma
- Yahoo!
- Yandex
Additionally, I've added bot as a catchall for smaller crawlers/bots.
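To see which alternative actually fires for a given user agent, something like the following might help. It is only a sketch: matchedCrawlerToken is a name I made up, and the sample strings are typical published user agents used purely as test input.

```javascript
const SEARCH_BOT_RE = /bot|google|baidu|bing|msn|teoma|slurp|yandex/i;

// Returns the lowercased token that matched first (leftmost in the string),
// or null when no alternative matched.
function matchedCrawlerToken(userAgent) {
  if (typeof userAgent !== 'string') return null;
  const match = SEARCH_BOT_RE.exec(userAgent);
  return match ? match[0].toLowerCase() : null;
}

const samples = [
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
  'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)',
  'DuckDuckBot/1.1; (+http://duckduckgo.com/duckduckbot.html)',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
];

for (const ua of samples) {
  console.log(matchedCrawlerToken(ua), ua);
}
```

Logging the matched token makes it easier to notice when the bot catch-all, rather than a specific engine name, is the part doing the matching.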
逝去的感伤

2020-12-13 00:33
The "test for MSIE x.x" example is just code that tests the userAgent against a regular expression. In your example, the regular expression is the
```
/MSIE (\d+\.\d+);/
```
part. Just replace it with the regular expression you want to test the user agent against. It would be something like
```
/Google|Baidu|Baiduspider/.test(navigator.userAgent)
```
where the vertical bar is the "or" operator, so the user agent is matched against every robot you list. For more information about regular expressions, you can refer to this site, since JavaScript uses Perl-style regular expressions.
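Note that a pattern written without the i flag is case-sensitive. One possible way to build such an alternation from a plain list of names is sketched below; escapeRegExp and buildBotRegex are hypothetical helpers, and the name list is only an example.

```javascript
// Escapes characters that have a special meaning inside a regular expression,
// so a name such as "grub.org" matches only a literal dot.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Joins the names with the "or" operator and matches case-insensitively.
function buildBotRegex(names) {
  if (names.length === 0) {
    return /(?!)/; // an empty list should match nothing, not everything
  }
  return new RegExp(names.map(escapeRegExp).join('|'), 'i');
}

// Example list; replace it with the robots you care about.
const botRegex = buildBotRegex(['Googlebot', 'Baiduspider', 'grub.org']);
console.log(botRegex.test('Mozilla/5.0 (compatible; Baiduspider/2.0)'));
```

In a browser you would pass navigator.userAgent to botRegex.test() instead of the literal string.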
感动是毒

2020-12-13 00:33
The isTrusted property could help you.

The isTrusted read-only property of the Event interface is a Boolean that is true when the event was generated by a user action, and false when the event was created or modified by a script or dispatched via EventTarget.dispatchEvent().

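For example:
```
// isTrusted is true for real user actions, so a false value suggests automation.
function isCrawler(event) {
  return !event.isTrusted;
}
```
⚠ Note that Internet Explorer is not compatible.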

Read more from doc: https://developer.mozilla.org/en-US/docs/Web/API/Event/isTrusted
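A minimal sketch of how this might be wired into an event listener follows. Since isTrusted is true for real user actions, a false value is treated here as a hint of scripted interaction; isSyntheticEvent and onTrustedOnly are names I made up. Keep in mind that a crawler which only loads and renders the page may never dispatch any events, so this probably covers scripted interaction only.

```javascript
// True when the event was created or dispatched by a script.
function isSyntheticEvent(event) {
  return event.isTrusted === false;
}

// Registers a handler that is skipped for script-generated events.
function onTrustedOnly(target, type, handler) {
  target.addEventListener(type, (event) => {
    if (isSyntheticEvent(event)) {
      return; // suppress the call for scripted events
    }
    handler(event);
  });
}

// Demonstration with a plain EventTarget: dispatchEvent() always produces
// an untrusted event, so the handler below is not invoked.
const target = new EventTarget();
let handled = 0;
onTrustedOnly(target, 'click', () => { handled += 1; });
target.dispatchEvent(new Event('click'));
console.log('handled:', handled);
```

In a page, target would typically be a DOM element or document rather than a bare EventTarget.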
终归单人心

2020-12-13 00:36
This is the regex that the Ruby user-agent library agent_orange uses to test whether a userAgent looks like a bot. You can narrow it down to specific bots by referencing the bot userAgent list here:
```
/bot|crawler|spider|crawling/i
```
For example, if you have an object such as util.browser, you can store what type of device the user is on:
```
util.browser = {
   bot: /bot|googlebot|crawler|spider|robot|crawling/i.test(navigator.userAgent),
   mobile: ...,
   desktop: ...
}
```
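Once such a flag exists, the calls from the question could be suppressed with a small guard. This is only a sketch: describeBrowser and runForHumans are hypothetical names, and only the bot flag is computed because the other fields are not specified above.

```javascript
// Builds the flag object from a user-agent string (only `bot` is filled in).
function describeBrowser(userAgent) {
  return {
    bot: /bot|crawler|spider|crawling/i.test(String(userAgent))
  };
}

// Runs the task only when the visitor does not look like a bot.
// Returns true if the task ran, false if it was suppressed.
function runForHumans(browser, task) {
  if (browser.bot) {
    return false;
  }
  task();
  return true;
}

const crawler = describeBrowser('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)');
const ran = runForHumans(crawler, () => console.log('tracking call'));
console.log('task ran:', ran);
```

In a browser, the object would be built once from navigator.userAgent and reused wherever a call needs to be skipped for bots.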