robots.txt

Blocking bots by modifying htaccess

Submitted by 好久不见 on 2019-12-06 15:29:05
I am trying to block a couple of bots via my .htaccess file. Search Engine Watch recommends using the code below. I did block these bots in the robots.txt file, but they are ignoring it. Here is the code from Search Engine Watch: RewriteEngine on Options +FollowSymlinks RewriteBase / RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC,OR] RewriteCond %{HTTP_USER_AGENT} ^Sogou RewriteRule ^.*$ - [F] My current .htaccess file is as below. How exactly would I modify my current .htaccess with the above code? What do I add, and where do I add it? I assume some of the above is already included in my
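A minimal sketch of how those rules could be merged into an existing .htaccess, assuming the file does not already enable the rewrite engine; the Baiduspider and Sogou names are taken from the question, and the blocking conditions would normally sit above any existing rewrite rules:

    RewriteEngine on
    Options +FollowSymlinks
    RewriteBase /
    # Send 403 Forbidden to the two bots: [NC] makes the match case-insensitive,
    # [OR] chains the two conditions, [F] refuses the request.
    RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Sogou [NC]
    RewriteRule ^.*$ - [F]
    # ... existing rewrite rules continue here ...

If the existing file already has RewriteEngine on and RewriteBase /, those two lines do not need to be repeated.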

robots.txt : how to disallow subfolders of dynamic folder

Submitted by 我的梦境 on 2019-12-06 12:27:53
I have URLs like these: /products/:product_id/deals/new /products/:product_id/deals/index I'd like to disallow the "deals" folder in my robots.txt file. [Edit] I'd like to disallow this folder for the Google, Yahoo and Bing bots. Does anyone know if these bots support the wildcard character, and so would support the following rule? Disallow: /products/*/deals Also... do you have any really good tutorial on robots.txt rules? I didn't manage to find a really good one, so I could use one... And one last question: is robots.txt the best way to handle this, or should I use the "noindex" meta tag instead? Thanks
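For what it's worth, Googlebot, Bingbot and Yahoo's Slurp all understand the * wildcard even though it is not part of the original robots.txt standard, so a rule along the lines of the one in the question should work. A sketch, naming the three crawlers explicitly:

    User-agent: Googlebot
    User-agent: bingbot
    User-agent: Slurp
    Disallow: /products/*/deals

Note that robots.txt only stops compliant crawlers from fetching the pages; if the goal is to keep already-known URLs out of the search results, a noindex meta tag (or X-Robots-Tag header) is the more reliable tool, and it only works if the page is not blocked in robots.txt, since the crawler has to fetch the page to see the tag.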

Is this Anti-Scraping technique viable with Robots.txt Crawl-Delay?

Submitted by 百般思念 on 2019-12-06 11:14:15
Question: I want to prevent web scrapers from aggressively scraping 1,000,000 pages on my website. I'd like to do this by returning a "503 Service Unavailable" HTTP error code to bots that access an abnormal number of pages per minute. I'm not having trouble with form spammers, just with scrapers. I don't want search engine spiders to ever receive the error. My inclination is to set a robots.txt crawl-delay which will ensure spiders access a number of pages per minute under my 503 threshold. Is this an
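A sketch of the crawl-delay idea (the 10-second figure is only an illustration and should be derived from whatever the 503 threshold ends up being):

    User-agent: *
    # At most one request roughly every 10 seconds from crawlers that honour the directive
    Crawl-delay: 10

One caveat: Bing and Yahoo honour Crawl-delay, but Googlebot ignores it (Google's crawl rate is controlled through Search Console instead), so the 503 threshold cannot rely on robots.txt alone for every search engine.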

Does the user agent string have to be exactly as it appears in my server logs?

Submitted by 大城市里の小女人 on 2019-12-06 08:33:49
Question: When using a robots.txt file, does the user agent string have to be exactly as it appears in my server logs? For example, when trying to match GoogleBot, can I just use googlebot? Also, will a partial match work, for example just using Google? Answer 1: Yes, the user agent has to be an exact match. From robotstxt.org: "globbing and regular expression are not supported in either the User-agent or Disallow lines" Answer 2: At least for googlebot, the user-agent is case-insensitive. Read the 'Order of
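A short illustration: the User-agent line in robots.txt targets the crawler's short token, not the full string recorded in the access log, and the comparison is case-insensitive (the /private/ path below is just a placeholder):

    # The access log shows the full UA string:
    #   Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    # The robots.txt record only needs the token:
    User-agent: Googlebot
    Disallow: /private/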

How to replace robots.txt with .htaccess

Submitted by a 夏天 on 2019-12-06 04:37:24
I have a small situation where I have to remove my robots.txt file because I don't want any robot crawlers to get the links. I want the links to remain accessible to users, but I don't want them to be cached by the search engines. I also cannot add any user authentication, for various reasons. So I am thinking about using mod_rewrite to stop search engine crawlers from crawling the links while allowing everyone else to access them. The logic I am trying to implement is to write a condition that checks whether the incoming user agent is a search engine and, if so, redirects it to a 401. The only problem is I don't
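A hedged sketch of that logic in .htaccess, with a purely illustrative list of crawler names; note that mod_rewrite's [F] flag answers with 403 Forbidden rather than 401, since 401 normally implies an authentication challenge:

    RewriteEngine on
    # If the user agent looks like a known search engine crawler...
    RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|slurp|baiduspider) [NC]
    # ...refuse the request instead of serving the page.
    RewriteRule ^ - [F]

If the real goal is only to keep the pages out of caches and search results while users can still reach them, sending an X-Robots-Tag "noindex, noarchive" response header (mod_headers) is usually the cleaner option, because it does not require guessing user-agent strings.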

Anybody got any C# code to parse robots.txt and evaluate URLs against it

Submitted by ≡放荡痞女 on 2019-12-06 01:44:34
Question: Short question: has anybody got any C# code to parse robots.txt and then evaluate URLs against it, to see whether they would be excluded or not? Long question: I have been creating a sitemap for a new site yet to be released to Google. The sitemap has two modes, a user mode (like a traditional sitemap) and an 'admin' mode. The admin mode will show all possible URLs on the site, including customized entry URLs or URLs for a specific outside partner - such as example.com/oprah for anyone who sees our

robots.txt URL format

Submitted by 落爺英雄遲暮 on 2019-12-06 01:16:51
Question: According to this page, globbing and regular expressions are not supported in either the User-agent or Disallow lines. However, I noticed that the Stack Overflow robots.txt includes characters like * and ? in the URLs. Are these supported or not? Also, does it make any difference whether a URL includes a trailing slash, or are these two equivalent? Disallow: /privacy Disallow: /privacy/ Answer 1: Your second question, the two are not equivalent. /privacy will block anything that starts with /privacy,
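A short illustration of the difference, along with the non-standard wildcard and end-anchor extensions that the major engines (Google, Bing, Yahoo) do recognise; the paths are placeholders:

    # Prefix match: blocks /privacy, /privacy.html, /privacy-settings, /privacy/policy, ...
    Disallow: /privacy
    # Blocks only URLs under the /privacy/ directory, not /privacy itself
    Disallow: /privacy/
    # Extensions understood by the major engines, not by the original standard:
    Disallow: /*?sort=
    Disallow: /*.pdf$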

Why does Chrome request a robots.txt?

Submitted by 给你一囗甜甜゛ on 2019-12-05 22:51:29
Question: I have noticed in my logs that Chrome requested a robots.txt alongside everything else I expected it to request. [...] 2017-09-17 15:22:35 - (sanic)[INFO]: Goin' Fast @ http://0.0.0.0:8080 2017-09-17 15:22:35 - (sanic)[INFO]: Starting worker [26704] 2017-09-17 15:22:39 - (network)[INFO][127.0.0.1:36312]: GET http://localhost:8080/ 200 148 2017-09-17 15:22:39 - (sanic)[ERROR]: Traceback (most recent call last): File "/usr/local/lib/python3.5/dist-packages/sanic/app.py", line 493, in handle_request handler,

robots.txt parser java

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-05 22:20:58
Question: I want to know how to parse robots.txt in Java. Is there already any code for this? Answer 1: Heritrix is an open-source web crawler written in Java. Looking through their javadoc, I see that they have a utility class Robotstxt for parsing the robots.txt file. Answer 2: There's also the jrobotx library hosted at SourceForge. (Full disclosure: I spun off the code that forms that library.) Answer 3: There is also a new release of crawler-commons: https://github.com/crawler-commons/crawler-commons The library aims to
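If pulling in a full crawler library is overkill, here is a deliberately simplified, self-contained Java sketch of the core idea (collect Disallow prefixes for a matching User-agent group, then do prefix matching on the path). It ignores Allow lines, wildcards and multi-agent groups, so treat it as a starting point rather than a compliant parser:

    import java.util.ArrayList;
    import java.util.List;

    /** Minimal robots.txt evaluator: collects Disallow prefixes for one agent token. */
    public class SimpleRobots {
        private final List<String> disallowed = new ArrayList<>();

        public SimpleRobots(String robotsTxt, String agentToken) {
            boolean inMatchingGroup = false;
            for (String raw : robotsTxt.split("\\r?\\n")) {
                String line = raw.replaceFirst("#.*", "").trim(); // strip comments
                int colon = line.indexOf(':');
                if (colon < 0) continue;                          // skip blank/invalid lines
                String field = line.substring(0, colon).trim().toLowerCase();
                String value = line.substring(colon + 1).trim();
                if (field.equals("user-agent")) {
                    // A "*" group or a group naming our token applies to us.
                    inMatchingGroup = value.equals("*")
                            || value.toLowerCase().contains(agentToken.toLowerCase());
                } else if (inMatchingGroup && field.equals("disallow") && !value.isEmpty()) {
                    disallowed.add(value);
                }
            }
        }

        /** True if no collected Disallow prefix matches the start of the path. */
        public boolean isAllowed(String path) {
            for (String prefix : disallowed) {
                if (path.startsWith(prefix)) return false;
            }
            return true;
        }

        public static void main(String[] args) {
            String robots = "User-agent: *\nDisallow: /private/\nDisallow: /tmp";
            SimpleRobots rules = new SimpleRobots(robots, "mybot");
            System.out.println(rules.isAllowed("/private/report.html")); // false
            System.out.println(rules.isAllowed("/index.html"));          // true
        }
    }

For production use, the crawler-commons parser mentioned in Answer 3 handles the edge cases (Allow lines, wildcards, agent precedence) that this sketch deliberately skips.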

How to set Robots.txt or Apache to allow crawlers only at certain hours?

Submitted by 独自空忆成欢 on 2019-12-05 11:34:50
As traffic is distributed unevenly over 24 hours, I would like to disallow crawlers during peak hours and allow them at non-busy hours. Is there a method to achieve this? Edit: thanks for all the good advice. This is another solution we found. 2bits.com has an article on setting an IPTables firewall to limit the number of connections from certain IP addresses. From the article, the IPTables setting is: Using connlimit In newer Linux kernels, there is a connlimit module for iptables. It can be used like this: iptables -I INPUT -p tcp -m connlimit --connlimit-above 5 -j REJECT This limits connections
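Another low-tech sketch for the original time-window idea: well-behaved crawlers re-fetch robots.txt regularly, so a cron job can swap a restrictive copy in during peak hours and the normal copy back afterwards. The file names, paths and hours below are made up for illustration, and since crawlers may cache robots.txt for up to a day the boundaries are soft:

    # crontab entries (server local time)
    # robots.peak.txt contains "User-agent: *" / "Disallow: /";
    # robots.offpeak.txt contains the normal rules.
    0 8  * * *  cp /var/www/site/robots.peak.txt    /var/www/site/robots.txt
    0 22 * * *  cp /var/www/site/robots.offpeak.txt /var/www/site/robots.txt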