How to set Robots.txt or Apache to allow crawlers only at certain hours?


Question


As traffic is distributed unevenly over 24 hours, I would like to disallow crawlers during peak hours and allow them at non-busy hours.

Is there a method to achieve this?

Edit: thanks for all the good advice.

This is another solution we found.

2bits.com has an article on configuring an iptables firewall to limit the number of connections from individual IP addresses.

The relevant iptables setting:

  • Using connlimit

In newer Linux kernels, there is a connlimit module for iptables. It can be used like this:

iptables -I INPUT -p tcp -m connlimit --connlimit-above 5 -j REJECT

This limits each IP address to no more than 5 simultaneous connections. It effectively rations connections and prevents crawlers from hitting the site with many parallel requests at once.
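If you adopt the connlimit approach, you may want to scope the rule to web traffic only, so that SSH and other services are not affected. A hedged variant, assuming the site is served on ports 80 and 443 and that 5 is an acceptable per-IP limit:

# Sketch: limit new HTTP/HTTPS connections per source IP; ports and limit are assumptions.
iptables -I INPUT -p tcp --syn -m multiport --dports 80,443 -m connlimit --connlimit-above 5 -j REJECT --reject-with tcp-reset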


Answer 1:


You can't control that in the robots.txt file. It's possible that some crawlers might support something like that, but none of the big ones do (as far as I know).

Dynamically changing the robots.txt file is also a bad idea in a case like this. Most crawlers cache the robots.txt file for a certain time, and continue using it until they refresh the cache. If they cache it at the "right" time, they might crawl normally all day. If they cache it at the "wrong" time, they would stop crawling altogether (and perhaps even remove indexed URLs from their index). For instance, Google generally caches the robots.txt file for a day, meaning that changes during the course of a day would not be visible to Googlebot.

If crawling is causing too much load on your server, you can sometimes adjust the crawl rate for individual crawlers. For instance, for Googlebot you can do this in Google Webmaster Tools.

Additionally, when crawlers attempt to crawl during times of high load, you can always just serve them a 503 HTTP result code. This tells crawlers to check back at some later time (you can also specify a Retry-After HTTP header if you know when they should come back). While I'd try to avoid doing this strictly on a time-of-day basis (this can block many other features, such as Sitemaps, contextual ads, or website verification, and can slow down crawling in general), in exceptional cases it might make sense. In the long run, I'd strongly recommend only doing this when your server load is really much too high to successfully return content to crawlers.
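If you do go the 503 route, a minimal Apache mod_rewrite sketch along these lines could work; the user-agent pattern and the assumed peak window (08:00–17:59 server time) are illustrative assumptions, not part of the original answer:

# Sketch only: answer 503 to common crawlers during assumed peak hours.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Googlebot|bingbot|Slurp) [NC]
RewriteCond %{TIME_HOUR} >07
RewriteCond %{TIME_HOUR} <18
RewriteRule .* - [R=503,L]

You could also add a Retry-After response header to the 503 so crawlers know roughly when to come back, as the answer suggests.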




Answer 2:


You cannot control at what time crawlers do their work; however, using Crawl-delay you may be able to reduce the frequency with which they request pages. This can be useful to prevent them from rapidly requesting pages. Note, though, that not all major crawlers honor Crawl-delay (Googlebot, for example, ignores it).

For example:

User-agent: *
Crawl-delay: 5



Answer 3:


This is not possible with robots.txt syntax; the feature simply isn't there.

You might be able to influence crawlers by actually altering the robots.txt file depending on the time of day. I would expect Google to check the file shortly before crawling, for example. But obviously there is a huge danger of scaring crawlers away for good that way, and that risk is probably more problematic than whatever load you are getting right now.




Answer 4:


I don't think you can make an appointment with a search engine spider.




Answer 5:


First, let it be clear (quoting John Mueller's answer above):

"Dynamically changing the robots.txt file is also a bad idea in a case like this. Most crawlers cache the robots.txt file for a certain time, and continue using it until they refresh the cache. If they cache it at the 'right' time, they might crawl normally all day. If they cache it at the 'wrong' time, they would stop crawling altogether (and perhaps even remove indexed URLs from their index). For instance, Google generally caches the robots.txt file for a day, meaning that changes during the course of a day would not be visible to Googlebot." (answered Jan 22 '11 at 14:25 by John Mueller)

I tried using a cron job to rename the robots.txt file during the week, like an on/off switch. For example, every Monday at midnight it renames "robots.txt" to "def-robots.txt", which then stops blocking crawlers. I allow two to three days, then another scheduled cron job renames "def-robots.txt" back to "robots.txt", which starts blocking crawlers from accessing my sites again. So that is a long-winded way of doing it, but what was described above is exactly what happened to me.
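A minimal crontab sketch of that on/off switch, assuming hypothetical paths and days (the poster's exact schedule and file locations are not given), and keeping in mind the indexing problems described next:

# Monday 00:00 - move the blocking rules aside so crawling is allowed
0 0 * * 1  mv /var/www/html/robots.txt /var/www/html/def-robots.txt
# Thursday 00:00 - put the blocking rules back so crawlers are blocked again
0 0 * * 4  mv /var/www/html/def-robots.txt /var/www/html/robots.txt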

I have had a major decrease in, if not a complete loss of, my indexed links, because Googlebot could not verify that the links were still correct while robots.txt was blocking Google from accessing my site for half the week. Simple. Scheduling a cron job to change the file to the customizations you want could work somewhat. That is the only way I have found to customize robots.txt on a scheduled time basis.



Source: https://stackoverflow.com/questions/4730376/how-to-set-robots-txt-or-apache-to-allow-crawlers-only-at-certain-hours
