Prevent Custom Web Crawler from being blocked

Submitted on 2019-12-04 14:35:34

Question


I am creating a new web crawler in C# to crawl some specific websites. Everything goes fine, but the problem is that some websites block my crawler's IP address after a number of requests. I tried adding delays between my crawl requests, but that did not work.

Is there any way to prevent websites from blocking my crawler? Solutions like the following would help (but I need to know how to apply them):

  • simulating Googlebot or Yahoo! Slurp
  • using multiple IP addresses (even fake IP addresses) as the crawler's client IP

Any solution would help.


Answer 1:


If speed/throughput is not a huge concern, then probably the best solution is to install Tor and Privoxy and route your crawler through that. Then your crawler will have a randomly changing IP address.

This is a very effective technique if you need to crawl sites that do not want you crawling them. It also provides a layer of protection/anonymity by making the activities of your crawler very difficult to trace back to you.
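As a concrete illustration of that routing (a sketch in Python rather than the asker's C#, and assuming Privoxy is running with its stock listen address of 127.0.0.1:8118):

```python
import urllib.request

# Privoxy's default listen address (assumption: a stock local install).
PRIVOXY = "http://127.0.0.1:8118"

def make_tor_opener(proxy_url: str = PRIVOXY) -> urllib.request.OpenerDirector:
    """Build an opener that sends every request through the local Privoxy
    instance, which in turn forwards the traffic into the Tor network."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

# Usage (requires Tor and Privoxy actually running locally):
# opener = make_tor_opener()
# html = opener.open("https://example.com").read()
```

In C#, the same routing can be achieved by giving an `HttpClientHandler` a `WebProxy` pointed at the same local address.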

Of course, if sites are blocking your crawler because it is just going too fast, then perhaps you should just rate-limit it a bit.
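A minimal rate limiter is easy to sketch (Python here for illustration; the interval value is an arbitrary example, not a recommendation):

```python
import time

class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval  # seconds between requests
        self._last = 0.0

    def wait(self) -> None:
        # Sleep just long enough that min_interval has passed since last call.
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

# Usage:
# limiter = RateLimiter(2.0)  # at most one request every 2 seconds
# for url in urls:
#     limiter.wait()
#     fetch(url)
```

In practice a randomized delay tends to look less mechanical to the target site than a perfectly fixed one.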




Answer 2:


And this is how you block the fakers (in case someone finds this page while searching for how to block them).

Block that trick in Apache:

# Block fake Googlebot requests that do not come from Google's IP ranges
# (a fake Googlebot). [F] => Forbidden
RewriteCond %{HTTP:X-FORWARDED-FOR} !^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Googlebot/2\.[01];\ \+http://www\.google\.com/bot\.html\)$ [NC]
RewriteRule .* - [F,L]

Or, for completeness' sake, a similar block in nginx:

   map_hash_bucket_size  1024;
   map_hash_max_size     102400;

   map $http_user_agent $is_bot {
      default 0;
      ~(crawl|Googlebot|Slurp|spider|bingbot|tracker|click|parser)$ 1;
   }

   geo $not_google {
      default     1;
      66.0.0.0/8  0;
   }

   map $http_user_agent $bots {
      default              0;
      ~(?i)googlebot       $not_google;
   }
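The maps above only compute `$bots`; the snippet as given never acts on it. A minimal way to use it (a sketch; the surrounding `server` block directives are placeholders) is:

```nginx
server {
    # ... listen, server_name, etc.

    # Reject requests flagged as a fake Googlebot
    if ($bots) {
        return 403;
    }
}
```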


Source: https://stackoverflow.com/questions/7644036/prevent-custom-web-crawler-from-being-blocked
