Increase number of threads in crawler

Posted by 99封情书 on 2019-12-23 05:42:09

Question


This code is taken from http://code.google.com/p/crawler4j/; the file is named MyCrawler.java.


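import java.util.List;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {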

        Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
                + "|png|tiff?|mid|mp2|mp3|mp4"
                + "|wav|avi|mov|mpeg|ram|m4v|pdf"
                + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

        /*
         * You should implement this function to specify
         * whether the given URL should be visited or not.
         */
        public boolean shouldVisit(WebURL url) {
                String href = url.getURL().toLowerCase();
                if (filters.matcher(href).matches()) {
                        return false;
                }
                if (href.startsWith("http://www.xyz.us.edu/")) {
                        return true;
                }
                return false;
        }

        /*
         * This function is called when a page is fetched
         * and ready to be processed by your program
         */
        public void visit(Page page) {
                int docid = page.getWebURL().getDocid();
                String url = page.getWebURL().getURL();         
                String text = page.getText();
                List<WebURL> links = page.getURLs();            
        }
}

And this is the code for Controller.java, which is where MyCrawler gets called:

import edu.uci.ics.crawler4j.crawler.CrawlController;

public class Controller {
        public static void main(String[] args) throws Exception {
                CrawlController controller = new CrawlController("/data/crawl/root");
                controller.addSeed("http://www.xyz.us.edu/");
                controller.start(MyCrawler.class, 10);  
        }
}

I just want to make sure I understand what this line in Controller.java means:

controller.start(MyCrawler.class, 10);

What is the meaning of the 10 here? And if we increase it from 10 to 20, what will the effect be? Any suggestions will be appreciated.


Answer 1:


This website shows the source for CrawlController.

Increasing the value from 10 to 20 increases the number of crawlers, each of which runs in its own thread. Studying that code will tell you what effect this will have.
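To make "N crawlers, each in its own thread" concrete, here is a minimal sketch in plain Java. It is not crawler4j's implementation: the class CrawlerThreadsSketch, the method crawl, the Result holder and the 5 ms sleep standing in for a network fetch are all assumptions made for illustration. Each worker thread simply pulls URLs from a shared queue until it is empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a crawl controller; NOT crawler4j code.
public class CrawlerThreadsSketch {

    /** Outcome of one simulated crawl. */
    public static final class Result {
        public final int pagesProcessed; // URLs taken from the queue and "fetched"
        public final int threadsUsed;    // distinct worker threads that handled at least one URL

        Result(int pagesProcessed, int threadsUsed) {
            this.pagesProcessed = pagesProcessed;
            this.threadsUsed = threadsUsed;
        }
    }

    /** Starts numberOfCrawlers threads that drain a shared URL queue. */
    public static Result crawl(List<String> urls, int numberOfCrawlers) {
        BlockingQueue<String> frontier = new LinkedBlockingQueue<>(urls);
        AtomicInteger processed = new AtomicInteger();
        Set<String> activeThreads = ConcurrentHashMap.newKeySet();
        List<Thread> workers = new ArrayList<>();

        for (int i = 0; i < numberOfCrawlers; i++) {
            Thread worker = new Thread(() -> {
                // poll() returns null once the queue is empty, which ends this worker.
                while (frontier.poll() != null) {
                    try {
                        Thread.sleep(5); // assumed stand-in for network wait
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    processed.incrementAndGet();
                    activeThreads.add(Thread.currentThread().getName());
                }
            }, "crawler-" + i);
            workers.add(worker);
            worker.start();
        }

        // Wait for every crawler thread to finish.
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for crawlers", e);
            }
        }
        return new Result(processed.get(), activeThreads.size());
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 200; i++) {
            urls.add("http://www.xyz.us.edu/page" + i);
        }
        for (int n : new int[] {10, 20}) {
            long start = System.nanoTime();
            Result r = crawl(urls, n);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(n + " threads: " + r.pagesProcessed
                    + " pages, " + r.threadsUsed + " threads used, ~" + elapsedMs + " ms");
        }
    }
}
```

Because the simulated work is almost all waiting, doubling the thread count should roughly halve the elapsed time in this toy setup. A real crawl only behaves that way until the network, CPU or disk becomes the limit.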




Answer 2:


Given the title you put on the post, you appear to already know what this does: it sets the number of crawler threads. As for what effect it will have, that depends largely on how much of the time each thread spends waiting for I/O (mostly network, and a little disk), and on how much CPU and disk throughput you have. Throughput peaks when you hit one of these limits:

  • no more CPU time left
  • no more network bandwidth
  • no more disk bandwidth

For CPU, don't expect to reach 100% utilization; figure on about 80% at most.
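One way to turn this reasoning into a starting number is the common rule of thumb threads ≈ cores × target CPU utilization × (1 + wait time / compute time). The sketch below applies it; the class PoolSizeEstimate, the method estimateThreads and the sample timings (90 ms waiting and 10 ms computing per page) are assumptions for illustration, not measurements from crawler4j. Treat the result as a first guess to refine by measuring, since it ignores network and disk bandwidth limits.

```java
// Hypothetical helper illustrating a thread-count rule of thumb; not part of crawler4j.
public class PoolSizeEstimate {

    /**
     * Estimates a thread count for I/O-heavy work.
     *
     * @param cores                number of CPU cores available
     * @param targetCpuUtilization desired CPU utilization in (0, 1], e.g. 0.8
     * @param waitMs               average time a task spends blocked on I/O
     * @param computeMs            average time a task spends using the CPU
     */
    public static int estimateThreads(int cores, double targetCpuUtilization,
                                      double waitMs, double computeMs) {
        if (cores <= 0 || computeMs <= 0 || waitMs < 0
                || targetCpuUtilization <= 0 || targetCpuUtilization > 1) {
            throw new IllegalArgumentException("Invalid sizing parameters");
        }
        double threads = cores * targetCpuUtilization * (1.0 + waitMs / computeMs);
        // Always allow at least one thread.
        return Math.max(1, (int) Math.round(threads));
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Assumed per-page timings: 90 ms waiting on the network, 10 ms of CPU work.
        int suggested = estimateThreads(cores, 0.8, 90, 10);
        System.out.println("Cores: " + cores + ", suggested crawler threads: " + suggested);
    }
}
```

With 4 cores and those assumed timings the estimate comes out at 32 threads, which suggests that going from 10 to 20 crawlers would still help on such a machine, provided the network and the target site can keep up.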



Source: https://stackoverflow.com/questions/6683764/increase-number-of-threads-in-crawler
