Question
This code is taken from http://code.google.com/p/crawler4j/ and the file is named MyCrawler.java:
import java.util.List;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // Matches URLs that end in a static-asset or binary file extension.
    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    /*
     * You should implement this function to specify
     * whether the given URL should be visited or not.
     */
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        if (filters.matcher(href).matches()) {
            return false;
        }
        if (href.startsWith("http://www.xyz.us.edu/")) {
            return true;
        }
        return false;
    }

    /*
     * This function is called when a page is fetched
     * and ready to be processed by your program
     */
    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        String text = page.getText();
        List<WebURL> links = page.getURLs();
    }
}
And this is Controller.java, which launches MyCrawler:
import edu.uci.ics.crawler4j.crawler.CrawlController;

public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlController controller = new CrawlController("/data/crawl/root");
        controller.addSeed("http://www.xyz.us.edu/");
        controller.start(MyCrawler.class, 10);
    }
}
I want to confirm what this line in Controller.java means:
controller.start(MyCrawler.class, 10);
What does the 10 mean here, and what is the effect of increasing it to 20? Any suggestions will be appreciated.
Answer 1:
This website shows the source for CrawlController.
Going from 10 to 20 increases the number of crawlers (each running in its own thread). Studying that code will tell you what effect this will have.
Answer 2:
Given the title you put on the post, you appear to already know what this does: it sets the number of crawler threads. As for the effect, that depends largely on how much of its time each thread spends waiting for I/O (mostly network, plus a little disk), and on how much CPU and disk throughput you have. Throughput peaks when one of these resources runs out:
- no more CPU time left
- no more network bandwidth
- no more disk bandwidth
For CPU, don't expect to reach 100%; figure 80% or so at most.
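To get a feel for the I/O-bound case described above, here is a rough sketch that does not use crawler4j at all. It fakes each page fetch with Thread.sleep and runs the same batch of "pages" through a fixed thread pool of 10 and then 20 workers. The class name ThreadCountDemo, the method simulateCrawl, and the numbers (40 pages, 50 ms per fetch) are made up for illustration. A real crawl will stop scaling once CPU, network, or disk is saturated, which this simulation does not model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class ThreadCountDemo {

    /*
     * Pretends to fetch `pages` pages with `threads` worker threads.
     * Each fetch just blocks for `ioMillis` to stand in for network wait.
     * Returns the elapsed wall-clock time in milliseconds.
     */
    static long simulateCrawl(int threads, int pages, long ioMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        try {
            // Hypothetical stand-in for one page fetch: pure waiting, no CPU work.
            Callable<Void> fetch = () -> {
                Thread.sleep(ioMillis);
                return null;
            };
            List<Future<Void>> pending = new ArrayList<>();
            for (int i = 0; i < pages; i++) {
                pending.add(pool.submit(fetch));
            }
            // Wait for every simulated fetch to finish.
            for (Future<Void> f : pending) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("10 threads: " + simulateCrawl(10, 40, 50) + " ms");
        System.out.println("20 threads: " + simulateCrawl(20, 40, 50) + " ms");
    }
}
```

Because the fake fetches only wait, 20 threads should finish in roughly half the time of 10 here. If you replace the sleep with CPU-heavy work, the gain should flatten out once the cores are busy, which is the limit this answer is pointing at.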
Source: https://stackoverflow.com/questions/6683764/increase-number-of-threads-in-crawler