Calling controller.start in a loop in crawler4j?

Submitted by 烂漫一生 on 2019-12-11 21:59:03

Question


I asked a related question here. This one sounds similar, but it is a different question.

Using crawler4j, I want to crawl multiple seed URLs while restricting the crawl by domain name (that is, with a domain-name check in shouldVisit). Here is an example of how to do it. In short, you set a list of domain names on the controller with setCustomData, the controller passes it to the crawler class, and in shouldVisit you loop through that list (see the linked URL) and return true if the URL's domain name is in it.

There is a glitch in this. If google.com and yahoo.com are both in the list of seed domains and www.yahoo.com/xyz links to www.google.com/zyx, the crawler will follow that link, because www.google.com is in our domains-to-visit list. Also, a for loop in shouldVisit could be expensive if the number of seed URLs is huge (thousands), and the list consumes some memory as well.
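To make the two concerns above concrete, here is a minimal sketch of a domain check that avoids both the per-call loop and the cross-domain hop. It is plain Java and deliberately does not use the crawler4j API: DomainFilter, hostOf, and the two-argument shouldVisit are hypothetical names for illustration, and it assumes the URL of the page that contained the link is available when the check runs. It also treats a "domain" as the host with a leading "www." removed, which is a simplification (mail.yahoo.com and yahoo.com count as different hosts here).

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of crawler4j.
final class DomainFilter {
    // A hash set gives constant-time lookups instead of a loop over every seed domain.
    private final Set<String> allowedHosts = new HashSet<>();

    DomainFilter(Iterable<String> seedUrls) {
        for (String seed : seedUrls) {
            String host = hostOf(seed);
            if (host != null) {
                allowedHosts.add(host);
            }
        }
    }

    // Returns the lower-cased host without a leading "www.", or null if the URL has no host.
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            if (host == null) {
                return null;
            }
            host = host.toLowerCase(Locale.ROOT);
            return host.startsWith("www.") ? host.substring(4) : host;
        } catch (IllegalArgumentException e) {
            return null; // malformed URL
        }
    }

    // Follow a link only if it stays on the host of the page it was found on
    // and that host belongs to one of the seeds.
    boolean shouldVisit(String parentUrl, String targetUrl) {
        String parent = hostOf(parentUrl);
        String target = hostOf(targetUrl);
        return target != null && target.equals(parent) && allowedHosts.contains(target);
    }
}
```

With google.com and yahoo.com both seeded, a link from www.yahoo.com/xyz to www.yahoo.com/abc passes this check, while a link from www.yahoo.com/xyz to www.google.com/zyx does not. Inside a real crawler, the same comparison would go in the crawler's own shouldVisit, using whatever parent-URL information the installed crawler4j version exposes.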

To counter this, I am thinking of looping through the seed URLs and starting one controller per seed. It might look like this:

// One controller per seed; start() blocks until that crawl has finished.
while (s.next()) {
    CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
    controller.addSeed(seedUrl);           // placeholder: the seed URL for this iteration
    controller.setCustomData(seedDomain);  // placeholder: the domain checked in shouldVisit
    controller.start(MyCrawler.class, numberOfCrawlers);
}

I am not sure whether this is a terrible idea. Are there any advantages or disadvantages in terms of performance? Any other concerns?

Edit:

I tested it, and this approach seems to take too much time (probably from creating and shutting down a controller instance in each iteration). I wish there were some other solution.


Answer 1:


Try the solution I found in a related thread:

As of version 3.0, this feature is implemented in crawler4j. Please visit http://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/multiple/ for an example usage.

Basically, you need to start the controller in non-blocking mode:

controller.startNonBlocking(MyCrawler.class, numberOfThreads);

Then you can add your seeds in a loop. Note that you don't need to start the controller several times in a loop.

Hope it helps!



Source: https://stackoverflow.com/questions/19875771/calling-controller-start-in-loop-in-crawler4j
