How to schedule crawler4j crawl control to run periodically?

Posted by 五迷三道 on 2019-12-12 01:48:26

Question


I'm using crawler4j to build a simple web crawler. I want to invoke the crawl control every 10 minutes. I created a servlet that starts when my Tomcat server starts, and in the servlet I use ScheduledExecutorService for the scheduling. However, the crawl control only fetches data once, not every 10 minutes as I intended. Is there a better way to schedule my crawl to run every 10 minutes? Below is the code in my servlet.

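import static java.util.concurrent.TimeUnit.MINUTES;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;

public class ScheduleControl extends HttpServlet {

    private static final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);

    @Override
    public void init() throws ServletException {

        // Task that launches one crawl by calling the crawl controller's main method.
        final Runnable crawler = new Runnable() {

            @Override
            public void run() {
                String[] args = {"/Users/kevin/Desktop", "7"};
                try {
                    SaleCrawlControl.main(args);
                } catch (Exception e) {
                    System.out.println("Exception " + e);
                }
            }
        };

        // Start immediately, then repeat every 10 minutes.
        final ScheduledFuture<?> crawlerHandle = scheduler.scheduleAtFixedRate(crawler, 0, 10, MINUTES);

        // Cancel the periodic crawl and shut the scheduler down after 60 minutes.
        scheduler.schedule(new Runnable() {
            @Override
            public void run() {
                crawlerHandle.cancel(true);
                scheduler.shutdown();
            }
        }, 60, MINUTES);
    }
}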

Answer 1:


Crawler4j version 3.6 and later include fixes that resolve this issue. I was using version 3.5, which is why I ran into it. After I upgraded to version 4.1, the scheduled crawl worked.
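
Separately from the version upgrade, a general property of scheduleAtFixedRate can also produce a "runs only once" symptom, so it may be worth ruling out: the ScheduledExecutorService documentation states that if any execution of the task throws, subsequent executions are suppressed. A task that catches Exception still lets an Error through. The following is a minimal sketch of guarding a periodic task against that, not a statement about what went wrong in crawler4j 3.5. The class name PeriodicCrawlSketch, the helpers guarded and countRuns, and the simulated failing task are all hypothetical and are not part of crawler4j or the code in the question.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: none of these names come from crawler4j or the question.
class PeriodicCrawlSketch {

    /** Wraps a task so that no Throwable escapes to the scheduler. */
    static Runnable guarded(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (Throwable t) {
                // An uncaught throwable would suppress every later run of a
                // scheduleAtFixedRate task, so report it and carry on.
                System.err.println("Scheduled run failed: " + t);
            }
        };
    }

    /** Schedules the task at a fixed rate and returns how many runs started. */
    static int countRuns(Runnable task, long periodMillis, long totalMillis) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        AtomicInteger runs = new AtomicInteger();
        scheduler.scheduleAtFixedRate(() -> {
            runs.incrementAndGet();
            task.run();
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(totalMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        // Stand-in for a crawl that fails on every run (an assumption for the demo).
        Runnable failing = () -> {
            throw new IllegalStateException("simulated crawl failure");
        };
        System.out.println("unguarded runs: " + countRuns(failing, 20, 300));
        System.out.println("guarded runs:   " + countRuns(guarded(failing), 20, 300));
    }
}
```

In this sketch the unguarded task is started once and then never rescheduled, while the guarded one keeps firing at each period. Applied to the servlet in the question, the equivalent would be passing a guarded wrapper of the crawler Runnable to scheduleAtFixedRate.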



Source: https://stackoverflow.com/questions/28636610/how-to-schedule-crawler4j-crawl-control-to-run-periodically
