Why is crawler4j hanging randomly?

Submitted by 天涯浪子 on 2019-12-12 01:58:12

Question


I've been using crawler4j for a few months now. I recently started noticing that it hangs on some sites and never returns. The recommended solution is to set resumable crawling to true, but that is not an option for me because I am limited on disk space. I ran multiple tests and noticed that the hang is very random: it crawls between 90 and 140 URLs and then stops. I thought it might be the site, but there is nothing suspicious in the site's robots.txt and all pages respond with 200 OK. I know the crawler hasn't crawled the entire site, otherwise it would shut down. What could be causing this, and where should I start?

What's interesting is that I start the crawlers with the non-blocking call and follow it with a while loop that checks the status:

controller.startNonBlocking(CrawlProcess.class, numberOfCrawlers);

while(true){
  System.out.println("While looping");
}

When the crawler hangs, the while loop also stops responding, even though the thread is still alive; in other words, the whole thread becomes unresponsive, so I am unable to send a shutdown command.
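For reference, a common pattern is to poll the controller instead of spinning, and to fall back to an explicit shutdown after a deadline. Below is a minimal sketch; it assumes the CrawlController in use exposes isFinished(), shutdown() and waitUntilFinish(), the 10-minute deadline is only illustrative, and it only helps as long as the monitoring thread itself is not the one that is blocked:

controller.startNonBlocking(CrawlProcess.class, numberOfCrawlers);

long start = System.currentTimeMillis();
while (!controller.isFinished()) {
    System.out.println("Still crawling...");
    try {
        Thread.sleep(5000);                        // poll instead of busy-spinning
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
    }
    if (System.currentTimeMillis() - start > 10 * 60 * 1000) {
        controller.shutdown();                     // ask the crawler threads to stop
        controller.waitUntilFinish();              // block until they actually exit
        break;
    }
}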

UPDATE: I figured out what is causing it to hang. I run a store-to-MySQL step in the visit method. The step looks like this:

public void insertToTable(String dbTable, String url2, String cleanFileName, String dmn, String AID, 
        String TID, String LID, String att, String ttl, String type, String lbl, String QL,
        String referrer, String DID, String fp_type, String ipAddress, String aT, String sNmbr) throws SQLException, InstantiationException, IllegalAccessException, ClassNotFoundException{
    try{
        String strdmn = "";
        if(dmn.contains("www")){
            strdmn = dmn.replace("http://www.","");
        }else{
            strdmn = dmn.replace("http://","");
        }
        // the query is built by string concatenation from the crawled values
        String query = "INSERT INTO "+dbTable
                +" (url,filename, dmn, AID, TID, LID, att, ttl, type, lbl, tracklist, referrer, DID, searchtype, description, fp_type, ipaddress," +
                " aT, sNmbr, URL_Hash, iteration)VALUES('"
                +url2+"','"+cleanFileName+"','"+strdmn+"','"+AID+"','"+TID+"','"+LID+"','"+att+"','"+ttl+"','"+type+"'" +
                ",'"+lbl+"','"+QL+"','"+dmn+"','"+DID+"','spider','"+cleanFileName+"','"+fp_type+"'," +
                "'"+ipAddress+"','"+aT+"','"+sNmbr+"',MD5('"+url2+"'), 1) ON DUPLICATE KEY UPDATE iteration = iteration + 1";
        Statement st2 = null;
        con = DbConfig.openCons();          // connection comes from the c3p0 pool
        st2 = con.createStatement();
        st2.executeUpdate(query);
        //st2.execute("SELECT NOW()");
        // note: the closes below are only reached if executeUpdate succeeds;
        // a SQLException is declared thrown by the method and would skip them
        st2.close();
        con.close();
        if(con.isClosed()){
            System.out.println("CON is CLOSED");
        }else{
            System.out.println("CON is OPEN");
        }
        if(st2.isClosed()){
            System.out.println("ST is CLOSED");
        }else{
            System.out.println("ST is OPEN");
        }
    }catch(NullPointerException npe){
        System.out.println("NPE: " + npe);
    }
}

What's very interesting is that when I run st2.execute("SELECT NOW()") instead of the current st2.executeUpdate(query), it works fine and crawls the site without hanging. But for some reason st2.executeUpdate(query) causes it to hang after a few queries. It doesn't look like MySQL rejecting the statement, because no exceptions are thrown. I thought maybe I'm hitting "too many connections" on MySQL, but that isn't the case either. Does my process make sense to anyone?
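One way to make a silent hang in the JDBC layer visible is to put a timeout on the statement, so a blocked insert surfaces as an exception instead of freezing the crawler thread. A minimal sketch reusing con, query and url2 from the method above; the 30-second value is only illustrative, and it works provided the MySQL driver honors Statement.setQueryTimeout:

Statement st2 = con.createStatement();
st2.setQueryTimeout(30);               // abort the statement if it runs longer than 30 seconds
try {
    st2.executeUpdate(query);
} catch (SQLException e) {
    // a timeout (or any other failure) now shows up here instead of hanging silently
    System.out.println("Insert failed or timed out for " + url2 + ": " + e);
} finally {
    st2.close();
    con.close();                       // return the connection to the pool in all cases
}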


Answer 1:


The importance of a finally block.

The crawler uses c3p0 pooling to insert into MySQL, and after a few queries it would stop responding. Thanks to @djechlin's advice, it turned out to be a connection leak in c3p0. I added a finally block like the one below and it works great now!

try{
    // the insert method is here
}catch(SQLException e){
    e.printStackTrace();
}finally{
    // always release the JDBC resources, even when the insert throws,
    // so the c3p0 pool gets them back
    if(st != null){
        st.close();
    }
    if(rs != null){
        rs.close();
    }
}
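For reference, on Java 7 and later the same guarantee can be written with try-with-resources, which closes the statement and the connection (java.sql.Statement and java.sql.Connection) even if the insert throws; with a c3p0 pool, closing the connection simply returns it to the pool. A minimal sketch, assuming DbConfig.openCons() hands out a pooled connection and query is the INSERT string built in the question:

try (Connection con = DbConfig.openCons();
     Statement st = con.createStatement()) {
    st.executeUpdate(query);           // both resources are closed even if this throws
} catch (SQLException e) {
    e.printStackTrace();
}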


Source: https://stackoverflow.com/questions/24807637/why-is-crawler4j-hanging-randomly
