How do I save the original HTML files with Apache Nutch?

野的像风 2020-12-06 14:13

I'm new to search engines and web crawlers. I want to store all the original pages of a particular website as HTML files, but with Apache Nutch I can only get the bina

5 answers
  •  半阙折子戏
    2020-12-06 14:46

    You need to modify the Nutch source code and run Nutch from Eclipse.

    Once you can run it, open Fetcher.java and add the lines between the "content saver" comment lines.

    case ProtocolStatus.SUCCESS:        // got a page
                pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS, fit.outlinkDepth);
                updateStatus(content.getContent().length);

                //------------------------------------------- content saver ---------------------------------------------\\
                String filename = "savedsites/" + content.getUrl().replace('/', '-');
    
                File file = new File(filename);
                file.getParentFile().mkdirs();
                boolean exist = file.createNewFile();
                if (!exist) {
                    System.out.println("File exists.");
                } else {
                    FileWriter fstream = new FileWriter(file);
                    BufferedWriter out = new BufferedWriter(fstream);
                    out.write(content.toString().substring(content.toString().indexOf("
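
The same idea can be sketched outside Nutch as a small standalone helper. This is only an illustration under assumptions: `PageSaver`, `fileNameFor`, and `save` are hypothetical names, not Nutch APIs. It writes the raw bytes (what `content.getContent()` yields in the snippet above) instead of `content.toString()`, and it replaces every character outside `[A-Za-z0-9._-]` with `-`, not only `/`, so each URL maps to one flat file name. As in the snippet above, an existing file is left untouched.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; it is not part of Nutch.
final class PageSaver {
    private PageSaver() {}

    // Map a URL to a flat file name by replacing every character that is
    // not a letter, digit, dot, underscore, or hyphen with '-'.
    // (Assumption: the answer's snippet replaces only '/'.)
    static String fileNameFor(String url) {
        return url.replaceAll("[^A-Za-z0-9._-]", "-");
    }

    // Write the fetched bytes to dir/<file name derived from url>.
    // If the file already exists it is not overwritten.
    static Path save(String url, byte[] body, Path dir) throws IOException {
        Files.createDirectories(dir);            // create the output directory if needed
        Path target = dir.resolve(fileNameFor(url));
        if (Files.notExists(target)) {
            Files.write(target, body);           // raw bytes, no charset conversion
        }
        return target;
    }
}
```

Inside the "content saver" section, a call might look like `PageSaver.save(content.getUrl(), content.getContent(), Paths.get("savedsites"))`, assuming the helper is compiled alongside Fetcher.java and `java.nio.file.Paths` is imported.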
