How do I save the original HTML file with Apache Nutch?


野的像风 2020-12-06 14:13

I'm new to search engines and web crawlers. I want to store all the original pages of a particular website as HTML files, but with Apache Nutch I can only get the bina

5 answers

半阙折子戏 (original poster)

2020-12-06 14:46

You need to modify the Nutch source code, so first set up Nutch to run in Eclipse.

Once it runs, open Fetcher.java and add the lines between the "content saver" comment lines.
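The saving step on its own can be sketched as a small standalone helper. This is only an illustration under assumptions: `PageSaver`, `toFileName`, and `save` are hypothetical names that are not part of Nutch, and the file-name scheme (every character outside letters, digits, `.`, `_`, and `-` becomes `-`) is one possible choice, not the one used in the patch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Nutch: writes the raw bytes of a fetched
// page into a directory, using a file name derived from the page URL.
final class PageSaver {
    private PageSaver() {}

    // Replace every character that is unsafe in a file name with '-'.
    // Assumption: collisions between similar URLs are acceptable here.
    static String toFileName(String url) {
        return url.replaceAll("[^A-Za-z0-9._-]", "-");
    }

    // Create the directory if needed, then write the bytes unchanged.
    // An existing file with the same name is overwritten.
    static Path save(Path dir, String url, byte[] rawBytes) throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve(toFileName(url));
        return Files.write(target, rawBytes);
    }
}
```

Inside the fetcher, a call might look like `PageSaver.save(Paths.get("savedsites"), content.getUrl(), content.getContent())`. Writing the byte array directly, rather than a string form of the content object, keeps the page in whatever encoding the server sent.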

case ProtocolStatus.SUCCESS:        // got a page
            pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS, fit.outlinkDepth);
            updateStatus(content.getContent().length);


            //------------------------------------------- content saver ---------------------------------------------\\
            // Flatten the URL into a single file name under the "savedsites" directory.
            String filename = "savedsites/" + content.getUrl().replace('/', '-');

            File file = new File(filename);
            file.getParentFile().mkdirs();
            // createNewFile() returns true only if the file did not exist and was created.
            boolean created = file.createNewFile();
            if (!created) {
                System.out.println("File exists.");
            } else {
                FileWriter fstream = new FileWriter(file);
                BufferedWriter out = new BufferedWriter(fstream);
                out.write(content.toString().substring(content.toString().indexOf("