How do I save the original HTML files with Apache Nutch?

野的像风 2020-12-06 14:13

I'm new to search engines and web crawlers. Now I want to store all the original pages in a particular web site as html files, but with Apache Nutch I can only get the bina

5 answers

北荒 (original poster)

2020-12-06 14:35

An update to this answer:

It is possible to post-process the data in the segment folders of your crawl and read the HTML (along with the other data Nutch has stored) directly.

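    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;
    import org.apache.nutch.util.NutchConfiguration;

    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // segment: path to one Nutch segment directory (not defined in this snippet)
    Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);

    try
    {
            Text key = new Text();            // the page URL
            Content content = new Content();  // raw fetched bytes plus metadata

            while (reader.next(key, content))
            {
                    // getContent() returns raw bytes; new String(...) decodes with the platform default charset
                    System.out.println(new String(content.getContent()));
            }
    }
    catch (Exception e)
    {
            e.printStackTrace();  // report read errors instead of silently swallowing them
    }
    finally
    {
            reader.close();
    }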
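
Since the question is about saving the pages as html files rather than printing them, here is a rough sketch of how the loop body could write each page to disk. It is not part of the original answer: `PageSaver`, `toFileName` and `savePage` are hypothetical names made up for illustration, and the sketch uses only `java.nio` from the JDK, so it does not depend on any Nutch or Hadoop API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Nutch: saves raw page bytes as .html files.
final class PageSaver {

    // Turn a URL into a file name that is safe on common file systems.
    // Different URLs can look the same after sanitizing, so a hash of the
    // full URL is appended to keep the names apart in most cases.
    static String toFileName(String url) {
        String safe = url.replaceAll("[^A-Za-z0-9._-]", "_");
        if (safe.length() > 100) {
            safe = safe.substring(0, 100);
        }
        return safe + "-" + Integer.toHexString(url.hashCode()) + ".html";
    }

    // Write the bytes unchanged, so the file matches what was fetched
    // and no charset guessing is involved.
    static Path savePage(Path outDir, String url, byte[] rawBytes) throws IOException {
        Files.createDirectories(outDir);
        Path target = outDir.resolve(toFileName(url));
        return Files.write(target, rawBytes);
    }
}
```

Inside the `while` loop you would then call something like `PageSaver.savePage(outDir, key.toString(), content.getContent())` instead of `System.out.println(...)`. Note that `java.nio.file.Path` and `org.apache.hadoop.fs.Path` share a simple name, so one of them has to be written fully qualified if both are used in the same file.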
