How do I save the original HTML files with Apache Nutch?

野的像风 2020-12-06 14:13

I'm new to search engines and web crawlers. I want to store all the original pages of a particular website as HTML files, but with Apache Nutch I can only get the bina

5 answers
  •  北荒 (original poster)
    2020-12-06 14:35

    To update this answer -

    You can post-process the data in your crawl's segment folder and read the HTML (along with the other data Nutch has stored) directly.

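        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;
        import org.apache.nutch.protocol.Content;
        import org.apache.nutch.util.NutchConfiguration;
    
        // "segment" is the path of one segment directory; it must be defined by the caller.
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
    
        Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
    
        try
        {
                Text key = new Text();            // filled with the page URL
                Content content = new Content();  // filled with the fetched page
    
                while (reader.next(key, content)) 
                {
                        // getContent() returns the raw bytes; new String(...) decodes them
                        // with the platform default charset.
                        System.out.println(new String(content.getContent()));
                }
        }
        catch (Exception e)
        {
                e.printStackTrace();  // report the failure instead of swallowing it
        }
        finally
        {
                reader.close();
        }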
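
To get one HTML file per page rather than console output, the loop body could hand each record to a small helper like the one below. This is only a minimal sketch that uses nothing but the JDK: `PageSaver`, `fileNameFor`, and `save` are hypothetical names, not part of Nutch, and the file-naming scheme is just one possible choice.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not a Nutch class): writes one fetched page to one file.
class PageSaver {

    // Turn a URL into a file name that is safe on common file systems:
    // every character outside [A-Za-z0-9._-] becomes '_'.
    public static String fileNameFor(String url) {
        String safe = url.replaceAll("[^A-Za-z0-9._-]", "_");
        if (safe.length() > 200) {
            safe = safe.substring(0, 200);  // keep names below typical length limits
        }
        return safe + ".html";
    }

    // Write the raw bytes unchanged, so the page's original encoding is preserved.
    public static Path save(Path outDir, String url, byte[] rawHtml) throws IOException {
        Files.createDirectories(outDir);
        Path target = outDir.resolve(fileNameFor(url));
        return Files.write(target, rawHtml);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; in the Nutch loop these would come from key and content.
        Path outDir = Files.createTempDirectory("nutch-pages");
        byte[] page = "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8);
        Path written = save(outDir, "http://example.com/index.html", page);
        System.out.println(written);
    }
}
```

Inside the `while` loop this would be called as `PageSaver.save(outDir, key.toString(), content.getContent())`, where `outDir` is an output directory of your choosing. Because the Nutch snippet already imports `org.apache.hadoop.fs.Path`, `outDir` would have to be declared there with the fully qualified type `java.nio.file.Path`. Distinct URLs can map to the same name after sanitizing or truncation, so a real crawl may need something like a hash suffix to keep the names unique.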
    
