How to read Nutch content from Java/Scala?

Submitted by 前提是你 on 2019-12-13 03:48:19

Question


I'm using Nutch to crawl some websites (as a process that runs separately from everything else), and I want to use a Java (Scala) program to analyse the websites' HTML data using Jsoup.

I got Nutch to work by following the tutorial (without the script, only executing the individual instructions worked), and I think it's saving the websites' HTML in the crawl/segments/<time>/content/part-00000 directory.

The problem is that I cannot figure out how to actually read the website data (URLs and HTML) in a Java/Scala program. I read this document, but find it a bit overwhelming since I've never used Hadoop.

I tried to adapt the example code to my environment, and this is what I arrived at (mostly by guesswork):

  val reader = new MapFile.Reader(FileSystem.getLocal(new Configuration()), ".../apache-nutch-1.8/crawl/segments/20140711115438/content/part-00000", new Configuration())
  var key = null
  var value = null
  reader.next(key, value) // test for a single value
  println(key)
  println(value)

However, I am getting this exception when I run it:

Exception in thread "main" java.lang.NullPointerException
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1873)
    at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)

I am not sure how to work with a MapFile.Reader, specifically, what constructor parameters I am supposed to pass to it. What Configuration objects am I supposed to pass in? Is that the correct FileSystem? And is that the data file I'm interested in?


Answer 1:


Scala:

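import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{SequenceFile, Text}
import org.apache.nutch.protocol.Content
import org.apache.nutch.util.NutchConfiguration

val conf = NutchConfiguration.create()
val fs = FileSystem.get(conf)
val file = new Path(".../part-00000/data")
val reader = new SequenceFile.Reader(fs, file, conf)

// Lazily read (URL, Content) pairs; next() fills the objects passed in
// and returns false at end of file, which ends the stream.
val webdata = Stream.continually {
  val key = new Text()
  val content = new Content()
  val hasNext = reader.next(key, content)
  (hasNext, key, content)
}.takeWhile(_._1).map { case (_, key, content) => (key, content) }

// Print the first record, if the file contains one
webdata.headOption.foreach(println)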

Java:

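import java.io.IOException;

import org.apache.commons.cli.Options;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class ContentReader {
    public static void main(String[] args) throws IOException {
        Configuration conf = NutchConfiguration.create();
        Options opts = new Options();
        GenericOptionsParser parser = new GenericOptionsParser(conf, opts, args);
        String[] remainingArgs = parser.getRemainingArgs();
        FileSystem fs = FileSystem.get(conf);
        // First non-Hadoop argument: path to a segment directory
        String segment = remainingArgs[0];
        Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
        Text key = new Text();            // filled with the page URL
        Content content = new Content();  // filled with the fetched page
        try {
            // Each call to next() overwrites key and content with the next record
            while (reader.next(key, content)) {
                byte[] raw = content.getContent();
                System.out.write(raw, 0, raw.length);
            }
            System.out.flush();
        } finally {
            reader.close();
        }
    }
}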

Alternatively, you can use org.apache.nutch.segment.SegmentReader (example).
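
Since the goal in the question is to hand the HTML to Jsoup, note that content.getContent() returns raw bytes, not a String. Below is a minimal sketch of one way to decode them, assuming the Content-Type header value is available as a string (for example from the record's metadata). The class ContentDecoder and its methods charsetOf and decode are hypothetical helpers for illustration, not part of Nutch or Hadoop, and the UTF-8 fallback is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentDecoder {

    // Hypothetical helper: pick the charset named in a Content-Type value such as
    // "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption) when
    // no charset is given or the name is unknown.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).replace("\"", "").trim();
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Unknown or malformed charset name: use the default below.
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Decode the raw page bytes into a String using the detected charset.
    static String decode(byte[] raw, String contentType) {
        return new String(raw, charsetOf(contentType));
    }

    public static void main(String[] args) {
        // Stand-in for content.getContent() and its Content-Type header
        byte[] raw = "<html><body>caf\u00e9</body></html>".getBytes(StandardCharsets.ISO_8859_1);
        String html = decode(raw, "text/html; charset=ISO-8859-1");
        System.out.println(html);
    }
}
```

The decoded string could then be passed to Jsoup.parse(html, key.toString()), using the record key (the page URL) as the base URI. Pages that declare their encoding only in a meta tag are not handled by this sketch.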



Source: https://stackoverflow.com/questions/24699305/how-to-read-nutch-content-from-java-scala
