Extracting Fetched Web Pages from Nutch in a Map Reduce Friendly Format

Submitted by 自古美人都是妖i on 2019-12-04 16:52:26

The bin/nutch readseg command produces output in a human-readable format, not in map-reduce format. The data itself is stored in segments in map-reduce format. I don't think you can pull that information directly out of the segments in map-reduce format.

A few options to consider:

  1. The segments are themselves map-reduce format files. Can you reuse those?
  2. The output of the readseg command can be converted to map-reduce form by writing a small map-reduce job.
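
As a rough illustration of option 2, the sketch below shows the kind of parsing such a conversion step might do: it turns the text dump into one tab-separated "url, record" line per page, which line-oriented map-reduce input formats can consume. The class and method names are hypothetical, and the "Recno:: " and "URL:: " markers are an assumption about the dump layout, so check them against your own readseg output first. It is plain Java with no Nutch or Hadoop dependency, and a page whose text happens to contain a line starting with the record marker would be split incorrectly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: converts the text dump written by
// bin/nutch readseg into one tab-separated line per record.
class ReadsegDumpConverter {

    // Assumed markers: each record starts with "Recno:: <n>" followed by a
    // "URL:: <url>" line. Verify this against your own dump file.
    static final String RECORD_MARKER = "Recno:: ";
    static final String URL_MARKER = "URL:: ";

    static List<String> toKeyValueLines(String dump) {
        List<String> out = new ArrayList<>();
        String url = null;
        StringBuilder body = new StringBuilder();
        for (String line : dump.split("\n", -1)) {
            if (line.startsWith(RECORD_MARKER)) {
                // A new record begins: emit the previous one, if any.
                flush(url, body, out);
                url = null;
                body.setLength(0);
            } else if (url == null && line.startsWith(URL_MARKER)) {
                url = line.substring(URL_MARKER.length()).trim();
            } else if (url != null && !line.trim().isEmpty()) {
                // Join the record body onto a single line; tabs become spaces
                // so the only tab left is the key/value separator.
                if (body.length() > 0) body.append(' ');
                body.append(line.trim().replace('\t', ' '));
            }
        }
        flush(url, body, out);
        return out;
    }

    private static void flush(String url, StringBuilder body, List<String> out) {
        if (url != null) out.add(url + "\t" + body);
    }

    public static void main(String[] args) {
        String sample = "\nRecno:: 0\nURL:: http://example.com/\n\nContent::\nhello\tworld\n"
                      + "\nRecno:: 1\nURL:: http://example.org/\n\nParseText::\nsecond page\n";
        for (String line : toKeyValueLines(sample)) System.out.println(line);
    }
}
```

The same per-record logic could be moved into a mapper once the input is split on record boundaries; that part is left out here because it depends on the Hadoop version in use.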

The answer lies in tweaking the Nutch source code, which turned out to be quite simple. Navigate to the SegmentReader.java file in apache-nutch-1.4-bin/src/java/org/apache/nutch/segment

Inside the SegmentReader class is a reduce method, which is responsible for generating the human-readable output of the bin/nutch readseg command. Alter the StringBuffer variable dump as you see fit; it holds the entire output for a given URL, which is represented by the key variable.
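
To give an idea of what a custom format might look like, here is a minimal standalone sketch of the formatting logic one could adapt when rewriting how dump is built. It does not reproduce the actual Nutch reduce method: the class name, the method names, and the pageText parameter are hypothetical, and only key and the StringBuffer named dump mirror the names mentioned above. The idea is to write one line per URL, escaping tabs and newlines so each record stays on a single line.

```java
// Hypothetical sketch, independent of Nutch: builds a single-line,
// tab-separated record for one URL.
class SingleLineRecord {

    // Escape characters that would break a line-oriented key/value file.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '\t': sb.append("\\t"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // key stands in for the URL; pageText stands in for whatever part of the
    // segment data you want to keep (an assumption made for this example).
    static String formatRecord(String key, String pageText) {
        StringBuffer dump = new StringBuffer();
        dump.append(escape(key)).append('\t').append(escape(pageText)).append('\n');
        return dump.toString();
    }

    public static void main(String[] args) {
        System.out.print(formatRecord("http://example.com/", "line one\nline two\twith tab"));
    }
}
```

Escaping the backslash itself keeps the encoding reversible, so a downstream job can restore the original text exactly.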

Make sure to run ant to build a new binary; further calls to bin/nutch readseg will then generate the output in your custom format.

These references were extremely useful in navigating the code:
[1] http://nutch.apache.org/apidocs-1.4/overview-summary.html
[2] http://nutch.apache.org/apidocs-1.3/index-all.html
