Where is the crawled data stored when running nutch crawler?

Submitted by 巧了我就是萌 on 2019-12-04 09:51:59

Nutch stores the crawled data in binary form under the crawl directory, split across the crawldb, linkdb, and segments subdirectories. After your crawl is over, you can use the bin/nutch dump command to dump all the fetched pages in plain HTML format.

The usage is as follows:

$ bin/nutch dump [-h] [-mimetype <mimetype>] [-outputDir <outputDir>]
       [-segment <segment>]
 -h,--help                show this help message
 -mimetype <mimetype>     an optional list of mimetypes to dump, excluding
                          all others. Defaults to all.
 -outputDir <outputDir>   output directory (which will be created) to host
                          the raw data
 -segment <segment>       the segment(s) to use

So, for example, you could do something like:

$ bin/nutch dump -segment crawl/segments -outputDir crawl/dump/

This would create a new directory at the -outputDir location and dump all the crawled pages in HTML format.
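If you want a bit more control, a small wrapper script can dump each timestamped segment into its own output directory and restrict the dump to HTML pages via -mimetype. This is only a sketch: it assumes the crawl/segments/<timestamp> layout from the example above, and the text/html filter value is an illustration, not the only option.

```shell
#!/bin/sh
# Sketch: dump each Nutch segment separately, keeping only text/html content.
# Assumes segments live under crawl/segments/ (one timestamped dir per fetch).
for seg in crawl/segments/*; do
  # One output dir per segment, named after the segment's timestamp.
  out="crawl/dump/$(basename "$seg")"
  bin/nutch dump -segment "$seg" -outputDir "$out" -mimetype text/html
done
```

Dumping per segment keeps the output directories small and lets you re-dump a single fetch cycle without touching the others.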

There are many more ways of dumping specific data out of Nutch; have a look at https://wiki.apache.org/nutch/CommandLineOptions
