How to increase number of documents fetched by Apache Nutch crawler

m5khan

One crawl cycle consists of four steps: Generate, Fetch, Parse, and Update DB. For detailed information, read my answer here.
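To make the cycle concrete, here is a minimal sketch using the individual Nutch 1.x commands; the urls seed directory and crawl output directory are example names:

# Inject seed URLs into the crawldb (done once, before the first cycle)
bin/nutch inject crawl/crawldb urls
# Generate: select URLs from the crawldb into a new segment's fetch list
bin/nutch generate crawl/crawldb crawl/segments
# Grab the segment directory that generate just created
s1=$(ls -d crawl/segments/2* | tail -1)
# Fetch and Parse the pages in that segment
bin/nutch fetch "$s1"
bin/nutch parse "$s1"
# Update DB: merge fetch results and newly discovered links back into the crawldb
bin/nutch updatedb crawl/crawldb "$s1"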

A limited number of fetched URLs can be caused by the following factors:

Number of Crawl cycles:

If you execute only one crawl cycle, you will get few results, because only the URLs injected (seeded) into the crawldb are fetched initially. On successive crawl cycles, your crawldb is updated with new URLs extracted from previously fetched pages.
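The easiest way to run several cycles is the bin/crawl wrapper script, whose last argument is the number of rounds. The exact option set varies between Nutch releases, so treat this as a sketch and check the script's usage message:

# Run 5 full Generate/Fetch/Parse/Update DB rounds from the urls seed directory
bin/crawl urls crawl 5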

topN value:

As mentioned here and here, the topN value causes Nutch to fetch only a limited number of URLs in each cycle. If you use a small topN value, you will get fewer pages.
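With the individual commands, topN is passed to the generate step; for example, to allow up to 1000 URLs in a cycle:

# Put at most 1000 of the top-scoring URLs into this cycle's fetch list
bin/nutch generate crawl/crawldb crawl/segments -topN 1000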

generate.max.count

generate.max.count in your Nutch configuration file, i.e. nutch-default.xml or nutch-site.xml, limits the number of URLs to be fetched from a single domain, as stated here.
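A sketch of the relevant properties in conf/nutch-site.xml; the values below are examples, and generate.count.mode controls whether the limit is applied per host or per domain:

<!-- Count URLs per domain (the default mode counts per host) -->
<property>
  <name>generate.count.mode</name>
  <value>domain</value>
</property>
<!-- Allow at most 100 URLs per domain in each generated fetch list; -1 means unlimited -->
<property>
  <name>generate.max.count</name>
  <value>100</value>
</property>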


To answer your second question, on how to count the number of pages crawled per day: you can read the log files and accumulate the per-day page counts from them.

In Nutch 1.x, the log file is generated in the logs folder at NUTCH_HOME/logs/hadoop.log.

You can count the log lines that match a given date and the status "fetching", like this:

grep -i '2016-05-26.*fetching' logs/hadoop.log | wc -l
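To get a per-day breakdown of the whole log in one pass, you can group the "fetching" lines by their leading date field. This assumes the default log layout, where every line starts with a YYYY-MM-DD timestamp:

# Count "fetching" lines per day: extract the date column, then tally duplicates
grep -i fetching logs/hadoop.log | cut -d' ' -f1 | sort | uniq -c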
