How to increase number of documents fetched by Apache Nutch crawler

Question


I am using Apache Nutch 2.3 for crawling. There were about 200 URLs in the seed list at the start. As time has elapsed, the number of documents crawled has decreased, or at most stayed the same as at the start.

How can I configure Nutch so that the number of crawled documents increases? Is there any parameter that controls the number of documents? Second, how can I count the number of documents crawled per day by Nutch?


Answer 1:


One crawl cycle consists of four steps: Generate, Fetch, Parse and Update DB. For detailed information, read my answer here.
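
As a rough sketch, one cycle driven by hand with the Nutch 2.x command-line tools looks roughly like the following (exact flags can differ between releases, so check bin/nutch &lt;command&gt; for the usage on your version; the seed directory name and topN value are just examples):

bin/nutch inject urls/            # seed the db with the URLs listed under urls/
bin/nutch generate -topN 1000     # select up to 1000 URLs as a new batch
bin/nutch fetch -all              # fetch the generated batch
bin/nutch parse -all              # parse fetched pages and extract outlinks
bin/nutch updatedb -all           # write newly discovered URLs back to the db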

A limited number of fetched URLs can be caused by the following factors:

Number of Crawl cycles:

If you execute only one crawl cycle, you will get few results, since initially only the URLs injected (seeded) into the crawldb are fetched. On successive crawl cycles your crawldb is updated with new URLs extracted from previously fetched pages.
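
A hedged example of running several cycles in one go with the bundled crawl script (in Nutch 2.x its usage is roughly bin/crawl &lt;seedDir&gt; &lt;crawlId&gt; &lt;numberOfRounds&gt;; some releases also take a Solr URL argument, so check the script in your distribution):

bin/crawl urls/ myCrawl 5    # run 5 generate/fetch/parse/updatedb rounds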

topN value:

As mentioned here and here, the topN value causes Nutch to fetch only a limited number of URLs in each cycle. If you use a small topN value, you will get fewer pages.
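
For example, if you drive the cycle yourself, the fetchlist size is whatever you pass to the generate step (the value 5000 below is only an illustration):

bin/nutch generate -topN 5000    # allow up to 5000 URLs in this batch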

generate.max.count

generate.max.count in your Nutch configuration file, i.e. nutch-default.xml or nutch-site.xml, limits the number of URLs that are fetched from a single domain, as stated here.
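
As a sketch of a nutch-site.xml override (generate.max.count and its companion generate.count.mode are defined in nutch-default.xml; the values below are only illustrative, and you should double-check the property names against your version):

&lt;property&gt;
  &lt;name&gt;generate.max.count&lt;/name&gt;
  &lt;value&gt;100&lt;/value&gt;           &lt;!-- -1 means unlimited --&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;generate.count.mode&lt;/name&gt;
  &lt;value&gt;domain&lt;/value&gt;        &lt;!-- count URLs per domain (or per host) --&gt;
&lt;/property&gt;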


For your second question, on how to count the number of pages crawled per day: you can read the log files and accumulate the number of pages crawled each day from them.

In Nutch 1.x the log file is generated at NUTCH_HOME/logs/hadoop.log.

You can count the log lines that match a given date and the word "fetching" like this:

grep -ic '2016-05-26.*fetching' logs/hadoop.log
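
If you want a per-day breakdown over the whole log rather than a single date, you can group on the leading date field (this assumes the default log layout where each line starts with a yyyy-MM-dd timestamp):

grep -i 'fetching' logs/hadoop.log | awk '{print $1}' | sort | uniq -c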



Source: https://stackoverflow.com/questions/30364242/how-to-increase-number-of-documents-fetched-by-apache-nutch-crawler
