Apache Nutch crawler keeps retrieving only a single URL

Submitted by 倖福魔咒の on 2019-12-25 06:57:48

Question


The INJECT step keeps retrieving only a single URL when I try to crawl CNN. I'm using the default config (my nutch-site.xml is below). What could cause this? Shouldn't it fetch 10 documents, according to my generate.max.count value?

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>crawler1</value>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
        <name>solr.server.url</name>
        <value>http://x.x.x.x:8983/solr/collection1</value>
  </property>
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>generate.max.count</name>
  <value>10</value>
</property>
</configuration>

Answer 1:


A Nutch crawl consists of four basic steps: Generate, Fetch, Parse, and Update DB. These steps are the same in both Nutch 1.x and Nutch 2.x, and executing all four of them completes one crawl cycle.
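The four steps above map onto Nutch's command-line tools. A minimal sketch of one cycle, assuming a Nutch 2.x install run from its home directory with a `urls/` seed directory (paths, the `-topN` value, and the use of `-all` for batch selection are illustrative, not taken from the question):

```shell
bin/nutch inject urls/        # first cycle only: seed the webtable
bin/nutch generate -topN 50   # select URLs that are due for fetching
bin/nutch fetch -all          # fetch the generated batch
bin/nutch parse -all          # parse fetched pages and extract outlinks
bin/nutch updatedb -all       # merge new and updated URLs into the db
```

Repeating everything except the inject step runs further crawl cycles.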

The Injector is the very first step; it adds seed URLs to the crawldb, as stated here and here.

To populate the initial rows of the webtable you can use the InjectorJob, which I reckon you have already run with your seed, i.e. cnn.com.
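For reference, injection reads a directory of plain-text seed files with one URL per line. A sketch, where the directory and file names are assumptions:

```shell
mkdir -p urls
echo "http://www.cnn.com/" > urls/seed.txt
bin/nutch inject urls/
```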

generate.max.count limits the number of URLs to be fetched from a single domain per generate cycle, as stated here.

Now what matters is how many URLs from cnn.com your crawldb contains.

Option 1

If you have generate.max.count = 10 and have seeded or injected more than 10 URLs into the crawldb, then on executing a crawl cycle Nutch should fetch no more than 10 URLs.

Option 2

If you have injected only one URL and performed only one crawl cycle, then on that first cycle you will get only one document processed, because only one URL was in your crawldb. The crawldb is updated at the end of each crawl cycle, so on the second, third, and subsequent cycles Nutch should fetch up to 10 URLs from that specific domain per cycle.
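One related knob worth checking: whether generate.max.count is applied per host or per whole domain is controlled by generate.count.mode. A sketch of the nutch-site.xml fragment (the choice of "domain" here is an assumption, so that all cnn.com subdomains share the limit of 10):

```xml
<property>
  <name>generate.count.mode</name>
  <value>domain</value>
  <description>Whether generate.max.count counts URLs per "host"
  or per "domain".</description>
</property>
```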



Source: https://stackoverflow.com/questions/37353277/apache-nutch-crawler-keeps-retrieve-only-single-url
