Apache Nutch not adding internal links in a web page to fetchlist

Posted by 对着背影说爱祢 on 2019-12-24 15:41:23

Question


I am using Apache Nutch 1.7 and have a problem crawling with http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 as the seed URL. The page contains many internal links as well as many external links to other domains; I am only interested in the internal links.

However, when this page is crawled, its internal links are not added to the fetch list for the next round (I have given a depth of 100). I have already set db.ignore.internal.links to false, but for some reason the internal links still do not make it into the next round's fetch list.

On the other hand, if I set db.ignore.external.links to false, the crawler correctly picks up all the external links from the page.

This problem does not occur with any other domain. Can someone tell me what is different about this particular page?

I have also attached the nutch-site.xml that I am using for your review. Please advise.


Answer 1:


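Your seed URL is being rejected by the default URL filters, so the page is never crawled. The URL ends in ?_rdc=1, and both ? and = are in the character class of the default rule that skips probable query URLs.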

Edit the following files:

conf/automaton-urlfilter.txt

conf/regex-urlfilter.txt

Replace

# skip URLs containing certain characters as probable queries, etc.
-.*[?*!@=].*

With

# skip URLs containing certain characters as probable queries, etc.
-.*[*!@].*
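
To see why the edit matters, here is a minimal Python sketch of how the two rule sets treat the seed URL. It is not Nutch code. It assumes that the regex URL filter checks rules in order, that the first pattern found anywhere in the URL decides the outcome (+ accepts, - rejects), and that a URL matching no rule is rejected. The function name first_match_decision is hypothetical, and the trailing "+." catch-all rule is assumed to be present at the end of the filter file.

```python
import re

# Seed URL from the question.
SEED = "http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1"

def first_match_decision(url, rules):
    """Hypothetical helper: emulate a first-match-wins URL filter.

    Each rule is a sign ('+' accept, '-' reject) followed by a regex.
    Assumption: the first rule whose regex is found in the URL decides,
    and a URL that matches no rule is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False

# Simplified rule sets: only the rule under discussion plus an assumed
# catch-all accept rule.
DEFAULT_RULES = ["-.*[?*!@=].*", "+."]
EDITED_RULES = ["-.*[*!@].*", "+."]

print(first_match_decision(SEED, DEFAULT_RULES))  # False: '?' and '=' trigger the reject rule
print(first_match_decision(SEED, EDITED_RULES))   # True: falls through to the catch-all
```

With the original rule the seed is rejected before it is ever fetched; with the edited rule it passes. Note that removing ? and = from the character class also admits every other query-string URL on the site, so the crawl may grow considerably.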


Source: https://stackoverflow.com/questions/19373714/apache-nutch-not-adding-internal-links-in-a-web-page-to-fetchlist
