Empty Nutch crawl list

Submitted by 狂风中的少年 on 2019-12-25 03:09:30

Question


I'm trying to run a crawl using Nutch in Eclipse.

I'm using a file called urls, and it contains

http://www.google.com/

However, when I run the project, the Generator class tells me that:

"0 records selected for fetching, exiting"

How can I solve this issue?

I've followed these guides:

http://wiki.apache.org/nutch/RunNutchInEclipse1.0

http://wiki.apache.org/nutch/NutchTutorial

Any help would be greatly appreciated.


Answer 1:


I recently ran into this issue and found that most responses concerned the (regex|crawl)-urlfilter.txt files. Another thing to check is your '-topN' setting. It limits how many URLs the generator selects, so it needs to be large enough for the URLs that pass all the filters to be selected.
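To illustrate why a too-small '-topN' can produce "0 records selected", here is a toy sketch of a generate step. It is not Nutch's Generator code: the function name `select_for_fetching`, the `(url, score)` pairs, and the `accept_all` filter are all hypothetical, and the only assumption taken from the answer is that selection is capped at the top N candidates that survive filtering.

```python
def select_for_fetching(candidates, url_filter, top_n):
    """Toy model of a generate step (hypothetical, not Nutch's implementation).

    candidates: list of (url, score) pairs
    url_filter: callable returning True when a URL is accepted
    top_n:      maximum number of URLs to select
    """
    # Drop URLs rejected by the filter.
    passed = [(url, score) for url, score in candidates if url_filter(url)]
    # Rank the survivors by score, highest first.
    passed.sort(key=lambda pair: pair[1], reverse=True)
    # Keep at most top_n of them.
    return [url for url, _ in passed[:top_n]]


def accept_all(url):
    """Stand-in filter that accepts every URL."""
    return True


seeds = [("http://www.google.com/", 1.0)]

print(len(select_for_fetching(seeds, accept_all, top_n=0)))   # 0
print(len(select_for_fetching(seeds, accept_all, top_n=10)))  # 1
```

In this model a cap of 0 selects nothing even though the seed passes the filter, which is the symptom described in the question.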

I hope this helps.




Answer 2:


It's most likely your regex-urlfilter.txt. Try using these rules and see if that fixes the problem:

-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|js|ICO|doc|mp3|MP3|DOC|css|rss|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+.
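One way to check whether a seed URL survives rules like these is to simulate them outside Nutch. The sketch below is a rough approximation in Python, not Nutch's own filter. It assumes that rules are tried top to bottom, that the first pattern found anywhere in the URL decides the outcome ('+' accepts, '-' rejects), and that a URL matching no rule is rejected. The helper names `parse_rules` and `accepts` are hypothetical, and Python's `re` stands in for Java's regex engine. The extension rule is written with an escaped dot (`\.`), assuming a literal dot is intended.

```python
import re

# Rules from the answer; the extension rule uses an escaped dot (assumption).
RULES_TEXT = r"""
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|js|ICO|doc|mp3|MP3|DOC|css|rss|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in url."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumed default: reject when nothing matches


rules = parse_rules(RULES_TEXT)
for url in ("http://www.google.com/",
            "http://example.com/logo.png",
            "ftp://example.com/file"):
    print(url, accepts(rules, url))
```

Under these assumptions the seed http://www.google.com/ is accepted by the final `+.` rule, while the image and ftp URLs are rejected by earlier rules. If the seed is rejected with your own rule file, the filter is the likely cause of the empty fetch list.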



Source: https://stackoverflow.com/questions/4479846/empty-nutch-crawl-list
