Nutch 1.2 - Why won't nutch crawl url with query strings?


Question


I'm new to Nutch and not really sure what is going on here. I run Nutch and it crawls my website, but it seems to ignore URLs that contain query strings. I've commented out the filters in crawl-urlfilter.txt, so that section now looks like this:

# skip urls with these characters
#-[]

#skip urls with slash delimited segment that repeats 3+ times
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

So I think I've effectively removed those filters, and I'm telling Nutch to accept all URLs it finds on my website.

Does anyone have any suggestions? Is this a bug in Nutch 1.2? Should I upgrade to 1.3, and would that fix the issue, or am I doing something wrong?


Answer 1:


See my previous question here: Adding URL parameter to Nutch/Solr index and search results.

The first 'Edit' should answer your question.




Answer 2:


The default crawl-urlfilter.txt also contains this rule, which rejects any URL containing one of the characters ?, *, !, @ or =:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

You have to either comment that line out or modify it so that ? and = are no longer excluded:

# skip URLs containing certain characters as probable queries, etc.
-[*!@]
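
For context, here is a minimal sketch of how the tail of conf/crawl-urlfilter.txt might look after that change (the MY.DOMAIN.NAME accept line is the placeholder from the default file, which you should already have replaced with your own domain):

# skip URLs containing certain characters as probable queries, etc.
# ('?' and '=' removed so URLs with query strings are no longer rejected)
-[*!@]

# accept hosts in MY.DOMAIN.NAME (placeholder from the default file)
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
-.

The regex filter applies these rules top to bottom and the first matching pattern decides, so a URL rejected by an earlier - rule never reaches the + rule for your domain.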



Answer 3:


By default, crawlers are configured not to follow links with query strings, to avoid spam and auto-generated dynamic pages (such as on-site search results) that can trap the crawler.
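
If you want to confirm which URLs your current configuration accepts, Nutch 1.x includes a small helper class, org.apache.nutch.net.URLFilterChecker, that reads URLs from stdin and prints + (accepted) or - (rejected) for each one. A rough sketch, assuming the -allCombined option is available in your build and that bin/nutch can run an arbitrary class name; note that this checker reads conf/regex-urlfilter.txt, while the one-step crawl command swaps in conf/crawl-urlfilter.txt, so apply the same edit to whichever file your crawl actually uses:

# run from the Nutch install directory; prints '+' or '-' for each URL read from stdin
# (the URL below is just an example)
echo "http://www.example.com/page?id=42" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined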



Source: https://stackoverflow.com/questions/7045716/nutch-1-2-why-wont-nutch-crawl-url-with-query-strings
