Adding URL parameter to Nutch/Solr index and search results
I can't find any hint on how to set up Nutch so that it does NOT filter out or strip my URL parameters. I want to crawl and index some pages where lots of content is hidden behind the same base URL (like /news.jsp?id=1, /news.jsp?id=2, /news.jsp?id=3 and so on).

The regex-normalize.xml only removes redundant stuff from the URL (like session IDs and a trailing ?). The regex-urlfilter.txt seems to have a wildcard for my host (+^http://$myHost/).

The crawling works fine so far. Any ideas?

Cheers,
mana

EDIT: Part of the solution is hidden here: "configuring nutch regex-normalize.xml"

# skip URLs containing certain
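
For illustration only: a stock regex-urlfilter.txt usually carries a rule along the lines of -[?*!@=], which would reject any URL containing ? or = before the +^http://$myHost/ line is ever reached. The Python sketch below is not Nutch code. It only imitates the first-match-wins behaviour of the filter file as I understand it, and the host example.com, the two rule sets and the helper names parse_rules and accepts are all assumptions made up for the example.

```python
import re

def parse_rules(text):
    """Parse lines of the form '+regex' or '-regex'; skip blanks and # comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """First rule whose pattern is found anywhere in the URL decides.
    A URL that matches no rule is rejected (assumed to mirror Nutch)."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False

# Assumed stock-like rules: the query-character rule comes first.
DEFAULT_LIKE = r"""
# reject URLs with characters that look like queries (assumed default rule)
-[?*!@=]
+^http://example\.com/
"""

# Same file with the query-character rule commented out.
ADJUSTED = r"""
# -[?*!@=]
+^http://example\.com/
"""

if __name__ == "__main__":
    url = "http://example.com/news.jsp?id=1"
    print(accepts(parse_rules(DEFAULT_LIKE), url))
    print(accepts(parse_rules(ADJUSTED), url))
```

If the real file behaves the same way, commenting out (or narrowing) the query-character rule is what lets /news.jsp?id=1 and its siblings through, while the host rule keeps the crawl on the site.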