Removing menus from HTML during crawl or indexing with Nutch and Solr
I am crawling our large website(s) with Nutch and then indexing with Solr, and the results are pretty good. However, several menu structures across the site get indexed and spoil the results of a query. Each of these menus is clearly defined in a DIV, e.g. <div id="RHBOX"> ... </div> or <div id="calendar"> ... </div>, and there are several others. I need to delete the content of these DIVs at some point. I am guessing that the right place is during indexing by Solr, but I cannot work out how. A pattern would look something like (<div id="calendar">).*?(<\/div>) but I cannot get that to work in
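
To illustrate what the pattern is meant to do, here is a minimal sketch in plain Python, outside of Nutch and Solr. The helper name `strip_div` and the sample HTML are hypothetical. It assumes the menu DIVs span several lines and contain no nested DIVs. Under those assumptions the non-greedy pattern only matches when `.` is allowed to match newlines (the DOTALL flag, or the inline `(?s)` prefix), which may be one reason the pattern above finds nothing.

```python
import re


def strip_div(html: str, div_id: str) -> str:
    """Remove <div id="div_id"> ... </div> using a non-greedy match.

    Assumption: the target div contains no nested <div> elements. With
    nesting, the match would end at the first inner </div>.
    """
    pattern = re.compile(
        r'<div id="' + re.escape(div_id) + r'">.*?</div>',
        re.DOTALL,  # let "." match newlines so multi-line menus are covered
    )
    return pattern.sub("", html)


# Hypothetical page fragment with a multi-line calendar menu.
sample = (
    '<p>Intro</p>\n'
    '<div id="calendar">\n'
    '  <a href="/may">May</a>\n'
    '</div>\n'
    '<p>Body text</p>'
)

cleaned = strip_div(sample, "calendar")
print(cleaned)
```

A regular expression is only a rough fit here: if a menu DIV contains other DIVs, an HTML parser that understands nesting would be the safer choice.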