Question
I want to remove specific elements from the page response before it is handed to Nutch. Specifically, I want to mark parts of my pages with, e.g.:
<div class="noindex">I shall not be indexed</div>
I want to remove these elements before Nutch parses the page, so that "I shall not be indexed" is not present in the NutchDocument afterwards. I plan to surround my navigation, header, and footer content with this markup, because right now they are present in every document in the index.
Thanks, Paul
Answer 1:
You have some alternatives for doing that:
You can write a Nutch plugin that does this. This blog post has an excellent example of writing a Nutch plugin: http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html
You can use a content extractor: http://tomazkovacic.com/blog/122/evaluating-text-extraction-algorithms/ covers several text extraction algorithms. The best way to run one is probably also inside a Nutch plugin.
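Whichever route you take, the core step is the same: walk the parsed DOM and drop the marked elements before the text is extracted. Here is a minimal sketch of that step using only the JDK's org.w3c.dom API. The class name NoIndexStripper and the method strip are made up for illustration; they are not part of Nutch. How you wire this into a plugin depends on your Nutch version and is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not a Nutch class: removes marked elements from a DOM tree.
class NoIndexStripper {

    /** Removes every element under root whose class attribute contains cssClass as a token. */
    static void strip(Node root, String cssClass) {
        // Collect first, then remove, so the tree is not modified while it is being walked.
        List<Node> marked = new ArrayList<>();
        collect(root, cssClass, marked);
        for (Node node : marked) {
            Node parent = node.getParentNode();
            if (parent != null) {
                parent.removeChild(node);
            }
        }
    }

    private static void collect(Node node, String cssClass, List<Node> out) {
        if (node instanceof Element) {
            // getAttribute returns "" when the attribute is absent.
            String classes = ((Element) node).getAttribute("class");
            if (Arrays.asList(classes.trim().split("\\s+")).contains(cssClass)) {
                out.add(node);
                return; // the whole subtree goes away, no need to descend
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), cssClass, out);
        }
    }
}
```

Calling NoIndexStripper.strip(root, "noindex") on the document (or document fragment) node removes each matching div together with its children, so its text no longer appears when the remaining tree is turned into plain text. Matching is done on whole class tokens, so class="nav noindex" is removed while class="noindexed" is left alone.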
Source: https://stackoverflow.com/questions/8576735/apache-nutch-manipulating-the-dom-before-parsing