How to parse content located in specific HTML tags using a Nutch plugin?
I am using Nutch to crawl websites, and I want to parse specific sections of the HTML pages it fetches. For example:

    <head><title> title to search </title></head>
    <div id="abc"> content to search </div>
    <div class="efg"> other content to search </div>

I want to parse the div elements with id="abc" and class="efg", and so on. I know that I have to create a plugin for customized parsing, because the HTML parser plugin provided by Nutch removes all HTML tags, CSS and JavaScript content and leaves only the text content. I referred to this blog post: http://sujitpal.blogspot.in/2009/07/nutch-custom-plugin-to-parse-and-add.html
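To make the goal concrete, here is a minimal sketch of the DOM walk such a plugin might perform. It uses only the JDK's org.w3c.dom API. The class name TagContentExtractor and the methods extract and parse are hypothetical and not part of Nutch. The assumption is that a custom parse filter (in Nutch 1.x, an HtmlParseFilter implementation, as far as I understand it) receives the parsed page as a DOM node and could pass that node to extract in place of the one built by parse here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper for illustration; not part of the Nutch API. */
final class TagContentExtractor {

    /**
     * Returns the trimmed text of every div whose id equals divId or whose
     * class attribute contains cssClass, in document order. A matching div
     * nested inside another matching div is reported separately as well.
     */
    static List<String> extract(Node root, String divId, String cssClass) {
        List<String> out = new ArrayList<>();
        collect(root, divId, cssClass, out);
        return out;
    }

    private static void collect(Node node, String divId, String cssClass, List<String> out) {
        // Compare tag names case-insensitively, since HTML parsers differ in casing.
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "div".equalsIgnoreCase(node.getNodeName())) {
            Element el = (Element) node;
            // getAttribute returns "" when the attribute is absent.
            boolean idMatch = divId.equals(el.getAttribute("id"));
            boolean classMatch = Arrays
                    .asList(el.getAttribute("class").trim().split("\\s+"))
                    .contains(cssClass);
            if (idMatch || classMatch) {
                out.add(el.getTextContent().trim());
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), divId, cssClass, out);
        }
    }

    /**
     * Demo-only stand-in for the DOM a crawler would supply: parses
     * well-formed markup with the JDK XML parser (not a real HTML parser).
     */
    static Node parse(String wellFormedMarkup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup is not well-formed", e);
        }
    }
}
```

With the sample page above wrapped in html and body tags, TagContentExtractor.extract(TagContentExtractor.parse(page), "abc", "efg") would return the two strings "content to search" and "other content to search". Inside a plugin, the extracted text could then be stored as parse metadata and indexed as a separate field.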