How to parse content located in specific HTML tags using a Nutch plugin?

Posted by 大憨熊 on 2019-12-06 01:11:32

Question


I am using Nutch to crawl websites, and I want to parse specific sections of the HTML pages it crawls. For example:

  <h><title> title to search </title></h>
   <div id="abc">
        content to search
   </div>
   <div class="efg">
        other content to search
   </div>

I want to extract the div element with id="abc", the div element with class="efg", and so on.

I know that I have to create a plugin for customized parsing, because the HTML parser plugin provided by Nutch removes all HTML tags, CSS, and JavaScript and leaves only the text content. I referred to this blog post, http://sujitpal.blogspot.in/2009/07/nutch-custom-plugin-to-parse-and-add.html, but it covers parsing by tag name, whereas I want to select tags whose attributes have specific values. Jericho has been mentioned as useful for parsing specific HTML tags, but I could not find any example of a Nutch plugin that uses Jericho.

I need some guidance on a strategy for parsing HTML pages based on tags whose attributes have specific values.


Answer 1:


You can use this plugin to extract data from your pages based on CSS selectors:

https://github.com/BayanGroup/nutch-custom-search

For your example, you can configure it like this:

<config>
    <fields>
        <field name="custom_content" />
    </fields>
    <documents>
        <document url=".+" engine="css">
            <extract-to field="custom_content">
                <text>
                    <expr value="#abc" />
                </text>
                <text>
                    <expr value=".efg" />
                </text>
            </extract-to>
        </document>
    </documents>
</config>
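
If you would rather write your own parse filter than depend on a third-party plugin, the core of it is a DOM walk that keeps the elements whose id or class matches. The following is only a minimal sketch of that walk, using nothing but the JDK's org.w3c.dom classes. As far as I know, Nutch's HtmlParseFilter extension point passes the filter a DocumentFragment built from the same DOM interfaces, so a method like extract could be called from there. The class name AttributeExtractor, its methods, and the XML-based parse helper are hypothetical names made up for this illustration; the helper assumes well-formed markup and merely stands in for the DOM that Nutch's HTML parser would supply.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch or of the plugin linked above.
class AttributeExtractor {

    /** Returns the trimmed text of every element whose attribute contains the given value. */
    static List<String> extract(Node root, String attr, String value) {
        List<String> out = new ArrayList<>();
        walk(root, attr, value, out);
        return out;
    }

    private static void walk(Node node, String attr, String value, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // A class attribute may hold several space-separated tokens; an id holds one.
            // getAttribute returns "" when the attribute is absent, so nothing matches then.
            List<String> tokens = Arrays.asList(el.getAttribute(attr).trim().split("\\s+"));
            if (tokens.contains(value)) {
                out.add(el.getTextContent().trim());
                return; // the matched element's text already includes its descendants
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), attr, value, out);
        }
    }

    /** Demo-only stand-in for Nutch's parser: builds a DOM from well-formed markup. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div id=\"abc\"> content to search </div>"
                + "<div class=\"efg\"> other content to search </div>"
                + "</body></html>";
        Document doc = parse(page);
        System.out.println(extract(doc, "id", "abc"));
        System.out.println(extract(doc, "class", "efg"));
    }
}
```

Inside a real plugin you would presumably call extract on the DocumentFragment that the filter receives and store the result as parse metadata, so that an indexing filter can later add it as a field. That wiring depends on your Nutch version and is not shown here.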


Source: https://stackoverflow.com/questions/17972582/how-to-parse-content-located-in-specific-html-tags-using-nutch-plugin
