How to parse content located in specific HTML tags using nutch plugin?

冷暖自知 提交于 2019-12-04 06:53:50

You can use this plugin to extract data from your pages based on css rules:

https://github.com/BayanGroup/nutch-custom-search

In your example, you can configure it in this way:

<config>
    <fields>
        <field name="custom_content" />
    </fields>
    <documents>
        <document url=".+" engine="css">
            <extract-to field="custom_content">
                <text>
                    <expr value="#abc" />
                </text>
                <text>
                    <expr value=".efg" />
                </text>
            </extract-to>
        </document>
    </documents>
</config>
标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!