How to parse HTML with Nutch and index a specific tag to Solr?

為{幸葍}努か submitted on 2019-12-30 10:08:42

Question


I have installed Nutch and Solr to crawl a website and search it. As you know, we can index the meta tags of web pages into Solr with Nutch's parse meta tags plugin (http://wiki.apache.org/nutch/IndexMetatags). Now I want to know whether there is any way (a plugin or anything else) to index another HTML tag that isn't a meta tag into Solr, like this:

<div id=something>
      me specific tag
</div>

That is, I want to add a field (something) to Solr that has the value "me specific tag" for this page.

Any ideas?


Answer 1:


I wrote my own plugin for something similar to what you want. The config file that maps a NutchDocument to a SolrDocument is $NUTCH_HOME/conf/solrindex-mapping.xml. You can add your own tags there, but you still have to fill those tags with values somewhere.

Here are some tips for the plugin:

  • Read http://wiki.apache.org/nutch/WritingPluginExample, which shows a simple way to build a plugin.
  • In your plugin, extend ParseFilter and IndexingFilter.
  • In YourParseFilter, you can use NodeWalker to find your specific div.
  • Put the parsed information into the page metadata like this:

    page.putToMetadata(new Utf8("yourKEY"), ByteBuffer.wrap(YourByteArrayParsedFromMetaData));

  • In YourIndexingFilter, add the metadata from the page (page.getMetadata) to the NutchDocument:

    doc.add("your_specific_tag", value);

  • Most important: add your_specific_tag to the fields of:

    • Solr config file schema.xml (and restart Solr)

    <field name="your_specific_tag" type="string" stored="true" indexed="true"/>

    • Nutch config file schema.xml (I don't know whether this is really necessary)
    • Nutch config file solrindex-mapping.xml

    <field dest="your_specific_tag" source="your_specific_tag"/>
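
As a rough illustration of the "find your specific div" step, here is a minimal sketch that uses only the JDK. It assumes the parse filter is handed the page as an org.w3c.dom node, and it uses a plain recursive walk in place of Nutch's NodeWalker. The class and method names (DivTextExtractor, findById, extract, parse) are made up for this example and are not part of Nutch; the parse helper only stands in for the DOM that Nutch would supply.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not a Nutch class: locates <div id="..."> in a DOM tree.
public class DivTextExtractor {

    /** Depth-first search for the first element with the given tag name and id attribute. */
    static Element findById(Node node, String tag, String id) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if (tag.equalsIgnoreCase(el.getTagName()) && id.equals(el.getAttribute("id"))) {
                return el;
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            Element hit = findById(child, tag, id);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    /** Returns the trimmed text of the matching element, or null if there is none. */
    static String extract(Node root, String tag, String id) {
        Element el = findById(root, tag, id);
        return el == null ? null : el.getTextContent().trim();
    }

    /** Parses a well-formed snippet; a stand-in for the DOM a real parse filter receives. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Snippet is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<html><body><div id=\"something\">\n      me specific tag\n</div></body></html>");
        System.out.println(extract(doc, "div", "something"));
    }
}
```

In a real plugin, the string returned by extract would be the value stored in the page metadata by the parse filter and later copied into the NutchDocument by the indexing filter.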




Answer 2:


Just try http://lifelongprogrammer.blogspot.in/2013/08/nutch2-crawl-and-index-extra-tag.html. The tutorial explains how to extract the img tag and lists all the steps involved.




Answer 3:


You can use one of these custom plugins to parse XML files based on XPath (or CSS selectors):

  • https://github.com/BayanGroup/nutch-custom-search
  • http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
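
To show what an XPath-based selection looks like for the div in the question, here is a small sketch using the JDK's javax.xml.xpath API. It does not use either plugin and says nothing about their configuration format; the class name XPathFieldExample and the select helper are made up for this example, and the input is assumed to be well-formed markup.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical example class: evaluates an XPath expression with the JDK only.
public class XPathFieldExample {

    /** Returns the trimmed string value of the first node matching the expression ("" if none). */
    static String select(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or markup: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String page =
                "<html><body><div id=\"something\">\n      me specific tag\n</div></body></html>";
        // Selects the div whose id attribute is "something".
        System.out.println(select(page, "//div[@id='something']"));
    }
}
```

The expression //div[@id='something'] is the kind of selector you would map to a Solr field named something; how each plugin expects that mapping to be written is described in its own documentation.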



Answer 4:


You may want to check Nutch Plugin which should allow you to extract an element from a web page.



Source: https://stackoverflow.com/questions/12338967/how-to-parse-html-with-nutch-and-index-specific-tag-to-solr
