How to parse HTML with Nutch and index a specific tag to Solr?

别那么骄傲 2021-01-13 07:25

I have installed Nutch and Solr to crawl a website and search it. As you know, we can index the meta tags of web pages into Solr with Nutch's parse-metatags plugin (http:

4 answers
  •  盖世英雄少女心
    2021-01-13 08:01

    I wrote my own plugin for something similar to what you want. The config file that maps a NutchDocument to a SolrDocument is $NUTCH_HOME/conf/solrindex-mapping.xml, and you can add your own tags there. You still have to populate those tags somewhere, though.

    Here are some tips for the plugin:

    • Read http://wiki.apache.org/nutch/WritingPluginExample, which shows how to write a plugin quite simply.
    • In your plugin, extend the ParseFilter and IndexingFilter.
    • In YourParseFilter, you can use NodeWalker to find your specific div.
    • Put the parsed information into the page metadata like this:

      page.putToMetadata(new Utf8("yourKEY"), ByteBuffer.wrap(YourByteArrayParsedFromMetaData));

    • In YourIndexingFilter, add the metadata from the page (page.getMetadata) to the NutchDocument:

      doc.add("your_specific_tag", value);

    • Most important: add your_specific_tag to the fields in:

      • Solr config file schema.xml (and restart Solr)

      <field name="your_specific_tag" type="string" stored="true" indexed="true"/>

      • Nutch config file schema.xml (I don't know whether this is really necessary)
      • Nutch config file solrindex-mapping.xml

      <field dest="your_specific_tag" source="your_specific_tag"/>
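
To make the parse-and-store steps above more concrete, here is a rough sketch that uses only JDK classes, so it runs without Nutch on the classpath. Everything named here is an assumption for illustration: the class `DivExtractionSketch`, the helpers `findDivText`, `encode` and `decode`, and the `price` CSS class are hypothetical, and a plain `Map<String, ByteBuffer>` stands in for the page metadata. In a real ParseFilter the DOM would be handed to you by Nutch, and NodeWalker could replace the recursion.

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class DivExtractionSketch {

    /** Depth-first search for the first div whose class attribute equals cssClass. */
    public static String findDivText(Node node, String cssClass) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // HTML parsers may report element names in upper case, so compare case-insensitively
            if ("div".equalsIgnoreCase(el.getNodeName())
                    && cssClass.equals(el.getAttribute("class"))) {
                return el.getTextContent().trim();
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            String found = findDivText(child, cssClass);
            if (found != null) {
                return found;
            }
        }
        return null; // no matching div in this subtree
    }

    /** Wraps a string as UTF-8 bytes, the shape the metadata value has in the answer. */
    public static ByteBuffer encode(String value) {
        return ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8));
    }

    /** Reads the string back; duplicate() leaves the stored buffer's position untouched. */
    public static String decode(ByteBuffer buffer) {
        return StandardCharsets.UTF_8.decode(buffer.duplicate()).toString();
    }

    public static void main(String[] args) throws Exception {
        // Well-formed XHTML so the JDK XML parser can build the DOM (assumption for the demo)
        String xhtml = "<html><body><div class=\"nav\">menu</div>"
                + "<div class=\"price\"> 42 EUR </div></body></html>";
        Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));

        // Stand-in for the page metadata: parse step writes, indexing step reads
        Map<String, ByteBuffer> metadata = new HashMap<>();
        metadata.put("yourKEY", encode(findDivText(root, "price")));

        System.out.println(decode(metadata.get("yourKEY")));
    }
}
```

In the indexing filter, the decoded string is what you would pass as `value` to `doc.add("your_specific_tag", value)`.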
