Solr 6 and Nutch 2.3.1 integration

非 Y 不嫁゛ submitted on 2019-12-10 16:13:56

Question


According to the Nutch news page, the latest version of Nutch, 2.3.1, is compatible with Solr 4.10.3, which is a very old version of Solr.

Can Solr 6 be integrated with Nutch 2.3.1? What are the drawbacks of integrating Solr 6? Has anybody tried this?


Answer 1:


This is an old question, but I just got Nutch 1.12 talking to Solr 6.3.0. The required schema/solrconfig changes should be the same for Nutch 2.x, so here's what I did:

Download and extract both products into some directory, e.g. ~/mycrawler, then go into the Solr bin directory and create a core for Nutch:

solr-6.3.0/bin $ ./solr start
solr-6.3.0/bin $ ./solr create_core -c nutch -d basic_configs
solr-6.3.0/bin $ ./solr stop

This will create solr-6.3.0/server/solr/nutch where the schema etc. will be located. Now, we need to remove the new auto-managed schema definition and replace it with the nutch-supplied schema.xml:

solr-6.3.0/server/solr/nutch/conf $ rm managed-schema
solr-6.3.0/server/solr/nutch/conf $ cp ~/mycrawler/apache-nutch-1.12/conf/schema.xml .

Now edit schema.xml and remove all instances of enablePositionIncrements="true" in all <filter class="solr.StopFilterFactory" ignoreCase="true" ... definitions.
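
The attribute removal described above can also be scripted. This is only a minimal sketch: it assumes the attribute always appears exactly as enablePositionIncrements="true", and the function name strip_position_increments is made up for illustration. The sed call leaves a .bak copy of the file next to the original.

```shell
# Hypothetical helper (not part of Nutch or Solr): remove every
# enablePositionIncrements="true" attribute from the given schema file.
strip_position_increments() {
    schema_file="$1"
    # -i.bak edits in place and keeps a backup copy; this spelling is
    # accepted by both GNU sed and BSD sed.
    # The leading [[:space:]]* also removes the whitespace before the attribute.
    sed -i.bak 's/[[:space:]]*enablePositionIncrements="true"//g' "$schema_file"
}

# Example call, using the path from the steps above:
# strip_position_increments solr-6.3.0/server/solr/nutch/conf/schema.xml
```

Afterwards, grep enablePositionIncrements schema.xml should print nothing.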

Also, in solr-6.3.0/server/solr/nutch/conf/solrconfig.xml, comment out the typeMapping blocks, so you get:

<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
  <str name="defaultFieldType">strings</str>
    <!--
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Boolean</str>
    <str name="fieldType">booleans</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.util.Date</str>
    <str name="fieldType">tdates</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Long</str>
    <str name="valueClass">java.lang.Integer</str>
    <str name="fieldType">tlongs</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Number</str>
    <str name="fieldType">tdoubles</str>
  </lst>
    -->
</processor>

Now start the server again:

solr-6.3.0/bin $ ./solr start

If you go to the admin GUI, it should show the core as started, with no further schema issues.
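
As an alternative to the admin GUI, the core can be checked from the command line through Solr's CoreAdmin STATUS action. The sketch below assumes Solr is listening on the default http://localhost:8983/solr and that curl is installed; core_status_url is a hypothetical helper name, not something shipped with Solr.

```shell
# Hypothetical helper: build the CoreAdmin STATUS URL for a core.
#   $1 = Solr base URL (assumed: http://localhost:8983/solr)
#   $2 = core name (nutch in the steps above)
core_status_url() {
    printf '%s/admin/cores?action=STATUS&core=%s&wt=json\n' "$1" "$2"
}

# Example call (needs a running Solr instance):
# curl -s "$(core_status_url http://localhost:8983/solr nutch)"
```

If the schema failed to load, the response should report the problem for the core instead of its normal status details.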

Now the crawl script can be run and will successfully write into our bleeding edge Solr (this is probably slightly different for Nutch 2):

./crawl -i \
    -D solr.server.url=http://localhost:8983/solr/nutch \
    ~/mycrawler/nutch_work/seed \
    ~/mycrawler/nutch_work/crawl \
    1
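
To confirm that the crawl actually wrote documents into the core, one option is to ask Solr how many documents it holds. This is a rough sketch, assuming the JSON response of the select handler contains a "numFound" field and that curl is available; extract_num_found is a hypothetical helper name.

```shell
# Hypothetical helper: read a Solr JSON select response on stdin and
# print the first numFound value it contains.
extract_num_found() {
    sed -n 's/.*"numFound":[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Example call (needs a running Solr with the nutch core):
# curl -s 'http://localhost:8983/solr/nutch/select?q=*:*&rows=0&wt=json' | extract_num_found
```

A value greater than zero after the crawl suggests that indexing into the nutch core worked.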


Source: https://stackoverflow.com/questions/38525848/solr-6-and-nutch-2-3-1-integration
