Solr indexing following a Nutch crawl fails, reports “Job Failed”

橙三吉。 提交于 2019-12-03 03:16:03
Diaa

This happens when not all required fields from nutch are in the schema.xml of solr. Did you add the fields from Nutch's schema.xml?

If you add in the section "fields" the following, things should work:

<field name="id" type="string" stored="true" indexed="true"/>
<!-- core fields -->
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>

<!-- fields for index-basic plugin -->
<field name="host" type="string" stored="false" indexed="true"/>
<field name="url" type="url" stored="true" indexed="true"
    required="true"/>
<field name="content" type="text_general" stored="false" indexed="true"/>
<field name="title" type="text_general" stored="true" indexed="true"/>
<field name="cache" type="string" stored="true" indexed="false"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>

<!-- fields for index-anchor plugin -->
<field name="anchor" type="string" stored="true" indexed="true"
    multiValued="true"/>

<!-- fields for index-more plugin -->
<field name="type" type="string" stored="true" indexed="true"
    multiValued="true"/>
<field name="contentLength" type="long" stored="true"
    indexed="false"/>
<field name="lastModified" type="date" stored="true"
    indexed="false"/>
<field name="date" type="date" stored="true" indexed="true"/>

<!-- fields for languageidentifier plugin -->
<field name="lang" type="string" stored="true" indexed="true"/>

<!-- fields for subcollection plugin -->
<field name="subcollection" type="string" stored="true"
    indexed="true" multiValued="true"/>

<!-- fields for feed plugin (tag is also used by microformats-reltag)-->
<field name="author" type="string" stored="true" indexed="true"/>
<field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
<field name="feed" type="string" stored="true" indexed="true"/>
<field name="publishedDate" type="date" stored="true"
    indexed="true"/>
<field name="updatedDate" type="date" stored="true"
    indexed="true"/>

<!-- fields for creativecommons plugin -->
<field name="cc" type="string" stored="true" indexed="true"
    multiValued="true"/>

<!-- fields for tld plugin -->    
<field name="tld" type="string" stored="false" indexed="false"/>

I had a similar problem with Nutch 1.8 and Solr 4.8.0. In fact Diaa's answer helped me solve the problem. After having removed some intersections of schema.xml with Diaa's field list and after having changed two entries marked as "added by wb" and "changed by wb" I ended up with the following field list which worked for me. As opposed to earlier versions of nutch and solr there is no tag for "fields" any more. Entries tagged as "field" are simply within "schema". This is the complete field list:

   <field name="_root_" type="string" indexed="true" stored="false"/>

   <!-- Only remove the "id" field if you have a very good reason to. While not strictly
     required, it is highly recommended. A <uniqueKey> is present in almost all Solr 
     installations. See the <uniqueKey> declaration below where <uniqueKey> is set to "id".
   -->   
   <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 

   <field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
   <field name="name" type="text_general" indexed="true" stored="true"/>
   <field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
   <field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />

   <field name="weight" type="float" indexed="true" stored="true"/>
   <field name="price"  type="float" indexed="true" stored="true"/>
   <field name="popularity" type="int" indexed="true" stored="true" />
   <field name="inStock" type="boolean" indexed="true" stored="true" />

   <field name="store" type="location" indexed="true" stored="true"/>

   <!-- Common metadata fields, named specifically to match up with
     SolrCell metadata when parsing rich documents such as Word, PDF.
     Some fields are multiValued only because Tika currently may return
     multiple values for them. Some metadata is parsed from the documents,
     but there are some which come from the client context:
       "content_type": From the HTTP headers of incoming stream
       "resourcename": From SolrCell request param resource.name
   -->
   <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="subject" type="text_general" indexed="true" stored="true"/>
   <field name="description" type="text_general" indexed="true" stored="true"/>
   <field name="comments" type="text_general" indexed="true" stored="true"/>
   <field name="author" type="text_general" indexed="true" stored="true"/>
   <field name="keywords" type="text_general" indexed="true" stored="true"/>
   <field name="category" type="text_general" indexed="true" stored="true"/>
   <field name="resourcename" type="text_general" indexed="true" stored="true"/>

   <!-- added by wb: required="true" -->
   <field name="url" type="text_general" indexed="true" stored="true" required="true"/> 

   <field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="last_modified" type="date" indexed="true" stored="true"/>
   <field name="links" type="string" indexed="true" stored="true" multiValued="true"/>

   <!-- Main body of document extracted by SolrCell.
        NOTE: This field is not indexed by default, since it is also copied to "text"
        using copyField below. This is to save space. Use this field for returning and
        highlighting document content. Use the "text" field to search the content. -->

   <!-- changedby wb: indexed="true" -->
   <field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/> 


   <!-- catchall field, containing all other searchable text fields (implemented
        via copyField further on in this schema  -->
   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

   <!-- catchall text field that indexes tokens both normally and in reverse for efficient
        leading wildcard queries. -->
   <field name="text_rev" type="text_general_rev" indexed="true" stored="false" multiValued="true"/>

   <!-- non-tokenized version of manufacturer to make it easier to sort or group
        results by manufacturer.  copied from "manu" via copyField -->
   <field name="manu_exact" type="string" indexed="true" stored="false"/>

   <field name="payloads" type="payloads" indexed="true" stored="true"/>

   <!-- Fields needed for Nutch 1.8 integration: -->

    <field name="segment" type="string" stored="true" indexed="false"/>
    <field name="digest" type="string" stored="true" indexed="false"/>
    <field name="boost" type="float" stored="true" indexed="false"/>

    <!-- fields for index-basic plugin -->
    <field name="host" type="string" stored="false" indexed="true"/>
    <field name="cache" type="string" stored="true" indexed="false"/>
    <field name="tstamp" type="date" stored="true" indexed="false"/>

    <!-- fields for index-anchor plugin -->
    <field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>

    <!-- fields for index-more plugin -->
    <field name="type" type="string" stored="true" indexed="true" multiValued="true"/>
    <field name="contentLength" type="long" stored="true" indexed="false"/>
    <field name="lastModified" type="date" stored="true" indexed="false"/>
    <field name="date" type="date" stored="true" indexed="true"/>

    <!-- fields for languageidentifier plugin -->
    <field name="lang" type="string" stored="true" indexed="true"/>

    <!-- fields for subcollection plugin -->
    <field name="subcollection" type="string" stored="true" indexed="true" multiValued="true"/>

    <!-- fields for feed plugin (tag is also used by microformats-reltag)-->
    <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
    <field name="feed" type="string" stored="true" indexed="true"/>
    <field name="publishedDate" type="date" stored="true" indexed="true"/>
    <field name="updatedDate" type="date" stored="true" indexed="true"/>

    <!-- fields for creativecommons plugin -->
    <field name="cc" type="string" stored="true" indexed="true" multiValued="true"/>

    <!-- fields for tld plugin -->    
    <field name="tld" type="string" stored="false" indexed="false"/>

   <!-- End of fields needed for Nutch 1.8 integration: -->

hello i know this question is old , but for the people using nutch and solr in 2017 with version (nutch 1.13 , solr 5.5.0) , i had the same problem i just solve with following solution

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/#/nutch urls/ TestCrawl2/ 1

above is the command i a using for crawl , but i had same error when i using this

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch urls TestCrawl2 2

i just remove the '/' after urls/ TestCrawl2/ , it works for me thanks

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!