Indexing all documents in a doc folder into Solr with FileListEntityProcessor

Submitted by 人走茶凉 on 2019-12-01 10:22:10

Question


http://wiki.apache.org/solr/ExtractingRequestHandler does not give much information on how to configure this handler in a web application that has its own context and wants to use Solr's server features through embedded Solr. Can you please provide some information on how to upload documents to Solr and then search the content of those documents? I have configured DIH in solrConf.xml as follows:

<requestHandler name="/dataimport" 
   class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">tika-data-config.xml</str>
    </lst>
  </requestHandler>

and tika-data-config.xml looks like this:

<dataConfig>
    <dataSource type="BinFileDataSource" name="bin" />
    <document>
      <entity name="sd"
        processor="FileListEntityProcessor"
        newerThan="'NOW-30DAYS'"
        filenName=".*\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)"
        baseDir="G:/workspace/FacetedSearch/src/solr/docs"
        recursive="true"
        rootEntity="false"
          >
            <field column="fileAbsolutePath" name="path" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastmodified" />
            <field column="fileAbsolutePath" name="text" />  
            <!-- <field column="fileName" name="text" /> -->
            <field column="baseDir" name="text" />

        <!-- <entity name="tika-test" processor="TikaEntityProcessor" 
          url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
         -->
         <entity name="tika-test" 
                 dataSource="bin"  
                 processor="TikaEntityProcessor" 
                 url="G:/workspace/FacetedSearch/src/solr/docs" 
                 format="text" >
          <field column="Author" name="author" meta="true"/>
          <field column="Content-Type" name="title" meta="true"/>
          <field column="title" name="title" meta="true"/>
          <field column="text" name="text"/>

        </entity>


    </entity>
  </document>

</dataConfig>

The directory G:/workspace/FacetedSearch/src/solr/docs contains many PDF and HTML files; some of them are tutorial.pdf......index.pdf
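One thing that can be checked in isolation is the file name pattern from the config above. The sketch below is plain Java with no Solr dependency. It assumes the processor applies the pattern with `find()`-style matching, which is an assumption here and not something stated in the question. `LOOSE` is the pattern exactly as written; `STRICT` is a hypothetical tightened variant that groups the extensions and anchors them to the end of the name. The class name, the constant names, and the sample names `notes.html` and `pdf-readme.txt` are made up for illustration. Note also that the config spells the attribute `filenName`, whereas the attribute is, as far as I know, documented as `fileName`, so the pattern may not be applied at all.

```java
import java.util.regex.Pattern;

public class FileNamePatternDemo {
    // Pattern exactly as written in the config above. The top-level "|"
    // splits it into independent alternatives: ".*\.(DOC)", "(PDF)", "(pdf)", ...
    static final Pattern LOOSE =
        Pattern.compile(".*\\.(DOC)|(PDF)|(pdf)|(doc)|(docx)|(ppt)");

    // Hypothetical tightened variant: one group of extensions, preceded by a
    // literal dot and anchored to the end of the file name.
    static final Pattern STRICT =
        Pattern.compile(".*\\.(DOC|PDF|pdf|doc|docx|ppt)$");

    // Assumption: the pattern is applied with find(), i.e. a match anywhere
    // in the file name is enough for the file to be accepted.
    static boolean accepts(Pattern pattern, String fileName) {
        return pattern.matcher(fileName).find();
    }

    public static void main(String[] args) {
        String[] names = {"index.pdf", "tutorial.pdf", "notes.html", "pdf-readme.txt"};
        for (String name : names) {
            System.out.println(name
                + " loose=" + accepts(LOOSE, name)
                + " strict=" + accepts(STRICT, name));
        }
    }
}
```

Under that assumption both patterns accept index.pdf and tutorial.pdf, so the loose pattern alone would not explain an empty result; it only over-matches names that merely contain one of the extensions.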

After this configuration, when I build the SolrQuery object as follows:

    CoreContainer.Initializer initializer = new CoreContainer.Initializer();
    CoreContainer coreContainer = initializer.initialize();
    EmbeddedSolrServer solrServer = new EmbeddedSolrServer(coreContainer, "");  
    SolrQuery solrQuery = new SolrQuery();
    solrQuery.addField("literal.id");   
    solrQuery.setQuery("index.pdf");
    QueryResponse queryResponse = null ;
    try {
        queryResponse = (QueryResponse) solrServer.query(solrQuery);
    } catch (Exception e) {
        // queryResponse stays null when the query throws
        System.out.println("exception occurred while processing the solrQuery "
            + e.getMessage() + " stack trace " + e + " " + solrQuery.toString());
    }
    out.println(queryResponse);

I do not get any result (here queryResponse is null). I am using the schema.xml distributed with Solr 3.5 and have added some fields:

<field name="path" type="text_general" indexed="true" stored="true" />   
<field name="lastmodified" type="date" indexed="true" stored="true" />

My questions are: will the documents in "G:/workspace/FacetedSearch/src/solr/docs" be indexed by Solr on Solr startup? If they are indexed, how can I get the results?

Can anyone please let me know what I am doing wrong?

Please let me know if any more information is needed from me to answer this.

Source: https://stackoverflow.com/questions/10252822/indexing-all-documents-in-doc-folder-in-to-solr-filelistentityprocessor
