Index pdf file content using Apache Solr

Posted by 此生再无相见时 on 2019-12-01 11:06:08

Solr uses Apache Tika to extract the contents of rich documents (such as PDF files) and add them to the Solr document.

Documentation :-

You may notice that although you can search on any of the text in the sample document, you may not be able to see that text when the document is retrieved. This is simply because the "content" field generated by Tika is mapped to the Solr field called "text", which is indexed but not stored. This is done via the default map rule in the /update/extract handler in solrconfig.xml and can be easily changed or overridden. For example, to store and see all metadata and content, execute the following:

Default schema.xml :-

<!-- Main body of document extracted by SolrCell.
    NOTE: This field is not indexed by default, since it is also copied to "text"
    using copyField below. This is to save space. Use this field for returning and
    highlighting document content. Use the "text" field to search the content. -->
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
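
For context, the "text" field and the copyField rule that the comment refers to would look roughly like this in the same schema.xml. This is a sketch based on the stock example schema; the exact attributes may differ in your Solr version, so check your own file.

```xml
<!-- Catch-all search field: indexed for searching, not stored, so it is never returned. -->
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

<!-- Copy the extracted body into "text" so it is searchable;
     "content" itself stays stored-only for display and highlighting. -->
<copyField source="content" dest="text"/>
```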

If you use a different field to hold the file contents, override the default by setting fmap.content=filecontent in solrconfig.xml itself.
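
As a rough sketch, such an override could sit in the defaults of the /update/extract handler. The field name filecontent comes from the sentence above and is assumed to be defined in your schema.xml; the handler class and attributes follow the stock example config and may differ in your version.

```xml
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Send Tika's extracted body to "filecontent" instead of the default target. -->
    <str name="fmap.content">filecontent</str>
  </lst>
</requestHandler>
```

The target field needs stored="true" in schema.xml if you want the extracted text returned in search results.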

The fmap.content=attr_content param overrides the default fmap.content=text causing the content to be added to the attr_content field instead.

If you want to index the file as a single document, use the literal. prefix to set field values on that document, e.g. literal.id=1&literal.name=Name:

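// Post the file to the extracting handler; literal.* params set fields on the resulting document.
$ch = curl_init('http://localhost:8010/solr/update/extract?literal.id=1&literal.name=Name&commit=true');
curl_setopt($ch, CURLOPT_POST, true);
// CURLFile (PHP 5.5+) replaces the old '@path' upload syntax, which was removed in PHP 7.
curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => new CURLFile($post->filepath)));
// Return the response body as a string instead of printing it.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
curl_close($ch);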