solr-cell

Get page numbers of search results of a PDF in Solr

左心房为你撑大大i submitted on 2019-12-07 16:47:37
Question: I'm building a web application where users can search for PDF documents and view them with pdf.js. I would like to display the search results with a short snippet of the paragraph where the search term was found and a link to open the document at the right page. So what I need is the page number and a short text snippet for every search result. I'm using Solr 4.1 to index PDF documents. The indexing itself works fine, but I don't know how to get the page number and paragraph of a search
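One possible approach, offered as an assumption rather than something the question confirms, is to index each PDF page as its own Solr document carrying a page-number field, and then ask Solr for highlighted snippets. In the sketch below the field names `doc_id`, `page`, and `content` are hypothetical and would have to exist in the schema, and the page texts are assumed to have been extracted beforehand. `hl`, `hl.fl`, `hl.snippets`, and `hl.fragsize` are standard Solr highlighting parameters.

```python
from urllib.parse import urlencode


def page_documents(doc_id, page_texts):
    """Build one Solr document per PDF page (field names are hypothetical)."""
    return [
        {
            "id": f"{doc_id}_p{number}",  # unique key per page
            "doc_id": doc_id,             # lets a hit link back to the PDF
            "page": number,               # 1-based page number
            "content": text,
        }
        for number, text in enumerate(page_texts, start=1)
    ]


def highlight_query(term, rows=10):
    """Build a query string that asks Solr for one highlighted snippet per hit."""
    params = {
        "q": f"content:{term}",
        "fl": "id,doc_id,page",
        "hl": "true",
        "hl.fl": "content",
        "hl.snippets": 1,
        "hl.fragsize": 150,
        "rows": rows,
        "wt": "json",
    }
    return urlencode(params)


docs = page_documents("report", ["Intro text", "Solr indexes PDFs"])
print(docs[1]["id"], docs[1]["page"])
print(highlight_query("solr"))
```

With documents shaped like this, each hit would carry its own `page` value, which a front end could presumably append to the pdf.js viewer URL as a `#page=N` fragment.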

Solr ExtractingRequestHandler giving empty content for pdf documents

青春壹個敷衍的年華 submitted on 2019-12-07 11:54:04
Question: I am using the ExtractingRequestHandler in Solr to get document content and index it. It works fine for all Microsoft documents, but for PDFs, the extracted content is empty. I have also tried extractOnly=true with curl, and that also returns just an empty body. I have used Tika independently on the same documents, and that extracts content just fine. The difference is that when running Tika independently I am using the BodyContentHandler that comes with Tika instead of the SolrContentHandler which is
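For reference, the extractOnly check mentioned in the question could be reproduced from Python roughly as follows. This is a minimal sketch: the base URL and the `/update/extract` handler path are assumptions about a default local install, and the placeholder bytes stand in for a real PDF. `extractOnly` and `extractFormat` are documented ExtractingRequestHandler parameters.

```python
import urllib.request
from urllib.parse import urlencode


def extract_only_request(pdf_bytes, solr_url="http://localhost:8983/solr"):
    """Build a POST asking Solr Cell to return extracted text without indexing."""
    params = urlencode({
        "extractOnly": "true",    # return Tika's output, do not index anything
        "extractFormat": "text",  # plain text instead of the default XHTML
        "wt": "json",
    })
    return urllib.request.Request(
        f"{solr_url}/update/extract?{params}",
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


def send(request):
    """Send the request; needs a running Solr instance (not called here)."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Placeholder bytes; in practice these would be read from the PDF file.
request = extract_only_request(b"%PDF-1.4 placeholder")
print(request.get_method(), request.full_url)
```

Comparing the body returned by `send(request)` with the output of a standalone Tika run on the same file would show whether the empty content originates in the Solr side of the pipeline.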

How do I index rich-format documents contained as database BLOBs with Solr 4.0+?

落爺英雄遲暮 submitted on 2019-12-07 08:47:34
Question: I've found a few related solutions to this problem. They will not work for me, as I'll explain. (I'm using Solr 4.0 and indexing data stored in an Oracle 11g database.) Jonck van der Kogel's related solution (from 2009) is explained here. He describes creating a custom Transformer, sort of like the ClobTransformer that ships with Solr. This goes down the elegant path but does not use Tika, which is now integrated with Solr. (He uses external PDFBox and FontBox.) This

Solr : data import handler and solr cell

柔情痞子 submitted on 2019-12-06 15:58:50
Question: Is it possible to index rich documents (PDF, Office, ...) with the data import handler using Solr Cell? I use Solr 3.2. Thanks. Answer 1: Solr Cell, aka ExtractingRequestHandler, uses Apache Tika behind the scenes, and the latter can easily be integrated into a DataImportHandler: <dataConfig> <!-- use any of type DataSource<InputStream> --> <dataSource type="BinURLDataSource"/> <document> <!-- The value of format can be text|xml|html|none. This is the format in which the body is emitted (the 'text' field)

Solr ExtractingRequestHandler giving empty content for pdf documents

风格不统一 submitted on 2019-12-05 18:54:05
I am using the ExtractingRequestHandler in Solr to get document content and index it. It works fine for all Microsoft documents, but for PDFs, the extracted content is empty. I have also tried extractOnly=true with curl, and that also returns just an empty body. I have used Tika independently on the same documents, and that extracts content just fine. The difference is that when running Tika independently I am using the BodyContentHandler that comes with Tika instead of the SolrContentHandler which is used by Solr. Has anybody seen this? I would really rather let Solr handle it than me using Tika to

Solr : data import handler and solr cell

北战南征 submitted on 2019-12-04 22:55:52
Is it possible to index rich documents (PDF, Office, ...) with the data import handler using Solr Cell? I use Solr 3.2. Thanks. Solr Cell, aka ExtractingRequestHandler, uses Apache Tika behind the scenes, and the latter can easily be integrated into a DataImportHandler: <dataConfig> <!-- use any of type DataSource<InputStream> --> <dataSource type="BinURLDataSource"/> <document> <!-- The value of format can be text|xml|html|none. This is the format in which the body is emitted (the 'text' field). The implicit field 'text' will have that format. The default value is 'text' (if not specified). format=

Indexing PDF with Solr

心不动则不痛 submitted on 2019-11-30 08:26:27
Can anyone point me to a tutorial? My main experience with Solr is indexing CSV files, but I cannot find any simple instructions or tutorial telling me what I need to do to index PDFs. I have seen this: http://wiki.apache.org/solr/ExtractingRequestHandler But it makes very little sense to me. Do I need to install Tika? I'm lost, please help. With Solr 4.9 (the latest version as of now), extracting data from rich documents like PDFs, spreadsheets (xls, xlsx family), presentations (ppt, pptx), and documents (doc, txt, etc.) has become fairly simple. The sample code examples provided in the downloaded
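As a rough illustration of the kind of request the ExtractingRequestHandler expects, the sketch below posts a PDF's bytes with a document ID supplied through `literal.id`. The base URL, the `/update/extract` handler path, the `attr_` prefix for unknown fields, and the ID `doc1` are assumptions modelled on a stock example setup, not details taken from the answer. `literal.id`, `uprefix`, and `commit` are documented parameters.

```python
import urllib.request
from urllib.parse import urlencode


def index_pdf_request(pdf_bytes, doc_id, solr_url="http://localhost:8983/solr"):
    """Build a POST that sends a PDF to Solr Cell for extraction and indexing."""
    params = urlencode({
        "literal.id": doc_id,  # unique key for the new document
        "uprefix": "attr_",    # prefix for extracted fields not in the schema
        "commit": "true",      # make the document searchable right away
    })
    return urllib.request.Request(
        f"{solr_url}/update/extract?{params}",
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


def send(request):
    """Send the request; needs a running Solr instance (not called here)."""
    with urllib.request.urlopen(request) as response:
        return response.status


# Placeholder bytes; in practice these would be read from the PDF file.
index_request = index_pdf_request(b"%PDF-1.4 placeholder", "doc1")
print(index_request.full_url)
```

Because Solr Cell uses Tika behind the scenes, a request like this should not require installing Tika separately, provided the extraction handler and its libraries are enabled in the Solr configuration.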