solr-cell

Solr ExtractingRequestHandler extracting “rect” in links

若如初见. submitted on 2020-12-29 13:25:07
Question: I am using the Solr ExtractingRequestHandler to extract and index HTML content. My issue is with the extracted links section that it produces: the extracted content has "rect" inserted where it does not exist in the HTML source. My Solr Cell configuration in solrconfig.xml is as follows: <requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" > <lst name="defaults"> <str name="lowernames">true</str> <!-- capture link hrefs but ignore div

Solr ExtractingRequestHandler extracting “rect” in links

依然范特西╮ submitted on 2020-12-29 13:23:47
Question: I am using the Solr ExtractingRequestHandler to extract and index HTML content. My issue is with the extracted links section that it produces: the extracted content has "rect" inserted where it does not exist in the HTML source. My Solr Cell configuration in solrconfig.xml is as follows: <requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" > <lst name="defaults"> <str name="lowernames">true</str> <!-- capture link hrefs but ignore div

Solr open document after searching a keyword

旧巷老猫 submitted on 2020-01-02 10:58:33
Question: I am trying to index some PDF documents and then create a search UI. This question is somewhat related to "Solr Index PDF documents and post them to a remote server". 1) Indexing PDF docs: I use the Tika jar to convert the PDFs to text files and then use a curl command to index them. 2) Search UI: I'm using the Solritas browse feature and its built-in UI. Objective: when I search for a word, say "Lucene", in the list of indexed documents and get a result set for the given query, I want a link to be

How do I index documents in SOLR?

混江龙づ霸主 submitted on 2019-12-30 04:40:08
Question: I'm running Solr 1.4 on Ubuntu 10.04 (installed via apt-get solr-tomcat) and it seems to be working fine. I'm having some difficulty finding any coherent info on how to index documents, though. I'm new to Solr, so bear with me! I have a folder (/mnt/folder) that is a mounted Windows share and contains Word and PDF files that I would like indexed. What's the easiest way to get Solr to index the entire folder? The documentation for Solr is pretty poor; it's impossible to find any decent tutorials
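The first half of the task described above (finding every Word and PDF file under a folder) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_rich_documents` and the extension set are hypothetical, and the sketch stops before the files are sent to Solr, since the excerpt does not say which handler is configured.

```python
import os
from pathlib import Path

# Hypothetical choice of extensions for "Word and PDF files".
RICH_EXTENSIONS = {".pdf", ".doc", ".docx"}


def find_rich_documents(root):
    """Return the sorted paths under `root` whose extension is in RICH_EXTENSIONS."""
    matches = []
    # os.walk descends into every subdirectory of root.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Compare extensions case-insensitively (.PDF, .Docx, ...).
            if Path(name).suffix.lower() in RICH_EXTENSIONS:
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


# /mnt/folder is the path from the question; nothing is printed if it does not exist.
for path in find_rich_documents("/mnt/folder"):
    print(path)
```

Each path returned could then be posted to whatever extraction handler the Solr installation exposes; that step is left out here on purpose.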

Can Solr retain the formatting of the HTML documents which were fed to it in its results?

时光毁灭记忆、已成空白 submitted on 2019-12-25 09:42:11
Question: How do I maintain the original formatting of an HTML document in the results returned by Solr? I am trying to provide search functionality on one of my company's websites, which has millions of documents that do not all share similar formatting, so it is hard to format each document individually. I am using a Solr 4.1 nightly build from the Apache site, which has built-in support for solr-cell and Tika, i.e. I do not need to configure them separately. Does solr-cell or Tika retain these

Error while indexing .xml files in solr

久未见 submitted on 2019-12-24 09:27:02
Question: I am trying to index XML files in the Solr search engine using the following command: java -Durl=http://10.1.11.143:8080/solr/#/ -jar post.jar solr.xml But I am getting the following error: SimplePostTool version 1.5 Posting files to base url http://10.1.11.143:8080/solr/#/ using content-type application/xml.. POSTing file solr.xml SimplePostTool: WARNING: Solr returned an error #500 Internal Server Error SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned
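One thing that stands out in the command above is the `#/` at the end of the URL, which looks like the address of the admin UI rather than an update endpoint. Whether that is the actual cause of the 500 error cannot be confirmed from the truncated excerpt, so the following is only a sketch of how such an address could be rewritten. The helper name `to_update_url` and the `update` handler path are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def to_update_url(admin_url, handler="update"):
    """Drop any query and fragment from `admin_url` and append the handler path."""
    parts = urlsplit(admin_url)
    # "/solr/" -> "/solr/update"; the "#/" fragment is discarded below.
    path = parts.path.rstrip("/") + "/" + handler.strip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


print(to_update_url("http://10.1.11.143:8080/solr/#/"))
# prints http://10.1.11.143:8080/solr/update
```

The resulting address is the kind of value that would be passed to `-Durl=`, assuming the Solr instance exposes its update handler at that path.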

Indexing PDF with Solr

∥☆過路亽.° submitted on 2019-12-18 12:54:43
Question: Can anyone point me to a tutorial? My main experience with Solr is indexing CSV files, but I cannot find any simple instructions or tutorial telling me what I need to do to index PDFs. I have seen this: http://wiki.apache.org/solr/ExtractingRequestHandler but it makes very little sense to me. Do I need to install Tika? I'm lost, please help. Answer 1: With solr-4.9 (the latest version as of now), extracting data from rich documents like PDFs, spreadsheets (xls, xlsx family), presentations (ppt, pptx),

How to index pdf's content with SolrJ?

最后都变了- submitted on 2019-12-11 00:38:11
Question: I'm trying to index a few PDF documents using SolrJ as described at http://wiki.apache.org/solr/ContentStreamUpdateRequestExample. The code is below: import static org.apache.solr.handler.extraction.ExtractingParams.LITERALS_PREFIX; import static org.apache.solr.handler.extraction.ExtractingParams.MAP_PREFIX; import static org.apache.solr.handler.extraction.ExtractingParams.UNKNOWN_FIELD_PREFIX; import org.apache.solr.client.solrj.SolrServer; import org.apache.solr.client.solrj

Solr's TikaEntityProcessor not working

≯℡__Kan透↙ submitted on 2019-12-10 18:13:35
Question: I'm trying to get Solr to index a database in which one column is the filename of a PDF document I'd like to index. My configuration looks like this: <dataConfig> <dataSource name="ds-db" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/document_db" user="user" password="password" readOnly="true"/> <dataSource name="ds-file" type="BinFileDataSource"/> <document name="documents"> <entity name="document" dataSource="ds-db" query="select * from documents"> <entity processor=

Using Solr CELL's ExtractingRequestHandler to index/extract files from package formats

我与影子孤独终老i submitted on 2019-12-08 10:08:52
Question: Can you use ExtractingRequestHandler and Tika with any of the compressed file formats (zip, tar, gz, etc.) to extract the content for indexing? I am sending Solr the archived.tar file using curl: curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" -H 'Content-type:application/octet-stream' --data-binary "@/home/archived.tar" The result I get when I query the document is that the file names inside the archive are indexed as the "body_texts",
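For readers who prefer a script to curl, the request above can be rebuilt with the Python standard library. This sketch only mirrors the parameters shown in the question (`literal.id`, `fmap.content`, `commit`). It does not change how Tika treats the entries inside an archive, and the function name `build_extract_request` is hypothetical. The request is constructed but not sent, so the snippet runs without a Solr server.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_base, doc_id, payload, content_field="body_texts"):
    """Build (but do not send) a POST to /update/extract carrying raw file bytes."""
    query = urlencode({
        "literal.id": doc_id,           # unique id assigned to the document
        "fmap.content": content_field,  # map extracted content to this field
        "commit": "true",               # commit right after indexing
    })
    url = f"{solr_base.rstrip('/')}/update/extract?{query}"
    return Request(
        url,
        data=payload,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )


# Placeholder bytes stand in for the contents of /home/archived.tar.
req = build_extract_request("http://localhost:8983/solr", "doc1", b"example bytes")
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it, assuming a Solr instance is listening at that address.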