apache-tika

Parsing HTML issues with Apache Tika

自闭症网瘾萝莉.ら submitted on 2019-12-08 11:50:07
Question: I am crawling a webpage, extracting all the links from it, and then trying to parse each URL using Apache Tika and BoilerPipe with the code below. Some URLs parse fine, but for others I get an error like the one below, which points at line 102 of HTMLParser.java. This is line 102 in HTMLParser.java: String parsedText = tika.parseToString(htmlStream, md); I have also provided the HTMLParser code. org.apache.tika.exception.TikaException

Configuring Tika With Solr

烂漫一生 submitted on 2019-12-08 11:42:51
Question: I am looking to index rich document types (PDF, DOC, RTF, TXT) into Solr. I found Tika as a solution. I searched the web but didn't find any docs or links on making it work with ExtractingRequestHandler. Can anyone please provide a step-by-step way to configure Tika with ExtractingRequestHandler? Thanks in advance :) Answer 1: Check ExtractingRequestHandler for integration of Solr with Tika. Solr provides tika.config built in, and you would not need to define it unless overriding the config. You

Using Solr CELL's ExtractingRequestHandler to index/extract files from package formats

我与影子孤独终老i submitted on 2019-12-08 10:08:52
Question: Can you use ExtractingRequestHandler and Tika with any of the compressed file formats (zip, tar, gz, etc.) to extract the content for indexing? I am sending Solr the archived.tar file using curl: curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" -H 'Content-type:application/octet-stream' --data-binary "@/home/archived.tar" The result I get when I query the document is that the file names inside the archive are indexed as the "body_texts",
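
For readers who want to issue the same request from code, here is a minimal sketch of the curl call above using only the JDK's `java.net.http` API (Java 11+). The class and method names (`ExtractRequestSketch`, `buildExtractRequest`) are hypothetical, and the three-byte body is a placeholder for the archive contents. The sketch only reproduces the request shape from the question; it does not change how Tika treats the entries inside the archive.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: mirrors the curl command from the question.
public class ExtractRequestSketch {

    // Builds (but does not send) a POST to /update/extract with the
    // same query parameters and Content-Type header as the curl call.
    static HttpRequest buildExtractRequest(String solrBase, String docId, byte[] body) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", docId);           // literal field value for the document id
        params.put("fmap.content", "body_texts");  // map Tika's "content" to "body_texts"
        params.put("commit", "true");              // commit right after indexing

        String query = params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));

        return HttpRequest.newBuilder()
                .uri(URI.create(solrBase + "/update/extract?" + query))
                .header("Content-Type", "application/octet-stream")
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder bytes; a real call would pass the bytes of archived.tar.
        HttpRequest request = buildExtractRequest(
                "http://localhost:8983/solr", "doc1", new byte[] {1, 2, 3});
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually send it, the request can be passed to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. Keeping construction separate from sending makes the URL easy to inspect without a running Solr instance.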

Apache Tika 1.11 on Spark NoClassDefFoundError

我与影子孤独终老i submitted on 2019-12-08 04:28:18
Question: I'm trying to use Apache Tika on top of Spark. However, I'm having issues with configuration. My best guess at the moment is that the dependencies (of which Tika has a lot...) are not bundled with the JAR for Spark. If this intuition is correct, I am unsure what the best path forward is. But I am also not certain that this is even my issue. The following is a pretty simple Spark job which compiles but hits a runtime error when it gets to the Tika instantiation. My pom.xml is as follows:

ContentExtraction of PDF file in solr using Apache Tika

主宰稳场 submitted on 2019-12-08 04:22:04
Question: I am trying to index a PDF file in Solr using the following tutorial: http://wiki.apache.org/solr/ExtractingRequestHandler But every time I run the command java -jar post.jar *.pdf it reports the error org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3. Kindly help me index the PDF into the Solr server. Is there any integration other than Tika which can help me? Answer 1: Post.jar is just a utility to upload files to Solr. Solr uses Extract handler so you need to provide
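
One plausible reading of this error, stated as an assumption rather than something the post confirms, is that the raw PDF bytes reached a parser expecting UTF-8 text (for example, an XML update handler) instead of the extract handler. The sketch below uses only the JDK to show that such bytes fail strict UTF-8 decoding. The class name `Utf8CheckSketch` and the sample bytes are illustrative; the byte run `E2 E3 CF D3` is a binary marker commonly seen near the start of PDF files, and `0xE3` is not a valid continuation ("middle") byte after `0xE2`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: checks whether a byte sequence is strict UTF-8.
public class Utf8CheckSketch {

    // Returns true only if every byte sequence decodes as valid UTF-8.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A small XML update message: plain ASCII, so valid UTF-8.
        byte[] xml = "<add><doc/></add>".getBytes(StandardCharsets.UTF_8);
        // Illustrative PDF-like header followed by high "binary marker" bytes.
        byte[] pdfLike = {'%', 'P', 'D', 'F', '-', '1', '.', '4', '\n',
                '%', (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3, '\n'};
        System.out.println("xml valid UTF-8: " + isValidUtf8(xml));
        System.out.println("pdf-like valid UTF-8: " + isValidUtf8(pdfLike));
    }
}
```

If this reading applies, the remedy would lie in how the file is posted (target handler and content type), not in the PDF itself.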

Apache Tika 1.11 on Spark NoClassDefFoundError

只谈情不闲聊 submitted on 2019-12-08 03:34:27
I'm trying to use Apache Tika on top of Spark. However, I'm having issues with configuration. My best guess at the moment is that the dependencies (of which Tika has a lot...) are not bundled with the JAR for Spark. If this intuition is correct, I am unsure what the best path forward is. But I am also not certain that this is even my issue. The following is a pretty simple Spark job which compiles but hits a runtime error when it gets to the Tika instantiation. My pom.xml is as follows: <project> <groupId>tika.test</groupId> <artifactId>tikaTime</artifactId> <modelVersion>4.0.0</modelVersion>

Managed bean with a parameterized bean class must be @Dependent: class org.apache.cxf.jaxrs.provider.AbstractCachingMessageProvider

五迷三道 submitted on 2019-12-08 03:15:00
Question: After adding the Tika parser to my Spring application, I am getting the following error. I am running the application on WildFly 10.1.1 Final. 11:11:30,371 ERROR [org.jboss.msc.service.fail] (MSC service thread 1-2) MSC000001: Failed to start service jboss.deployment.unit."MyApp.war".WeldStartService: org.jboss.msc.service.StartException in service jboss.deployment.unit."MyApp.war".WeldStartService: Failed to start service at org.jboss.msc.service.ServiceControllerImpl$StartTask

Get page numbers of searchresult of a pdf in solr

左心房为你撑大大i submitted on 2019-12-07 16:47:37
Question: I'm building a web application where users can search for PDF documents and view them with pdf.js. I would like to display the search results with a short snippet of the paragraph where the search term was found and a link to open the document at the right page. So what I need is the page number and a short text snippet for every search result. I'm using Solr 4.1 to index PDF documents. The indexing itself works fine, but I don't know how to get the page number and paragraph of a search

Classpath issues running Tika on Spark

邮差的信 submitted on 2019-12-07 15:53:44
Question: I am trying to process a bunch of files with Tika. The number of files is in the thousands, so I decided to build an RDD of files and let Spark distribute the workload. Unfortunately, I get multiple NoClassDefFound exceptions. This is my sbt file: name := "TikaFileParser" version := "0.1" scalaVersion := "2.11.7" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "provided" libraryDependencies += "org.apache.tika" % "tika-core" % "1.11" libraryDependencies += "org.apache.tika" % "tika

Solr ExtractingRequestHandler giving empty content for pdf documents

青春壹個敷衍的年華 submitted on 2019-12-07 11:54:04
Question: I am using ExtractingRequestHandler in Solr to get document content and index it. It works fine for all Microsoft documents, but for PDFs the extracted content is empty. I have also tried extractOnly=true with curl, and that also returns just an empty body. I have used Tika independently on the same documents, and that extracts the content just fine. The difference is that when running it independently I am using the BodyContentHandler that comes with Tika instead of SolrContentHandler which is