apache-tika

where to get Apache Tika jar?

十年热恋 submitted on 2020-01-06 19:34:37
Question: I am trying to build an app that uses Apache Tika to parse PDFs, but I can't work out where to get libraries such as tika-core/target/tika-core- .jar and tika-parsers/target/tika-parsers- .jar. I can only find tika-app; there are no JARs like the ones above. http://tika.apache.org/1.11/gettingstarted.html Answer 1: Apache Tika has a large number of dependencies it needs to run. Without those present, it will do very little! You therefore need to use a dependency management tool to not only get Apache Tika, but also

Error while parsing Binary Files… (mostly PDF)

*爱你&永不变心* submitted on 2020-01-06 03:04:15
Question: I am trying to parse PDF files with Apache Tika, reading the binary files through a ByteArrayInputStream. Some PDFs parse fine, but others now fail with an error. I could parse these same PDFs with Tika before; the errors started only after I switched to ByteArrayInputStream, so I suspect a problem with the byte array. This is the error I am getting: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf
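
The excerpt is cut off before the root cause, so this is only a sketch of one thing worth ruling out: a byte array that holds just part of the file because a single read() call returned fewer bytes than the file contains. The class and method names (StreamBuffering, readFully, asStream) are hypothetical helpers and not part of Tika. They use only the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper (not a Tika API): buffers a stream completely before parsing.
class StreamBuffering {

    // Reads until end of stream. A single read() may return fewer bytes than
    // requested, so the loop continues until read() reports -1.
    static byte[] readFully(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Wraps the complete byte array so a parser can read it from the first byte.
    static ByteArrayInputStream asStream(byte[] data) {
        return new ByteArrayInputStream(data);
    }
}
```

If the array length matches the file size on disk and the same PDFs still fail, the byte array is probably not the cause, and the full stack trace under the TikaException is the next place to look.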

Apache Tika ArchiveStreamFactory.detect error

大憨熊 submitted on 2020-01-04 06:01:38
Question: I'm using Java with Apache Tika 1.18 to convert some files to TXT. When I try to use AutoDetectParser, I get this error: [ERROR ] Error occurred during error handling, give up! org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(Ljava/io/InputStream;)Ljava/lang/String; [ERROR ] SRVE0777E: Exception thrown by application class 'org.apache.cxf.service.invoker.AbstractInvoker.createFault:162' org.apache.cxf.interceptor.Fault: org.apache.commons.compress.archivers
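
The descriptor in the log, detect(Ljava/io/InputStream;)Ljava/lang/String;, names a method detect(InputStream) that returns a String. One possible reading (an assumption, since the excerpt is truncated) is that a different commons-compress JAR than the one Tika expects is being loaded first, for example one bundled with the application server. The sketch below uses only JDK reflection to show where a class is loaded from and whether it has a given public method. ClasspathCheck, locate, and hasMethod are hypothetical names.

```java
import java.io.InputStream;
import java.security.CodeSource;

// Hypothetical diagnostic helper: uses only JDK reflection, no Tika or commons-compress imports.
class ClasspathCheck {

    // Returns the JAR or directory a class was loaded from, if it can be found.
    static String locate(String className) {
        try {
            Class<?> c = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            return src == null ? "JDK/bootstrap class" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "not on classpath";
        }
    }

    // Reports whether the class exposes a public method with this name and parameter list.
    static boolean hasMethod(String className, String name, Class<?>... params) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader())
                 .getMethod(name, params);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String cls = "org.apache.commons.compress.archivers.ArchiveStreamFactory";
        System.out.println("loaded from: " + locate(cls));
        System.out.println("has detect(InputStream): " + hasMethod(cls, "detect", InputStream.class));
    }
}
```

Run inside the same web application, this would show which JAR actually supplies ArchiveStreamFactory. If the method is reported missing, aligning the commons-compress version with the one the Tika release declares as a dependency is a reasonable next step.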

Solr open document after searching a keyword

旧巷老猫 submitted on 2020-01-02 10:58:33
Question: I am trying to index some PDF documents and then create a search UI. This question is somewhat related to "Solr Index PDF documents and post them to a remote server". 1) Indexing PDF docs: I use the Tika JAR to convert PDFs to text files and then use a curl command to index them. 2) Search UI: I'm using the Solritas browse feature and its built-in UI. Objective: when I search for a word, say "Lucene", in the list of indexed documents and get a result set for the query, I want a link to be

how to parse html with nutch and index specific tag to solr?

為{幸葍}努か submitted on 2019-12-30 10:08:42
Question: I have installed Nutch and Solr to crawl a website and search it. As you know, we can index the meta tags of web pages into Solr with Nutch's parse meta tags plugin (http://wiki.apache.org/nutch/IndexMetatags). Now I want to know whether there is any way (a plugin or otherwise) to index another HTML tag, one that isn't a meta tag, into Solr. For example: <div id=something> me specific tag </div> That is, I want to add a field to Solr (something) whose value is "me specific tag" for this page. Any ideas? Answer 1: I made my

How do I index documents in SOLR?

混江龙づ霸主 submitted on 2019-12-30 04:40:08
Question: I'm running Solr 1.4 on Ubuntu 10.04 (installed via apt-get solr-tomcat) and it seems to be working fine. I'm having some difficulty finding any coherent info on how to index documents, though. I'm new to Solr, so bear with me! I have a folder (/mnt/folder) that is a mounted Windows share containing Word and PDF files that I would like indexed. What's the easiest way to get Solr to index the entire folder? The documentation for Solr is pretty poor; it's impossible to find any decent tutorials

HTML Formatted Cell value from Excel using Apache POI

拟墨画扇 submitted on 2019-12-29 06:28:08
Question: I am using Apache POI to read an Excel document. So far it serves my purpose. The one thing I am stuck on is extracting the value of a cell as HTML. I have a cell in which the user enters a string and applies some formatting (bullets, numbering, bold, italic, etc.). When I read it, the content should be in HTML format, not the plain string that POI returns. I have gone through almost the entire POI API but could not find anything for this. I want to

Apache Tika extract scanned PDF files

陌路散爱 submitted on 2019-12-28 12:35:08
Question: I'm having some trouble using Apache Tika (version 1.10). I have some PDF files that are just scanned pieces of paper, which means each page is just an image. My goal is to extract the text of the PDF files anyway. Tesseract is set up correctly, and extracting text from JPG and PNG files works like a charm. The code I'm using looks like this (don't mind the missing exception handling): public String extractText(InputStream stream) { AutoDetectParser parser = new AutoDetectParser(); BodyContentHandler