apache-tika

Indexing PDF with Solr

心不动则不痛 posted on 2019-11-30 08:26:27
Can anyone point me to a tutorial? My main experience with Solr is indexing CSV files, but I cannot find any simple instructions or tutorial that tells me what I need to do to index PDFs. I have seen this: http://wiki.apache.org/solr/ExtractingRequestHandler but it makes very little sense to me. Do I need to install Tika? I'm lost, please help.

With solr-4.9 (the latest version as of now), extracting data from rich documents like PDFs, spreadsheets (xls, xlsx family), presentations (ppt, pptx), and documentation (doc, txt, etc.) has become fairly simple. The sample code examples provided in the downloaded
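
As a rough sketch of what the ExtractingRequestHandler expects: Solr ships Tika inside its extraction contrib, so no separate Tika install is needed, and a PDF can be streamed to the handler as the raw body of an HTTP POST. The class name SolrPdfPoster, the handler path /solr/update/extract, and the document id below are assumptions for illustration; the path depends on how the handler is registered in your solrconfig.xml.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.file.Files;
import java.nio.file.Paths;

class SolrPdfPoster {

    /** POSTs the PDF bytes to the extract handler and returns the HTTP status code. */
    static int postPdf(String extractUrl, String docId, byte[] pdfBytes) throws IOException {
        // literal.id sets the unique key of the document; commit=true makes it searchable at once.
        String query = "?literal.id=" + URLEncoder.encode(docId, "UTF-8") + "&commit=true";
        HttpURLConnection conn = (HttpURLConnection) new URL(extractUrl + query).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/pdf");
        try (OutputStream body = conn.getOutputStream()) {
            body.write(pdfBytes);
        }
        int status = conn.getResponseCode();
        conn.disconnect();
        return status;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java SolrPdfPoster.java http://localhost:8983/solr/update/extract doc1 file.pdf
        byte[] pdf = Files.readAllBytes(Paths.get(args[2]));
        System.out.println("HTTP status: " + postPdf(args[0], args[1], pdf));
    }
}
```

A status of 200 only means Solr accepted the request; which fields the extracted text lands in is decided by the handler's configuration.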

Getting MimeType subtype with Apache tika

青春壹個敷衍的年華 posted on 2019-11-30 07:20:09
I need to get the iana.org MediaType rather than application/zip or application/x-tika-msoffice for documents like odt, ppt, pptx, xlsx, etc. If you look at mimetypes.xml, there are mime-type elements composed of the iana.org mime-type and a "sub-class-of" element: <mime-type type="application/msword"> <alias type="application/vnd.ms-word"/> ............................ <glob pattern="*.doc"/> <glob pattern="*.dot"/> <sub-class-of type="application/x-tika-msoffice"/> </mime-type> How do I get the iana.org mime-type name instead of the parent type name? When testing mime type detection, I do: MediaType

Getting MimeType subtype with Apache tika

∥☆過路亽.° posted on 2019-11-29 09:30:19
Question: I need to get the iana.org MediaType rather than application/zip or application/x-tika-msoffice for documents like odt, ppt, pptx, xlsx, etc. If you look at mimetypes.xml, there are mime-type elements composed of the iana.org mime-type and a "sub-class-of" element: <mime-type type="application/msword"> <alias type="application/vnd.ms-word"/> ............................ <glob pattern="*.doc"/> <glob pattern="*.dot"/> <sub-class-of type="application/x-tika-msoffice"/> </mime-type> How to get the iana.org

HTML Formatted Cell value from Excel using Apache POI

雨燕双飞 posted on 2019-11-29 05:09:49
I am using Apache POI to read an Excel document. It serves my purpose so far, but one thing I am stuck on is extracting the value of a cell as HTML. I have one cell in which the user enters some string and applies some formatting (bullets, numbering, bold, italic, etc.). So when I read it, the content should be in HTML format and not the plain string that POI returns. I have gone through almost the entire POI API but have not been able to find a way to do this. I want to retain the formatting of just one particular column and not the entire Excel file. By column I mean, the text

How to use Tika in server mode

故事扮演 posted on 2019-11-28 18:42:10
On Tika's website it says (concerning tika-app-1.2.jar) that it can be used in server mode. Does anyone know how to send documents to this server and receive parsed text from it once it is running?

Tika supports two "server" modes. The simpler and original one is the --server flag of Tika-App. The more functional, but also more recent, one is the JAX-RS (JSR-311) server component, which is an additional jar. The Tika-App Network Server is very simple to use. Simply start Tika-App with the --server flag and a --port ### flag telling it what port to listen on. Then, connect to that port and send it a single file.
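
The "connect to that port and send it a single file" step can be sketched with a plain socket. This is a minimal sketch, not Tika's own client: the class name TikaSocketClient is hypothetical, and it assumes the server parses once the client stops sending and that its reply can be decoded as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class TikaSocketClient {

    /** Sends the document bytes to host:port and returns whatever the server writes back. */
    static String extract(String host, int port, byte[] document) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(document);
            out.flush();
            // Half-close the connection: the server sees end-of-input,
            // while the read side stays open for the reply.
            socket.shutdownOutput();

            ByteArrayOutputStream response = new ByteArrayOutputStream();
            InputStream in = socket.getInputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                response.write(buffer, 0, n);
            }
            // Assumption: the reply is UTF-8 encoded text.
            return new String(response.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java TikaSocketClient.java localhost 1234 report.pdf
        byte[] document = Files.readAllBytes(Paths.get(args[2]));
        System.out.println(extract(args[0], Integer.parseInt(args[1]), document));
    }
}
```

The half-close matters here: without it, a server that reads until end-of-stream would keep waiting for more input and never send its reply.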

Tika-Parsers deployment issue on Wildfly

回眸只為那壹抹淺笑 posted on 2019-11-28 10:08:38
Question: As part of a web application I need to parse the textual content of different incoming files. This should be quite simple using tika-parsers, but as soon as I try to deploy my webapp on Wildfly (tested with V.8.2.1 and V.10.0.0.RC4) I run into problems. This is my Maven dependency in a very basic webapp: <groupId>org.apache.tika</groupId> <artifactId>tika-parsers</artifactId> <version>1.11</version> This is the error I get during deployment (manual deployment or using Arquillian for testing): Caused

Is it possible to extract text by page for word/pdf files using Apache Tika?

﹥>﹥吖頭↗ posted on 2019-11-28 10:05:43
All the documentation I can find seems to suggest I can only extract the entire file's content. But I need to extract pages individually. Do I need to write my own parser for that? Is there some obvious method that I am missing?

Actually, Tika does handle pages (at least in PDF) by sending the elements <div><p> before a page starts and </p></div> after a page ends. You can easily set up a page count in your handler using this (just counting pages using only <p>): public abstract class MyContentHandler implements ContentHandler { private String pageTag = "p"; protected int pageNumber = 0; ... @Override
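
The handler idea above can be sketched as a self-contained class. This is an illustration rather than the answerer's code: PageTextHandler is a hypothetical name, and it assumes each page arrives wrapped in a <div class="page"> element, which is how Tika's PDF parser is generally observed to mark pages. Check the XHTML your Tika version actually emits before relying on it. The main method feeds a hand-written XHTML string through the JDK's SAX parser so the sketch runs without Tika on the classpath.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PageTextHandler extends DefaultHandler {
    // One buffer per page, in document order.
    private final List<StringBuilder> pages = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Non-namespace-aware parsers leave localName empty, so fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("div".equals(name) && "page".equals(atts.getValue("class"))) {
            pages.add(new StringBuilder()); // a new page starts here
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text seen before the first page marker is ignored.
        if (!pages.isEmpty()) {
            pages.get(pages.size() - 1).append(ch, start, length);
        }
    }

    int pageCount() {
        return pages.size();
    }

    String pageText(int index) {
        return pages.get(index).toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><div class=\"page\"><p>first page</p></div>"
                + "<div class=\"page\"><p>second page</p></div></body></html>";
        PageTextHandler handler = new PageTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        for (int i = 0; i < handler.pageCount(); i++) {
            System.out.println("Page " + (i + 1) + ": " + handler.pageText(i));
        }
    }
}
```

With Tika on the classpath, the same handler instance could be passed as the ContentHandler argument of a parser's parse call. Formats whose parsers emit no page markers, which may include Word documents, would yield zero pages here.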

Apache Tika extract scanned PDF files

可紊 posted on 2019-11-28 07:45:12
I'm having some trouble using Apache Tika (version 1.10). I have some PDF files which are just scanned pieces of paper, which means each page is just an image. My goal is to extract the text of the PDF files anyway. My Tesseract is set up correctly, and extracting text from JPG and PNG files works like a charm. The code I'm using looks like this (don't mind the missing exception handling): public String extractText(InputStream stream) { AutoDetectParser parser = new AutoDetectParser(); BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE); Metadata metadata = new Metadata(); ParseContext

PDF bullets are coming as question marks while parsing with Apache Tika in java

♀尐吖头ヾ posted on 2019-11-28 06:30:59
Question: I am parsing PDF files using Apache Tika (tika-app-1.3) with this code: InputStream input = new FileInputStream("Introduction.pdf"); AutoDetectParser parser = new AutoDetectParser(); BodyContentHandler handler = new BodyContentHandler(100 * 1024 * 1024); Metadata metadata = new Metadata(); parser.parse(input, handler, metadata); System.out.println(handler.toString()); handler.toString() displays the PDF text, but this text also contains bullets, which show up as a ? symbol, but I want
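
One possible cause, offered as an assumption since the excerpt is cut off before any answer: the extracted string holds a real bullet character (U+2022), but System.out encodes it with a platform charset that cannot represent it, so a ? is written instead. The sketch below uses a hypothetical class BulletEncodingDemo to show the same string encoded through a PrintStream in US-ASCII and in UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class BulletEncodingDemo {

    /** Encodes text through a PrintStream using the named charset and returns the raw bytes. */
    static byte[] render(String text, String charsetName) throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, charsetName);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String extracted = "\u2022 first item"; // stand-in for handler.toString()

        // A charset without the bullet silently substitutes '?'.
        System.out.println(new String(render(extracted, "US-ASCII"), "US-ASCII"));

        // Wrapping System.out in a UTF-8 PrintStream keeps the character intact,
        // provided the terminal itself displays UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(extracted);
    }
}
```

If the bullets are still wrong after writing UTF-8 to a file and opening it in a UTF-8 aware editor, the problem lies in the PDF's fonts or in the extraction step rather than in the console.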

How to determine appropriate file extension from MIME Type in Java

扶醉桌前 posted on 2019-11-27 18:46:15
I am uploading files to an Amazon S3 bucket and have access to the InputStream and a String containing the MIME type of the file, but not the original file name. It's up to me to create the file name and extension before pushing the file up to S3. Is there a library or convenient way to determine the appropriate extension to use from the MIME type? I've seen some references to the Apache Tika library, but that seems like overkill and I haven't been able to get it to successfully detect file extensions yet. From what I've been able to gather it seems like this code should work, but I'm