apache-tika | 易学教程

Indexing PDF with Solr

Can anyone point me to a tutorial? My main experience with Solr is indexing CSV files, but I cannot find any simple instructions or tutorial that tell me what I need to do to index PDFs. I have seen this: http://wiki.apache.org/solr/ExtractingRequestHandler but it makes very little sense to me. Do I need to install Tika? I'm lost, please help.

With solr-4.9 (the latest version as of now), extracting data from rich documents such as PDFs, spreadsheets (xls, xlsx family), presentations (ppt, pptx), and documents (doc, txt etc.) has become fairly simple. The sample code examples provided in the downloaded

Getting MimeType subtype with Apache tika

I need to get the iana.org MediaType rather than application/zip or application/x-tika-msoffice for documents such as odt, ppt, pptx, xlsx etc. If you look at mimetypes.xml, there are mime-type elements composed of the iana.org mime-type and a "sub-class-of" element:

    <mime-type type="application/msword">
      <alias type="application/vnd.ms-word"/>
      ............................
      <glob pattern="*.doc"/>
      <glob pattern="*.dot"/>
      <sub-class-of type="application/x-tika-msoffice"/>
    </mime-type>

How can I get the iana.org mime-type name instead of the parent type name? When testing mime type detection, I do: MediaType

HTML Formatted Cell value from Excel using Apache POI

I am using Apache POI to read an Excel document. So far it serves my purpose, but one thing I am stuck on is extracting the value of a cell as HTML. I have one cell in which the user enters some string and applies formatting (bullets, numbers, bold, italic, etc.). When I read it, the content should be in HTML format and not the plain string that POI returns. I have gone through almost the entire POI API but have not been able to find a way to do this. I want to retain the formatting of just one particular column and not the entire Excel file. By column I mean, the text

How to use Tika in server mode

On Tika's website it says (concerning tika-app-1.2.jar) that it can be used in server mode. Does anyone know how to send documents to this server and receive parsed text from it once it is running?

Tika supports two "server" modes. The simpler, original one is the --server flag of Tika-App. The more functional, but also more recent, one is the JAX-RS (JSR-311) server component, which is an additional jar. The Tika-App Network Server is very simple to use: start Tika-App with the --server flag and a --port ### flag telling it which port to listen on. Then connect to that port and send it a single file.
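
The "connect to that port and send it a single file" step can be sketched in plain Java as follows. This is a minimal illustration rather than an official client: the class name TikaSocketClient and the method sendAndReceive are made up for this example, and it assumes the server reads the document until the client closes its sending side and then writes the parsed result back as UTF-8 text on the same connection.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika.
public class TikaSocketClient {

    // Sends the raw document bytes to host:port and returns everything the
    // server writes back before it closes the connection.
    public static String sendAndReceive(String host, int port, byte[] document)
            throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(document);
            out.flush();
            // Half-close: tells the server the document is complete while
            // keeping the socket open for reading the reply.
            socket.shutdownOutput();

            InputStream in = socket.getInputStream();
            ByteArrayOutputStream reply = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                reply.write(chunk, 0, n);
            }
            // Assumption: the reply is UTF-8 encoded.
            return new String(reply.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

With a server started on some port, a call such as TikaSocketClient.sendAndReceive("localhost", port, Files.readAllBytes(path)) would return whatever the server emits for that file.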

Tika-Parsers deployment issue on Wildfly

Question: As part of a web application I need to parse the textual content of different incoming files. This should be quite simple using tika-parsers, but as soon as I try to deploy my webapp on Wildfly (tested with V.8.2.1 and V.10.0.0.RC4) I run into problems. This is my Maven dependency in a very basic webapp:

    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.11</version>

This is the error I get during deployment (manual deployment or using Arquillian for testing): Caused

Is it possible to extract text by page for word/pdf files using Apache Tika?

All the documentation I can find seems to suggest I can only extract the entire file's content, but I need to extract pages individually. Do I need to write my own parser for that? Is there some obvious method that I am missing?

Actually, Tika does handle pages (at least in PDF) by sending the elements <div><p> before a page starts and </p></div> after a page ends. You can easily set up a page count in your handler like this (counting pages using only <p>):

    public abstract class MyContentHandler implements ContentHandler {
        private String pageTag = "p";
        protected int pageNumber = 0;
        ...
        @Override
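
A complete handler along those lines might look like the sketch below. It is an illustration, not the answer's original code: the class PageTextHandler and its methods are hypothetical, and it assumes each page arrives wrapped in a <div class="page"> element, which is an assumption about the parser's XHTML output and should be checked against the Tika version in use. It depends only on the SAX classes shipped with the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that collects text per page from XHTML SAX events.
public class PageTextHandler extends DefaultHandler {
    private final List<StringBuilder> pages = new ArrayList<>();
    // Nesting depth of <div> elements inside the current page; 0 = outside a page.
    private int divDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = localName.isEmpty() ? qName : localName;
        if (!"div".equals(name)) {
            return;
        }
        if (divDepth > 0) {
            divDepth++;                       // nested div inside a page
        } else if ("page".equals(atts.getValue("class"))) {
            pages.add(new StringBuilder());   // assumption: this div starts a new page
            divDepth = 1;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = localName.isEmpty() ? qName : localName;
        if ("div".equals(name) && divDepth > 0) {
            divDepth--;                       // reaching 0 means the page has ended
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (divDepth > 0) {
            pages.get(pages.size() - 1).append(ch, start, length);
        }
    }

    public int getPageCount() {
        return pages.size();
    }

    // Returns the text of the page at the given zero-based index.
    public String getPageText(int index) {
        return pages.get(index).toString();
    }
}
```

Since DefaultHandler implements ContentHandler, an instance can be passed wherever a parser expects a ContentHandler; after parsing, getPageCount() and getPageText(i) expose the collected pages.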

Apache Tika extract scanned PDF files

I'm having some trouble using Apache Tika (version 1.10). I have some PDF files which are just scanned pieces of paper, which means each page is just an image. My goal is to extract the text of the PDF files anyway. My Tesseract is set up correctly, and extracting text from JPG and PNG files works like a charm. The code I'm using looks like this (don't mind the missing exception handling):

    public String extractText(InputStream stream) {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
        Metadata metadata = new Metadata();
        ParseContext

PDF bullets are coming as question marks while parsing with Apache Tika in java

Question: I am parsing PDF files using Apache Tika (tika-app-1.3) with this code:

    InputStream input = new FileInputStream("Introduction.pdf");
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(100 * 1024 * 1024);
    Metadata metadata = new Metadata();
    parser.parse(input, handler, metadata);
    System.out.println(handler.toString());

handler.toString() displays the PDF text, but this text also contains bullets, which show up as a ? symbol, but I want

How to determine appropriate file extension from MIME Type in Java

I am uploading files to an Amazon S3 bucket and have access to the InputStream and a String containing the MIME type of the file, but not the original file name. It's up to me to create the file name and extension before pushing the file up to S3. Is there a library or convenient way to determine the appropriate extension to use from the MIME type? I've seen some references to the Apache Tika library, but that seems like overkill and I haven't been able to get it to successfully detect file extensions yet. From what I've been able to gather it seems like this code should work, but I'm
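
If only a handful of MIME types is expected and a library really is overkill, one dependency-free option is a small lookup table. The sketch below is not the Tika approach the question refers to: the class MimeExtensions, the method extensionFor, and the chosen type-to-extension pairs are all illustrative assumptions, and any type outside the table yields an empty result.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: a hand-maintained MIME type to extension table.
public class MimeExtensions {
    // Illustrative entries only; extend with the types the application accepts.
    private static final Map<String, String> EXTENSIONS = Map.of(
            "image/jpeg", ".jpg",
            "image/png", ".png",
            "application/pdf", ".pdf",
            "text/plain", ".txt",
            "application/zip", ".zip");

    // Returns the extension (with leading dot) for a MIME type, if known.
    public static Optional<String> extensionFor(String mimeType) {
        if (mimeType == null) {
            return Optional.empty();
        }
        // Drop parameters such as "; charset=UTF-8" and normalise case.
        String base = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return Optional.ofNullable(EXTENSIONS.get(base));
    }
}
```

A caller could then build the S3 key with something like name + MimeExtensions.extensionFor(mime).orElse(""), falling back to no extension for unknown types.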