apache-tika | 易学教程

Correct use of Apache Tika MediaType

Question: I want to use Apache Tika's MediaType class to compare media types. I first use Tika to detect the MediaType, then I want to trigger an action based on it: if the MediaType is an XML type I want to perform one action, and if it is a compressed file I want to perform another. My problem is that there are many XML types, so how do I check whether something is XML using the MediaType? Here is my previous (pre-Tika) implementation: if (contentType.contains("text/xml") || contentType
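As a point of comparison for the string-based check quoted in the question, here is a minimal sketch that uses only the JDK, not Tika. It treats a content type as XML when its base type is text/xml or application/xml, or when its subtype carries the "+xml" structured syntax suffix. The class and method names (XmlTypeCheck, isXml) are hypothetical and do not come from the original post. Tika's MediaTypeRegistry can answer the same question from its type hierarchy, which this sketch does not attempt to reproduce.

```java
import java.util.Locale;

public class XmlTypeCheck {

    /** Returns true if the content type names XML or an XML-based format. */
    public static boolean isXml(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" and normalise the case.
        int semicolon = contentType.indexOf(';');
        String base = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        // Plain XML types, plus any subtype with the "+xml" suffix (e.g. application/rss+xml).
        return base.equals("text/xml")
                || base.equals("application/xml")
                || base.endsWith("+xml");
    }

    public static void main(String[] args) {
        System.out.println(isXml("application/rss+xml; charset=UTF-8"));
        System.out.println(isXml("application/zip"));
    }
}
```

Unlike a chain of contains() calls, this compares the whole base type, so a parameter value that happens to contain "text/xml" cannot trigger a false match.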

Mimetype check using Tika jars

Question: I am developing a standalone Java batch process and am trying to determine the MIME type of a file attachment using the Tika JARs (Tika 1.4). My code looks like this: Parser parser = new AutoDetectParser(); InputStream stream = new FileInputStream(fileAttachment); int writerHandler = -1; ContentHandler contentHandler = new BodyContentHandler(writerHandler); Metadata metadata = new Metadata(); parser.parse(stream, contentHandler, metadata, new ParseContext()); String mimeType = metadata.get

Memory Leak Issue With PDFBox

Question: I am using PDFBox version 2.0.9 in my application and have to parse large PDF files from the web. This is the code I am using. MimeTypeDetector class: @Getter @Setter class MimeTypeDetector { private ByteArrayInputStream byteArrayInputStream; private BodyContentHandler bodyContentHandler; private Metadata metadata; private ParseContext parseContext; private Detector detector; private TikaInputStream tikaInputStream; MimeTypeDetector(ByteArrayInputStream byteArrayInputStream) { this

Extract Images from PDF with Apache Tika

Question: Apache Tika 1.6 has the ability to extract inline images from PDF documents. However, I've been struggling to get it to work. My use case is that I want some code that will extract the content and, separately, the images from any document (not necessarily a PDF). The result then gets passed into an Apache UIMA pipeline. I've been able to extract images from other document types by using a custom parser (built on an AutoParser) to convert the documents to HTML and then save the images out separately.

Indexing PDF with Solr

Question: Can anyone point me to a tutorial? My main experience with Solr is indexing CSV files, but I cannot find any simple instructions or tutorial that tell me what I need to do to index PDFs. I have seen this: http://wiki.apache.org/solr/ExtractingRequestHandler but it makes very little sense to me. Do I need to install Tika? I'm lost, please help. Answer 1: With solr-4.9 (the latest version as of now), extracting data from rich documents like PDFs, spreadsheets (xls, xlsx family), presentations (ppt, ppts),

How to determine appropriate file extension from MIME Type in Java

Question: I am uploading files to an Amazon S3 bucket and have access to the InputStream and a String containing the MIME type of the file, but not the original file name. It's up to me to create the file name and extension before pushing the file up to S3. Is there a library or convenient way to determine the appropriate extension to use from the MIME type? I've seen some references to the Apache Tika library but that seems like overkill and I haven't been able to get it to successfully detect
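For a small, known set of upload types, one lightweight option is a hand-maintained lookup table. The sketch below uses only the JDK and is not drawn from the original post. The class name MimeExtensions, the method extensionFor, and the entries in the table are all illustrative assumptions, and the table would need to be extended to cover whatever types the application actually accepts.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

public class MimeExtensions {

    // Hypothetical hand-picked table; not an exhaustive registry of MIME types.
    private static final Map<String, String> EXTENSIONS = Map.of(
            "image/jpeg", ".jpg",
            "image/png", ".png",
            "image/gif", ".gif",
            "application/pdf", ".pdf",
            "text/plain", ".txt",
            "application/zip", ".zip");

    /** Looks up a file extension for a MIME type, ignoring case and any parameters. */
    public static Optional<String> extensionFor(String mimeType) {
        if (mimeType == null) {
            return Optional.empty();
        }
        // Strip parameters such as "; charset=UTF-8" before the lookup.
        int semicolon = mimeType.indexOf(';');
        String base = (semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return Optional.ofNullable(EXTENSIONS.get(base));
    }

    public static void main(String[] args) {
        // Fall back to a generic extension when the type is not in the table.
        String extension = extensionFor("image/png").orElse(".bin");
        System.out.println("upload" + extension);
    }
}
```

Returning an Optional leaves the fallback decision (a generic extension, or rejecting the upload) to the caller.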

How to compare two PDFs based on visual differences programmatically? [closed]

Question: I need to compare two PDF files and get all the visual differences between them. I know there are related questions on Stack Overflow, but they do not meet my needs. I'm currently using PDFBox to generate images of the PDF pages and comparing the bytes of the images. By
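The image comparison described above can be done on decoded pixels rather than on encoded image bytes, which avoids false differences caused by compression or metadata. The following is a minimal sketch under that assumption. It expects the page images to have been rendered already (for example with PDFBox's PDFRenderer) and only counts pixels that differ. The class name PageImageDiff and the method countDifferingPixels are hypothetical.

```java
import java.awt.image.BufferedImage;

public class PageImageDiff {

    /** Counts the pixels whose ARGB values differ between two equally sized page images. */
    public static long countDifferingPixels(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("Page images must have the same dimensions");
        }
        long differing = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    differing++;
                }
            }
        }
        return differing;
    }

    public static void main(String[] args) {
        // Two small stand-in "pages"; real input would be rendered PDF pages.
        BufferedImage first = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        BufferedImage second = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        second.setRGB(1, 1, 0xFF0000); // change a single pixel to red
        System.out.println(countDifferingPixels(first, second));
    }
}
```

An exact per-pixel comparison is strict: anti-aliasing or different rendering settings can flag pages that look identical, so a tolerance per colour channel may be needed in practice.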

Java RTF import, edit and export possible?

Question: I use Apache Tika to parse RTF files and get the plain text as a string. I then want to remove some characters from this string, which works fine, and save the result as RTF again. (Think of this process as modifying an RTF file by deleting a paragraph.) How is this possible? How can I export this string to RTF with Tika? Answer 1: There is a way to edit documents, but it is a little complex. You can use the OpenOffice API to open many types of documents and export them to other formats. I used it,

Define a MIME type for .TXT files for Tika

Question: I want to define the MIME type of *.txt files as text/txt, so that Tika can apply a more specific parser than the one used for text/plain files. The glob *.txt is included in the definition of the type text/plain in tika-mimetypes.xml. Moreover, it seems to me that you cannot redefine a MIME type in custom-mimetypes.xml, only add new globs or magic patterns. Additionally, if I define the text/txt type in tika-mimetypes.xml as a subtype of text/plain with only the glob *.txt, Tika still

Issues using Apache Tika Parser object to parse .doc and .docx file formats

Question: I am trying to use org.apache.tika.parser.Parser and DefaultDetector() to detect and parse the .doc and .docx file formats, but I am getting an error (not an exception) thrown from the Tika JARs, and it does not come with any helpful stack trace that I could post here. I can confirm that it happens for .doc and .docx only; PDF, JPEG, and text files are fine. Has anyone come across this problem with .doc and .docx file formats? Is there a solution that you have adopted? My code is the following: unzippedBytes