Mimetype check using Tika jars

Anonymous (unverified), submitted 2019-12-03 02:27:02

Question:

I am developing a standalone Java batch process. I am trying to determine the MIME type of file attachments using the Tika jars. I am using the Tika 1.4 jar files.

My code looks like this:

Parser parser = new AutoDetectParser();
InputStream stream = new FileInputStream(fileAttachment);
int writerHandler = -1; // -1 disables the BodyContentHandler write limit
ContentHandler contentHandler = new BodyContentHandler(writerHandler);
Metadata metadata = new Metadata();
parser.parse(stream, contentHandler, metadata, new ParseContext());
String mimeType = metadata.get(Metadata.CONTENT_TYPE);
logger.debug("File Attachment: " + fileAttachment.getName() + " MimeType is: " + mimeType);

This code does not work properly for Office 2003 and 2007 documents.

When I run it from Eclipse, I get the correct MIME types.

When I build a jar file and run it from the command line, it gives the wrong MIME types.

Output from the command line
------------
File Attachment: Testpdf.pdf  MimeType is: application/pdf
File Attachment: Testpdf.tif  MimeType is: image/tiff
File Attachment: Testpdf.xlsx  MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.xltx  MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.pptx  MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.docx  MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.xls  MimeType is: application/zip
File Attachment: Testpdf.doc  MimeType is: application/x-tika-msoffice
File Attachment: Testpdf.dot  MimeType is: application/x-tika-msoffice
File Attachment: Testpdf.ppt  MimeType is: application/x-tika-msoffice
File Attachment: Testpdf.xlt  MimeType is: application/vnd.ms-excel

I tried OfficeParser and OOXMLParser; neither works. I have also tried the Tika 0.9 jar files. With those the MIME types come out correctly, but if any one of my file attachments is an "editable PDF", my batch process dies (as if the code had called "exit(0);"). With the newer Tika jars I get the wrong MIME types.

Please help me with this. Thanks in advance.

CVSR Sarma

Answer 1:

Firstly, you're using the wrong bit of Apache Tika. If all you want to know is the file type, then you should use the Detection API (javadocs) directly, eg:

TikaConfig tika = new TikaConfig();

Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
// detect() returns a MediaType, so convert it to get the string form
String mimetype = tika.getDetector().detect(stream, metadata).toString();

If you have only the tika-core jar on your classpath, then the detection above will use Mime Magic and Filename hints. That'll let it get most files, especially if they have the right extension, but it'll struggle with wrongly named "container formats".
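To make the "Mime Magic and Filename hints" idea concrete, here is a minimal plain-JDK sketch. It is not Tika's implementation: the class name MagicSketch is hypothetical, and the signature and extension tables are a small hand-picked subset chosen for illustration. It shows why a correctly named container file can be identified precisely while a misnamed one only gets the generic container type.

```java
import java.util.Locale;

// Illustrative stand-in for "magic bytes plus filename hint" detection.
// Hypothetical class; the tables below are a tiny subset, not Tika's registry.
class MagicSketch {

    // True if 'head' begins with the given signature bytes.
    static boolean startsWith(byte[] head, int... signature) {
        if (head.length < signature.length) return false;
        for (int i = 0; i < signature.length; i++) {
            if ((head[i] & 0xFF) != signature[i]) return false;
        }
        return true;
    }

    // Guess a type from the leading bytes alone.
    static String fromMagic(byte[] head) {
        if (startsWith(head, 0x25, 0x50, 0x44, 0x46)) return "application/pdf"; // "%PDF"
        if (startsWith(head, 0x50, 0x4B, 0x03, 0x04)) return "application/zip"; // "PK" local header
        if (startsWith(head, 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1)) {
            return "application/x-tika-msoffice"; // generic OLE2 container
        }
        return "application/octet-stream";
    }

    // Narrow a generic container type using the file extension, if it is one we know.
    static String refineByName(String magicType, String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (magicType.equals("application/zip")) {
            if (lower.endsWith(".docx")) {
                return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
            }
            if (lower.endsWith(".xlsx")) {
                return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
            }
        }
        if (magicType.equals("application/x-tika-msoffice")) {
            if (lower.endsWith(".doc")) return "application/msword";
            if (lower.endsWith(".xls")) return "application/vnd.ms-excel";
        }
        return magicType; // no usable hint: stay with the generic answer
    }

    static String detect(byte[] head, String fileName) {
        return refineByName(fromMagic(head), fileName);
    }

    public static void main(String[] args) {
        byte[] zipHead = {0x50, 0x4B, 0x03, 0x04};
        System.out.println(detect(zipHead, "report.xlsx")); // extension narrows the zip
        System.out.println(detect(zipHead, "report.bin"));  // misnamed: stays generic
    }
}
```

With only the leading bytes and the name to go on, a misnamed spreadsheet is indistinguishable from any other zip file, which is the limitation described above.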

Container Formats are things like zip, ole2 etc, where one file format can hold many types (eg ods, xlsx and keynote all use zip; .doc and .xls both use ole2). If you want to do detection that looks inside containers for more accurate results, you need to also include the tika-parsers jar and its dependencies.
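As a rough illustration of what "looking inside a container" means, the following plain-JDK sketch opens a zip and checks for a few well-known OOXML part names. It is an assumption-laden simplification, not what the Tika parsers actually do: the class ContainerSketch and its helper zipWith are hypothetical, and real detection uses more robust rules than matching entry names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch: identify a zip-based document by the entries it contains.
class ContainerSketch {

    // Walk the zip entries and map a few well-known part names to a document type.
    static String sniffZip(byte[] data) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                switch (entry.getName()) {
                    case "word/document.xml":
                        return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
                    case "xl/workbook.xml":
                        return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
                    case "ppt/presentation.xml":
                        return "application/vnd.openxmlformats-officedocument.presentationml.presentation";
                    default:
                        break; // not a part we recognise, keep scanning
                }
            }
            return "application/zip"; // a zip, but nothing recognisable inside
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: builds an in-memory zip containing the given entry names.
    static byte[] zipWith(String... entryNames) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(sniffZip(zipWith("[Content_Types].xml", "xl/workbook.xml")));
        System.out.println(sniffZip(zipWith("notes.txt")));
    }
}
```

Because the decision is based on the contents rather than the name, this kind of check still works when the file has the wrong extension, at the cost of reading into the file.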

Note that, as explained in the Javadocs, your stream needs to support mark and reset for detection to work. This is so that Tika can read the first bit of your stream, look at it to work out what your file is, then return the stream to how it was, ready for other uses (eg parsing). Most streams should, but if yours doesn't, the simplest way to fix it is to wrap it in a TikaInputStream via TikaInputStream.get, which sorts all that out for you.
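The mark/reset requirement can be seen with the JDK alone. The sketch below is only an analogue of the idea, not a replacement for TikaInputStream.get: the class PeekSketch and its methods are hypothetical names, and BufferedInputStream is used here as the plain-JDK way to add mark support to a stream that lacks it.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of "peek at the start of a stream, then rewind".
class PeekSketch {

    // Return the stream unchanged if it supports mark/reset, otherwise wrap it.
    static InputStream markable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    // Read up to 'limit' leading bytes, then rewind so later readers see the whole stream.
    static byte[] peek(InputStream in, int limit) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(limit);
            byte[] head = new byte[limit];
            int len = 0;
            while (len < limit) {
                int n = in.read(head, len, limit - len);
                if (n < 0) break; // stream is shorter than 'limit'
                len += n;
            }
            in.reset(); // back to the marked position
            return Arrays.copyOf(head, len);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: a stream whose markSupported() returns false.
    static InputStream unmarkable(byte[] data) {
        ByteArrayInputStream source = new ByteArrayInputStream(data);
        return new InputStream() {
            @Override
            public int read() {
                return source.read();
            }
        };
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.4 rest of file".getBytes(StandardCharsets.US_ASCII);
        InputStream in = markable(unmarkable(data));
        System.out.println(new String(peek(in, 4), StandardCharsets.US_ASCII));
        System.out.println((char) in.read()); // the first byte is still available
    }
}
```

After peek returns, the stream is positioned at its start again, so the same stream can be handed on to a parser without reopening the file.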


