How to detect images in a document

白昼怎懂夜的黑 submitted on 2019-12-10 21:27:15

Question


How can I detect images in a document, say a doc, xls, ppt or pdf file?

I came across Apache Tika and am trying its command-line option: http://tika.apache.org/1.2/gettingstarted.html

But I am not quite sure how it will detect images.

Any help is appreciated.

Thanks


Answer 1:


You've said you want to use a command-line solution and not write any Java code, so it's not going to be the prettiest way to do it... If you are happy to write a little bit of Java, and create a new program to call from Python, then you can do it much more nicely!

The first thing to do is to have the Tika App extract any embedded resources within your file. Use the --extract option for this, and have the extraction occur in a special temp directory your app controls, e.g.

$ java -jar tika.jar --extract ../testWORD_embedded_pdf.doc
Extracting 'image1.emf' (application/x-emf)
Extracting '_1402837031.pdf' (application/pdf)

Grab the output of the extraction if you can, and parse it looking for images (but be aware that some images have an application/ prefix on their canonical MIME type!). You might need to run a second --detect step on a few; I'm not sure, so test how the parsers get on with the extraction.
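
The parsing step might look something like the following Python sketch. It assumes the Tika App prints one line per resource in the form shown in the sample output above. The function names and the set of application/ types treated as images are illustrative assumptions, not part of Tika, so adjust them after testing against your own files and Tika version.

```python
import re

# Matches lines such as: Extracting 'image1.emf' (application/x-emf)
# Assumption: the line format is taken from the sample output above.
EXTRACT_LINE = re.compile(r"^Extracting '(.+)' \(([^)]+)\)\s*$", re.MULTILINE)

# Assumption: a hand-picked set of image formats that may be reported with
# an application/ prefix. Extend it after testing on your own documents.
APPLICATION_IMAGE_TYPES = {"application/x-emf", "application/x-msmetafile"}


def parse_extraction_output(output):
    """Return a list of (file name, MIME type) pairs found in the output."""
    return EXTRACT_LINE.findall(output)


def is_image(mime_type):
    """True for image/* types and for the known application/ image types."""
    return mime_type.startswith("image/") or mime_type in APPLICATION_IMAGE_TYPES


def find_images(output):
    """Return the names of extracted resources that look like images."""
    return [name for name, mime in parse_extraction_output(output) if is_image(mime)]
```

Calling find_images on the captured output of the --extract run gives the file names worth processing; anything with an unrecognised type could be handed to a second --detect pass.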

Now, if there were images, they'll be in your temp dir. Process them as you want. Finally, zap the temp dir when you're done with the file!
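
The whole temp-directory workflow could be scripted roughly as below. This is only a sketch under assumptions: extract_images, its run parameter and the extension list are hypothetical names chosen for illustration, the tika.jar path is a placeholder, and images are picked by file extension rather than by the type Tika reports. It runs the same --extract command with the temp directory as the working directory, so it relies on Tika writing extracted files into the current directory.

```python
import os
import subprocess
import tempfile

# Assumption: a small, illustrative list of image file extensions.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tif", ".tiff", ".emf", ".wmf")


def extract_images(doc_path, tika_jar="tika.jar", run=subprocess.run):
    """Extract embedded resources into a throwaway directory and
    return the image files as a dict of relative path -> bytes."""
    # Absolute paths are needed because the command runs inside the temp dir.
    doc_path = os.path.abspath(doc_path)
    tika_jar = os.path.abspath(tika_jar)
    images = {}
    with tempfile.TemporaryDirectory() as tmp:
        run(["java", "-jar", tika_jar, "--extract", doc_path], cwd=tmp, check=True)
        for root, _dirs, files in os.walk(tmp):
            for name in files:
                if name.lower().endswith(IMAGE_EXTENSIONS):
                    path = os.path.join(root, name)
                    with open(path, "rb") as fh:
                        images[os.path.relpath(path, tmp)] = fh.read()
    # Leaving the with-block deletes the temp dir and everything in it.
    return images
```

Reading the image bytes before the directory is removed keeps the cleanup automatic; if the files are large, it may be better to process them inside the with-block instead.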




Answer 2:


Having used Tika in the past, I originally could not see how it could help with images embedded within Office documents or PDFs, but I was wrong to answer no. You may still want to resort to the native APIs, such as Apache POI and Apache PDFBox. Tika uses both libraries to parse text and metadata, and I had believed it offered no embedded image support (see the update below).

Using Tika makes these APIs automatically available (side effect of using Tika).

UPDATE: Since Tika 0.8: look for EmbeddedResourceHandler and examples - thanks to Gagravarr.



Source: https://stackoverflow.com/questions/11932762/how-to-detect-image-in-a-document
