Extract Images and Words with coordinates and sizes from PDF

回眸只為那壹抹淺笑 submitted on 2019-11-30 13:39:29
Balamurugan Muthiah

Use XPDF (http://www.foolabs.com/xpdf/)

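It can extract the text in the PDF together with its coordinates (pdftotext -bbox [sourcefile] [outputfile]), and the companion pdfimages tool extracts the images. Note that -bbox writes one bounding box per word rather than per character, and that the option is documented for the pdftotext that ships with Poppler (a fork of Xpdf), so check pdftotext -h on your build before relying on it.

It's open source (GPLv2) and supports many additional extraction features as well.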
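As a rough sketch of how the -bbox output could be turned into words with positions and sizes: the snippet below assumes the output file is XHTML containing page elements with word children that carry xMin, yMin, xMax and yMax attributes, which is how Poppler's pdftotext writes it. The function name parse_bbox_words and the SAMPLE text are made up for illustration; the sample is hand-written, not real tool output.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://www.w3.org/1999/xhtml}'."""
    return tag.rsplit("}", 1)[-1]


def parse_bbox_words(xhtml_text):
    """Return one dict per <word> with its page number, text, position and size.

    Assumes the layout produced by `pdftotext -bbox` (Poppler): <page> elements
    containing <word xMin=".." yMin=".." xMax=".." yMax="..">text</word>.
    """
    root = ET.fromstring(xhtml_text)
    words = []
    page_no = 0
    for page in root.iter():
        if local_name(page.tag) != "page":
            continue
        page_no += 1
        for elem in page.iter():
            if local_name(elem.tag) != "word":
                continue
            x_min = float(elem.get("xMin"))
            y_min = float(elem.get("yMin"))
            x_max = float(elem.get("xMax"))
            y_max = float(elem.get("yMax"))
            words.append({
                "page": page_no,
                "text": elem.text or "",
                "x": x_min,
                "y": y_min,
                "width": x_max - x_min,
                "height": y_max - y_min,
            })
    return words


# Hand-written sample in the assumed format (not real pdftotext output).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<doc>
  <page width="595.0" height="842.0">
    <word xMin="56.0" yMin="57.0" xMax="84.0" yMax="68.0">Hello</word>
    <word xMin="88.0" yMin="57.0" xMax="120.0" yMax="68.0">PDF</word>
  </page>
</doc>
</body>
</html>"""

if __name__ == "__main__":
    for word in parse_bbox_words(SAMPLE):
        print(word)
```

In practice you would read the file written by pdftotext and pass its contents to the function. The coordinates are whatever the tool reports, so check which corner of the page it uses as the origin before mapping them back onto the PDF.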

Several Java libraries can do this. Have you looked at JPedal or PDFBox?

If a commercial library is an option for you, you could try Amyuni PDF Creator .Net or Amyuni PDF Creator ActiveX. You could use the method IacDocument.GetObjectsInRectangle to retrieve all the "graphic objects" you are interested in, then use the ObjectType attribute to separate images from text. The library already provides an algorithm for grouping nearby pieces of text together. From the documentation:

IacDocument.GetObjectsInRectangle Method

The GetObjectsInRectangle method gets all the objects that are in the specified rectangle.

Usual disclaimer applies.
