Description of each constant defined in PdfName, since I need to retrieve both images and text at the same time

天大地大妈咪最大 submitted on 2019-12-13 22:04:46

Question


I am having trouble retrieving images and text from a PDF file at the same time. I was able to get the images and the text, but not in a single pass, which raises the question of whether to render the images or the text first, for example in my panel control. Could you help me understand what each constant in PdfName means? I tried PdfName.ALL, but it returns null, whereas PdfName.RESOURCES returns ProcSet, Font and XObject. I used XObject for the images, but what are ProcSet and Font? Could Font be the style of the text? Is there a PdfName.TEXT for retrieving text?

Thanks in advance.


Answer 1:


First of all,

"I am having trouble retrieving images and text from a PDF file at the same time"

For this task you should use the iText(Sharp) parser API. In iTextSharp you essentially implement IRenderListener (an interface whose methods are called for each bitmap image and text fragment in a content stream) and process the page contents with it:

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

PdfReader reader = new PdfReader(...);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
int pageNumber = [... the number of the page you are interested in; may be a loop variable ...];

// The parser calls back into the listener for every text fragment and image on the page.
IRenderListener listener = new [... your IRenderListener implementation ...];
parser.ProcessContent(pageNumber, listener);

You ask

"whether to render the image first or the text first, for example in my panel control"

The IRenderListener methods also receive information on the location of the bitmap or text fragment in question, so you can place each item by its coordinates instead of depending on the order in which it was reported.

For ideas on how to combine the text fragments in your listener, look at the SimpleTextExtractionStrategy and LocationTextExtractionStrategy implementations that ship with iTextSharp.
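As a rough illustration of such a listener, here is a minimal sketch that records every text chunk and image together with its position. The names PageFragment and FragmentCollector are hypothetical and not part of iTextSharp. The sketch also assumes unrotated images, so that the translation part of the image matrix can be read as the image's lower-left corner.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

// Hypothetical record type: one text or image fragment and its position on the page.
public class PageFragment
{
    public bool IsImage;
    public string Text;   // stays null for images
    public float X;       // PDF user space, origin at the lower left of the page
    public float Y;
}

// Hypothetical listener that collects fragments in the order the content stream draws them.
public class FragmentCollector : IRenderListener
{
    public readonly List<PageFragment> Fragments = new List<PageFragment>();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        // The start point of the baseline is where this text chunk begins.
        Vector start = renderInfo.GetBaseline().GetStartPoint();
        Fragments.Add(new PageFragment
        {
            IsImage = false,
            Text = renderInfo.GetText(),
            X = start[Vector.I1],
            Y = start[Vector.I2]
        });
    }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        // Assumption: the image is not rotated, so the translation entries
        // of the matrix give its lower-left corner.
        Matrix ctm = renderInfo.GetImageCTM();
        Fragments.Add(new PageFragment
        {
            IsImage = true,
            X = ctm[Matrix.I31],
            Y = ctm[Matrix.I32]
        });
    }
}
```

Passing a FragmentCollector to parser.ProcessContent(pageNumber, ...) should then leave text and images in a single list, which a panel could lay out by X and Y regardless of which kind of fragment arrived first.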

If you insist on doing it manually, though...

"Could you help me understand what each constant in PdfName means?"

The meaning of the names these constants map to is defined in the PDF specification, ISO 32000-1:2008, a copy of which Adobe has made available online.

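"PdfName.RESOURCES returns ProcSet, Font and XObject. I used XObject for the images, but what are ProcSet and Font? Could Font be the style of the text?"

The contents of the page resource dictionaries are explained in section 7.8.3 of the specification. In short, Font maps the font names used in the content stream to font dictionaries, XObject maps names to external objects such as images and form XObjects, and ProcSet is a legacy array of procedure set names.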
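For the manual route, a page's resource dictionary can be inspected roughly as follows. This is only a sketch: ResourceLister and ListResources are hypothetical names, and the code merely prints what the XObject and Font entries contain. It does not extract text, which lives in the page content stream rather than in the resources.

```csharp
using System;
using iTextSharp.text.pdf;

public static class ResourceLister
{
    // Hypothetical helper: prints the XObject and Font entries of one page's resources.
    public static void ListResources(PdfReader reader, int pageNumber)
    {
        PdfDictionary page = reader.GetPageN(pageNumber);
        PdfDictionary resources = page.GetAsDict(PdfName.RESOURCES);
        if (resources == null) return;

        PdfDictionary xobjects = resources.GetAsDict(PdfName.XOBJECT);
        if (xobjects != null)
        {
            foreach (PdfName name in xobjects.Keys)
            {
                PdfStream stream = xobjects.GetAsStream(name);
                if (stream == null) continue;
                // Subtype is /Image for bitmaps and /Form for form XObjects.
                PdfName subtype = stream.GetAsName(PdfName.SUBTYPE);
                Console.WriteLine("XObject {0}: {1}", name, subtype);
            }
        }

        PdfDictionary fonts = resources.GetAsDict(PdfName.FONT);
        if (fonts != null)
        {
            foreach (PdfName name in fonts.Keys)
            {
                PdfDictionary font = fonts.GetAsDict(name);
                if (font == null) continue;
                // BaseFont holds the font's name; it may be absent for some font types.
                Console.WriteLine("Font {0}: {1}", name, font.GetAsName(PdfName.BASEFONT));
            }
        }
    }
}
```

The font entries only describe which fonts the page uses. The text operators that reference them sit in the content stream, which is why the parser API above is usually the easier route.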

"Is there a PdfName.TEXT for retrieving text?"

How text is represented in page content streams and XObjects is described in section 9 of the specification.



Source: https://stackoverflow.com/questions/17645840/description-on-each-constans-specified-in-pdfname-since-i-need-to-be-able-to-re
