iText7 C# .net core extract images from pdf document

给你一囗甜甜゛ submitted on 2019-12-10 12:04:37

Question


I know similar questions have been asked before, however, they are hideously out of date (some going back to 2006).

I have a .NET 3.5 app (with iTextSharp 5) that I am converting to .NET Core (iText 7). It extracts signatures from FedEx tracking documents, which arrive as a byte[] via a SOAP service. This code has worked very well for many years with minor updates. The PDF document returned from FedEx contains a couple of images; the signature block is not the 110x46 image (that one is the FedEx logo, which is why I skip over it).

PdfReader pdf = new PdfReader(FedexData);

for(Int32 iPage = 1; iPage <= pdf.NumberOfPages; iPage++)
{
   PdfDictionary pg = pdf.GetPageN(iPage);
   PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
   PdfDictionary xobj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));

   foreach(PdfName name in xobj.Keys)
   {
      PdfObject obj = xobj.Get(name);

      if(obj.IsIndirect())
      {
          PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
          String width = tg.Get(PdfName.WIDTH).ToString();
          String height = tg.Get(PdfName.HEIGHT).ToString();
          String decode = tg.Contains(PdfName.DECODEPARMS) ? tg.Get(PdfName.DECODEPARMS).ToString() : "";
          String bitspercomponent = tg.Contains(PdfName.BITSPERCOMPONENT) ? tg.Get(PdfName.BITSPERCOMPONENT).ToString() : "";
          String colorspace = tg.Contains(PdfName.COLORSPACE) ? tg.Get(PdfName.COLORSPACE).ToString() : "";
          if(width != "110" && height != "46" && bitspercomponent != "1")
          {
                ImageRenderInfo imgRI = ImageRenderInfo.CreateForXObject(new GraphicsState(), (PRIndirectReference)obj, tg);
                PdfImageObject image = imgRI.GetImage();
                Image dotnetImg = image.GetDrawingImage();

                if(dotnetImg != null)
                {
                    // process image and update database
                }
          }
      }
   }
}

Suffice it to say, this code doesn't work with iText 7. I attempted to port some of it, but I do not seem to be getting the images, so I'm clearly doing something wrong. It comes down to my own ignorance of the iText 7 API, which does not seem to offer backward compatibility with the older library.

Can someone point me to a tutorial for iText 7 that deals with extracting the images stored in a PDF file? I have found tutorials on how to export a PDF as an image (not what I want) and how to store images in a PDF document (the opposite of what I want), while the answers to similar questions are based on older libraries that no longer work.

Thanks, Vin


Answer 1:


Here is a Java implementation of an IEventListener which you can use to access all images from a specific page:

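import com.itextpdf.kernel.pdf.PdfName;
import com.itextpdf.kernel.pdf.PdfString;
import com.itextpdf.kernel.pdf.canvas.parser.EventType;
import com.itextpdf.kernel.pdf.canvas.parser.data.IEventData;
import com.itextpdf.kernel.pdf.canvas.parser.data.ImageRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.listener.IEventListener;
import com.itextpdf.kernel.pdf.xobject.PdfImageXObject;

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Set;

public class MyImageRenderListener implements IEventListener {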

    protected String path;
    protected String extension;

    public MyImageRenderListener(String path) {
        this.path = path;
    }

    public void eventOccurred(IEventData data, EventType type) {
        switch (type) {
            case RENDER_IMAGE:
                try {
                    String filename;
                    FileOutputStream os;
                    ImageRenderInfo renderInfo = (ImageRenderInfo) data;
                    PdfImageXObject image = renderInfo.getImage();
                    if (image == null) {
                        return;
                    }

                    // You can access various values from the image dictionary here.
                    // Note: getAsString returns null when the entry is not a PDF string.
                    PdfString decodeParamsPdfStr = image.getPdfObject().getAsString(PdfName.DecodeParms);
                    String decodeParams = decodeParamsPdfStr != null ? decodeParamsPdfStr.toUnicodeString() : null;

                    byte[] imageByte = image.getImageBytes(true);
                    extension = image.identifyImageFileExtension();
                    // You can use raw image bytes directly, or write image to disk
                    filename = String.format(path, image.getPdfObject().getIndirectReference().getObjNumber(), extension);
                    os = new FileOutputStream(filename);
                    os.write(imageByte);
                    os.flush();
                    os.close();
                } catch (com.itextpdf.io.IOException | IOException e) {
                    System.out.println(e.getMessage());
                }
                break;

            default:
                break;
        }
    }

    public Set<EventType> getSupportedEvents() {
        return null;
    }
}

I've commented some of the parts that may be of interest to you.
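
To carry over the question's "skip the 110x46 FedEx logo" check, the listener could test the image dimensions before writing anything. The following is only a minimal sketch of such a predicate in plain Java, with no iText dependency. The class name SignatureImageFilter and its methods are hypothetical, and it assumes the listener would pass in values read from the image (for example via image.getWidth() and image.getHeight()). Unlike the && chain in the question, it rejects an image only when both dimensions match the logo, and it ignores bits per component.

```java
// Hypothetical helper, not part of iText: decides whether an image
// has the dimensions of the FedEx logo described in the question.
final class SignatureImageFilter {
    // Logo size taken from the question (110x46).
    static final int LOGO_WIDTH = 110;
    static final int LOGO_HEIGHT = 46;

    // True only when both dimensions match the logo size.
    static boolean isLogo(float width, float height) {
        return Math.round(width) == LOGO_WIDTH && Math.round(height) == LOGO_HEIGHT;
    }

    // Anything that is not the logo is treated as a candidate signature image.
    static boolean isCandidate(float width, float height) {
        return !isLogo(width, height);
    }

    public static void main(String[] args) {
        System.out.println(isCandidate(110f, 46f)); // logo dimensions
        System.out.println(isCandidate(300f, 80f)); // some other image
    }
}
```

Inside the RENDER_IMAGE branch, a call such as isCandidate(...) placed right after the null check would let the listener return early for the logo.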

And here is the code that actually invokes the processor for all pages, or for any pages of interest:

PdfDocument pdfDoc = new PdfDocument(new PdfReader(src));
IEventListener listener = new MyImageRenderListener(outPath);
PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
for (int i = 1; i <= pdfDoc.getNumberOfPages(); i++) {
    parser.processPageContent(pdfDoc.getPage(i));
}
pdfDoc.close();
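
Note that the listener passes its path argument to String.format together with the object number and the detected extension, so outPath has to be a format pattern with two placeholders rather than a plain directory name. A minimal sketch of how such a pattern expands, in plain Java; the class name ImageFileNames and the pattern "extracted/image-%s.%s" are assumptions made for illustration only:

```java
import java.util.Locale;

// Hypothetical helper showing how the listener's format pattern expands.
final class ImageFileNames {
    // Assumed pattern: first %s is the PDF object number, second is the extension.
    static final String DEFAULT_PATTERN = "extracted/image-%s.%s";

    // Mirrors the String.format call in the listener.
    static String build(String pattern, int objectNumber, String extension) {
        return String.format(Locale.ROOT, pattern, objectNumber, extension);
    }

    public static void main(String[] args) {
        // prints "extracted/image-12.png"
        System.out.println(build(DEFAULT_PATTERN, 12, "png"));
    }
}
```

With a pattern like this, each image lands in its own file named after its object number, and the target directory has to exist before FileOutputStream is opened.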


Source: https://stackoverflow.com/questions/54936031/itext7-c-sharp-net-core-extract-images-from-pdf-document
