Extract Image and its name from pdf using iTextSharp

不羁岁月, submitted on 2020-01-04 05:35:15

Question


I am using iTextSharp (C#) to extract images and their names from a catalog PDF. I am able to extract the images, but I am struggling to extract the name that belongs to each image (as shown in the attached screenshot) and to save the file under that name. Please find the code below and let me know your suggestions. Sample PDF: https://docdro.id/PwBsNR9

Code:

private static List<System.Drawing.Image> ExtractImages(String PDFSourcePath)
{
    List<System.Drawing.Image> ImgList = new List<System.Drawing.Image>();

    iTextSharp.text.pdf.RandomAccessFileOrArray RAFObj = null;
    iTextSharp.text.pdf.PdfReader PDFReaderObj = null;
    iTextSharp.text.pdf.PdfObject PDFObj = null;
    iTextSharp.text.pdf.PdfStream PDFStremObj = null;

    try
    {
        RAFObj = new iTextSharp.text.pdf.RandomAccessFileOrArray(PDFSourcePath);
        PDFReaderObj = new iTextSharp.text.pdf.PdfReader(RAFObj, null);

        for (int i = 0; i <= PDFReaderObj.XrefSize - 1; i++)
        {
            PDFObj = PDFReaderObj.GetPdfObject(i);

            if ((PDFObj != null) && PDFObj.IsStream())
            {
                PDFStremObj = (iTextSharp.text.pdf.PdfStream)PDFObj;
                iTextSharp.text.pdf.PdfObject subtype = PDFStremObj.Get(iTextSharp.text.pdf.PdfName.SUBTYPE);
                if ((subtype != null) && subtype.ToString() == iTextSharp.text.pdf.PdfName.IMAGE.ToString())
                {
                    try
                    {

                        iTextSharp.text.pdf.parser.PdfImageObject PdfImageObj =
                            new iTextSharp.text.pdf.parser.PdfImageObject((iTextSharp.text.pdf.PRStream)PDFStremObj);

                        System.Drawing.Image ImgPDF = PdfImageObj.GetDrawingImage();
                        ImgList.Add(ImgPDF);

                    }
                    catch (Exception)
                    {
                        // Skip images that cannot be decoded into a System.Drawing.Image
                    }
                }
            }
        }
    }
    finally
    {
        // Always close the reader; exceptions propagate with their original type and stack trace
        if (PDFReaderObj != null)
            PDFReaderObj.Close();
    }
    return ImgList;
}


Answer 1:


Unfortunately, the example PDF is not tagged. One therefore has to associate title text and image by other means, either by analyzing their positions relative to each other or by exploiting a pattern in the content streams.

In the case at hand, analyzing the relative positions is feasible because the title is always either drawn (at least partially) on the matching image or is the text right beneath it. One could therefore extract the text with its positions from a page in a first pass and the images in a second pass, looking up the title for each image among the previously extracted text in the image area or right beneath it. Alternatively, one could first extract the images with position and size and then extract the text in those areas.
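The position-based matching can be sketched independently of the PDF library. The following is only an illustration under stated assumptions: `TextChunk`, `ImageArea`, `TitleMatcher`, and the `margin` parameter are hypothetical names, and the coordinates are assumed to have been collected beforehand via the parser API (for example from `TextRenderInfo.GetBaseline()` and `ImageRenderInfo.GetImageCTM()`). PDF page coordinates grow upwards, so "beneath" means a smaller Y value.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical plain data types; in practice the values would be filled
// from the iText parser API in a first pass over the page.
public class TextChunk
{
    public string Text;
    public float X, Y;   // start of the text baseline, in page coordinates
}

public class ImageArea
{
    public float Left, Bottom, Width, Height;
}

public static class TitleMatcher
{
    // Returns the text of the first chunk whose baseline start lies inside the
    // image area or at most 'margin' units beneath it; null if there is none.
    public static string FindTitle(ImageArea image, IEnumerable<TextChunk> chunks, float margin)
    {
        foreach (TextChunk chunk in chunks)
        {
            bool withinWidth = chunk.X >= image.Left && chunk.X <= image.Left + image.Width;
            bool insideOrBeneath = chunk.Y >= image.Bottom - margin
                                && chunk.Y <= image.Bottom + image.Height;
            if (withinWidth && insideOrBeneath)
                return chunk.Text;
        }
        return null;
    }
}
```

A suitable `margin` depends on the layout of the catalog pages and would have to be determined by inspecting the document.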

But there is also a certain pattern in the content streams: the title is always drawn in a single text drawing instruction right after the corresponding image is drawn. Thus, one can also extract, in a single pass, each image together with the next drawn text as its title.

Either approach can be implemented using the iText parser API. For the latter approach, for example, one first implements a render listener that behaves as described, i.e. one that saves each image and the text following it:

using System;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

internal class ImageWithTitleRenderListener : IRenderListener
{
    int imageNumber = 0;
    String format;
    bool expectingTitle = false;

    public ImageWithTitleRenderListener(String format)
    {
        this.format = format;
    }

    public void BeginTextBlock()
    { }

    public void EndTextBlock()
    { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        if (expectingTitle)
        {
            expectingTitle = false;
            File.WriteAllText(string.Format(format, imageNumber, "txt"), renderInfo.GetText());
        }
    }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        imageNumber++;
        expectingTitle = true;

        PdfImageObject imageObject = renderInfo.GetImage();

        if (imageObject == null)
        {
            Console.WriteLine("Image {0} could not be read.", imageNumber);
        }
        else
        {
            File.WriteAllBytes(string.Format(format, imageNumber, imageObject.GetFileType()), imageObject.GetImageAsBytes());
        }
    }
}

Then one parses the document pages using that render listener:

using (PdfReader reader = new PdfReader(@"EVERMOTION ARCHMODELS VOL.78.pdf"))
{
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    ImageWithTitleRenderListener listener = new ImageWithTitleRenderListener(@"EVERMOTION ARCHMODELS VOL.78-{0:D3}.{1}");
    for (var i = 1; i <= reader.NumberOfPages; i++)
    {
        parser.ProcessContent(i, listener);
    }
}
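The listener above writes each title into a separate .txt file next to the image. If the image itself should be saved under its title, as the question asks, one possible variant buffers each image until the next text arrives. The following is a minimal sketch, not a tested solution for the sample PDF: `TitledImageRenderListener`, `outputDir`, `Flush`, and `Sanitize` are hypothetical names, the output directory is assumed to exist, and it only uses the parser calls already shown above.

```csharp
using System;
using System.IO;
using iTextSharp.text.pdf.parser;

// Hypothetical variant: keeps the most recent image in memory and writes it
// once the following text chunk (assumed to be its title) is rendered.
internal class TitledImageRenderListener : IRenderListener
{
    private readonly string outputDir;   // assumed to exist already
    private byte[] pendingBytes;          // image still waiting for its title
    private string pendingType;           // file extension reported by iText
    private int imageNumber = 0;

    public TitledImageRenderListener(string outputDir)
    {
        this.outputDir = outputDir;
    }

    public void BeginTextBlock() { }

    public void EndTextBlock() { }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        Flush(null);                      // previous image never received a title
        imageNumber++;
        PdfImageObject imageObject = renderInfo.GetImage();
        if (imageObject == null) return;  // image could not be read
        pendingBytes = imageObject.GetImageAsBytes();
        pendingType = imageObject.GetFileType();
    }

    public void RenderText(TextRenderInfo renderInfo)
    {
        Flush(renderInfo.GetText());      // no-op unless an image is pending
    }

    // Writes the pending image, if any; call once more after the last page.
    public void Flush(string title)
    {
        if (pendingBytes == null) return;
        string name = Sanitize(title);
        if (name.Length == 0) name = "image";
        // The running number keeps file names unique if titles repeat.
        string fileName = string.Format("{0:D3}-{1}.{2}", imageNumber, name, pendingType);
        File.WriteAllBytes(Path.Combine(outputDir, fileName), pendingBytes);
        pendingBytes = null;
    }

    // Replaces characters that are not allowed in file names.
    private static string Sanitize(string title)
    {
        if (string.IsNullOrWhiteSpace(title)) return "";
        foreach (char c in Path.GetInvalidFileNameChars())
            title = title.Replace(c, '_');
        return title.Trim();
    }
}
```

It would be driven like the listener above, with `parser.ProcessContent(i, listener)` for each page, followed by a final `listener.Flush(null)` after the loop so that a trailing image without a title is still written.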



Answer 2:


I hope this helps. I am doing something similar:

// existing pdf path
PdfReader reader = new PdfReader(path);
PRStream pst;
PdfImageObject pio;
PdfObject po;
// number of objects in pdf document
int n = reader.XrefSize;
//FileStream fs = null;
// set image file location
//String path = "E:/";
for (int i = 0; i < n; i++)
{
    // get the object at the index i in the objects collection
    po = reader.GetPdfObject(i);
    // object not found so continue
    if (po == null || !po.IsStream())
        continue;
    //cast object to stream
    pst = (PRStream)po;
    //get the object type
    PdfObject type = pst.Get(PdfName.SUBTYPE);
    //check if the object is the image type object
    if (type != null && type.ToString().Equals(PdfName.IMAGE.ToString()))
    {
        //get the image
        pio = new PdfImageObject(pst);
        // fs = new FileStream(path + "image" + i + ".jpg", FileMode.Create);
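        //read bytes of image in to an array
        byte[] imgdata = pio.GetImageAsBytes();
        // A MemoryStream is never a FileStream, so casting it with "as FileStream"
        // always yields null. Wrap the bytes in a stream and save it instead,
        // using a file name built from the object number.
        using (Stream stream = new MemoryStream(imgdata))
        {
            SaveStreamToFile("image" + i + "." + pio.GetFileType(), stream);
        }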
    }
}

Now you can save your stream.

public void SaveStreamToFile(string fileFullPath, Stream stream)
{
    if (stream.Length == 0) return;

    // Create the target file and copy the whole stream into it.
    // CopyTo loops until the end of the stream, whereas a single Read call
    // is not guaranteed to return all requested bytes.
    using (FileStream fileStream = System.IO.File.Create(fileFullPath))
    {
        stream.CopyTo(fileStream);
    }
}


Source: https://stackoverflow.com/questions/55197143/extract-image-and-its-name-from-pdf-using-itextsharp
