Extract image from PDF using itextsharp

Anonymous (unverified), posted 2019-12-03 02:11:02

Question:

I am trying to extract all the images from a pdf using itextsharp but can't seem to overcome this one hurdle.

The error occurs on the line System.Drawing.Image ImgPDF = System.Drawing.Image.FromStream(MS); with the message "Parameter is not valid".

I think it works when the image is a bitmap but not for any other format.

I have the following code (sorry for the length):

    private void Form1_Load(object sender, EventArgs e)
    {
        FileStream fs = File.OpenRead(@"reader.pdf");
        byte[] data = new byte[fs.Length];
        fs.Read(data, 0, (int)fs.Length);

        List<System.Drawing.Image> ImgList = new List<System.Drawing.Image>();

        iTextSharp.text.pdf.RandomAccessFileOrArray RAFObj = null;
        iTextSharp.text.pdf.PdfReader PDFReaderObj = null;
        iTextSharp.text.pdf.PdfObject PDFObj = null;
        iTextSharp.text.pdf.PdfStream PDFStremObj = null;

        try
        {
            RAFObj = new iTextSharp.text.pdf.RandomAccessFileOrArray(data);
            PDFReaderObj = new iTextSharp.text.pdf.PdfReader(RAFObj, null);

            for (int i = 0; i <= PDFReaderObj.XrefSize - 1; i++)
            {
                PDFObj = PDFReaderObj.GetPdfObject(i);

                if ((PDFObj != null) && PDFObj.IsStream())
                {
                    PDFStremObj = (iTextSharp.text.pdf.PdfStream)PDFObj;
                    iTextSharp.text.pdf.PdfObject subtype = PDFStremObj.Get(iTextSharp.text.pdf.PdfName.SUBTYPE);

                    if ((subtype != null) && subtype.ToString() == iTextSharp.text.pdf.PdfName.IMAGE.ToString())
                    {
                        byte[] bytes = iTextSharp.text.pdf.PdfReader.GetStreamBytesRaw((iTextSharp.text.pdf.PRStream)PDFStremObj);

                        if ((bytes != null))
                        {
                            try
                            {
                                System.IO.MemoryStream MS = new System.IO.MemoryStream(bytes);

                                MS.Position = 0;
                                System.Drawing.Image ImgPDF = System.Drawing.Image.FromStream(MS);

                                ImgList.Add(ImgPDF);
                            }
                            catch (Exception)
                            {
                            }
                        }
                    }
                }
            }
            PDFReaderObj.Close();
        }
        catch (Exception ex)
        {
            throw new Exception(ex.Message);
        }
    } //Form1_Load

Answer 1:

I have used this library in the past with no problems. It should be exactly what you're after.

http://www.winnovative-software.com/PdfImgExtractor.aspx



Answer 2:

Resolved...

I also got the "Parameter is not valid" exception. After a lot of work, and with the help of the link provided by der_chirurg (http://kuujinbo.info/iTextSharp/CCITTFaxDecodeExtract.aspx), I resolved it with the following code:

    using System.Drawing;
    using System.Drawing.Imaging;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;
    using Dotnet = System.Drawing.Image;

    namespace PDF_Parsing
    {
        partial class PDF_ImgExtraction
        {
            // Output path; assign it before calling ExtractImage. Every image is
            // saved to this same path, so later images overwrite earlier ones.
            string imgPath;

            private void ExtractImage(string pdfFile)
            {
                // Open the document once instead of once per page.
                PdfReader pdf = new PdfReader(pdfFile);
                try
                {
                    for (int pageNumber = 1; pageNumber <= pdf.NumberOfPages; pageNumber++)
                    {
                        PdfDictionary pg = pdf.GetPageN(pageNumber);
                        PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
                        if (res == null) continue;  // page has no resources
                        PdfDictionary xobj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
                        if (xobj == null) continue; // page has no XObjects
                        foreach (PdfName name in xobj.Keys)
                        {
                            PdfObject obj = xobj.Get(name);
                            if (obj.IsIndirect())
                            {
                                PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
                                // Skip XObjects that are not images (for example, forms).
                                if (!PdfName.IMAGE.Equals(tg.Get(PdfName.SUBTYPE))) continue;
                                string width = tg.Get(PdfName.WIDTH).ToString();
                                string height = tg.Get(PdfName.HEIGHT).ToString();
                                ImageRenderInfo imgRI = ImageRenderInfo.CreateForXObject(new Matrix(float.Parse(width), float.Parse(height)), (PRIndirectReference)obj, tg);
                                RenderImage(imgRI);
                            }
                        }
                    }
                }
                finally
                {
                    pdf.Close();
                }
            }

            private void RenderImage(ImageRenderInfo renderInfo)
            {
                PdfImageObject image = renderInfo.GetImage();
                using (Dotnet dotnetImg = image.GetDrawingImage())
                {
                    if (dotnetImg != null)
                    {
                        // Copy into a Bitmap and save it; the format follows GDI+ defaults.
                        using (Bitmap d = new Bitmap(dotnetImg))
                        {
                            d.Save(imgPath);
                        }
                    }
                }
            }
        }
    }


Answer 3:

You need to check the stream's /Filter to see what image format a given image uses. It may be a standard image format:

  • DCTDecode (jpeg)
  • JPXDecode (jpeg 2000)
  • JBIG2Decode (jbig is a B&W only format)
  • CCITTFaxDecode (fax format, PDF supports group 3 and 4)

Otherwise, you'll need to get the raw bytes (as you are doing) and build an image from the image stream's width, height, bits per component, number of color components (could be CMYK, indexed, RGB, or Something Weird), and a few other entries, as defined in section 8.9 of the ISO PDF specification (available for free).

So in some cases your code will work, but in others, it'll fail with the exception you mentioned.
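The /Filter check described above could be sketched roughly as follows. This is only an illustration, assuming iTextSharp 5.x; the class and method names (`ImageFilterInspector`, `GetFilterNames`, `LooksLikeJpegFile`) are hypothetical and not part of the library. It treats only a lone DCTDecode filter as something `Image.FromStream` can be expected to load from the raw bytes, and even that may fail for unusual JPEGs.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

// Hypothetical helper, not part of iTextSharp.
static class ImageFilterInspector
{
    // Returns the filter names of an image stream in order, e.g. "/DCTDecode".
    // /Filter may be absent, a single name, or an array of names.
    public static List<string> GetFilterNames(PdfDictionary imageStream)
    {
        var names = new List<string>();
        PdfObject filter = PdfReader.GetPdfObject(imageStream.Get(PdfName.FILTER));
        if (filter == null)
        {
            return names; // no filter: raw, uncompressed samples
        }
        if (filter.IsName())
        {
            names.Add(filter.ToString());
        }
        else if (filter.IsArray())
        {
            PdfArray array = (PdfArray)filter;
            for (int i = 0; i < array.Size; i++)
            {
                PdfName name = array.GetAsName(i);
                if (name != null)
                {
                    names.Add(name.ToString());
                }
            }
        }
        return names;
    }

    // True when the raw stream bytes are expected to be a complete JPEG file.
    // Any other filter (or filter chain) means the bytes need decoding or wrapping first.
    public static bool LooksLikeJpegFile(PdfDictionary imageStream)
    {
        List<string> names = GetFilterNames(imageStream);
        return names.Count == 1 && names[0] == PdfName.DCTDECODE.ToString();
    }
}
```

In the question's loop, a check like `ImageFilterInspector.LooksLikeJpegFile(PDFStremObj)` before calling `Image.FromStream` would skip the streams whose raw bytes are not a standalone image file instead of throwing.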

PS: When you have an exception, PLEASE include the stack trace every single time. Pretty please with sugar on top?



Answer 4:

In newer versions of iTextSharp, the first parameter of ImageRenderInfo.CreateForXObject is no longer a Matrix but a GraphicsState. @der_chirurg's approach should work. I tested it myself using the information at the following link, and it worked beautifully:

http://www.thevalvepage.com/swmonkey/2014/11/26/extract-images-from-pdf-files-using-itextsharp/



Answer 5:

To extract all images on all pages, it is not necessary to implement the different filters yourself. iTextSharp has an image renderer that saves all images in their original image type.

Just follow the approach described here: http://kuujinbo.info/iTextSharp/CCITTFaxDecodeExtract.aspx. You don't need to implement the HttpHandler.
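A minimal sketch of that render-listener approach, assuming iTextSharp 5.x, might look like the following. The names `ImageDumpListener`, `ImageDump`, `ExtractAll`, and the `image-N` file naming are hypothetical choices for illustration, and `GetImage()` can still throw for filters the library does not support.

```csharp
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical listener: writes every image the content parser reports to disk.
class ImageDumpListener : IRenderListener
{
    private readonly string outputDir; // assumed to exist already
    private int counter;

    public ImageDumpListener(string outputDir)
    {
        this.outputDir = outputDir;
    }

    public void RenderImage(ImageRenderInfo info)
    {
        PdfImageObject image = info.GetImage();
        if (image == null)
        {
            return;
        }
        // GetFileType() returns the extension iTextSharp suggests for the bytes.
        string fileName = string.Format("image-{0}.{1}", ++counter, image.GetFileType());
        File.WriteAllBytes(Path.Combine(outputDir, fileName), image.GetImageAsBytes());
    }

    // Text callbacks are not needed for image extraction.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo info) { }
}

static class ImageDump
{
    // Runs the listener over every page of the document.
    public static void ExtractAll(string pdfPath, string outputDir)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            var parser = new PdfReaderContentParser(reader);
            var listener = new ImageDumpListener(outputDir);
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                parser.ProcessContent(page, listener);
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Calling `ImageDump.ExtractAll("reader.pdf", outputFolder)` would then write one file per image, with no System.Drawing conversion involved.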



Answer 6:

I published a library on GitHub that extracts the images in a PDF and compresses them.

It could be useful when you start playing with the very powerful iTextSharp library.

Here is the link: https://github.com/rock-walker/PdfCompression



Answer 7:

This works for me and I think it's a simple solution:

Write a custom RenderListener and implement its RenderImage method, something like this:

    public void RenderImage(ImageRenderInfo info)
    {
        PdfImageObject image = info.GetImage();
        Parser.Matrix matrix = info.GetImageCTM();
        var fileType = image.GetFileType();
        ImageFormat format;
        switch (fileType)
        {
            // you may add more types here
            case "jpg":
            case "jpeg":
                format = ImageFormat.Jpeg;
                break;
            case "png":
                format = ImageFormat.Png;
                break;
            case "bmp":
                format = ImageFormat.Bmp;
                break;
            case "tif":
            case "tiff":
                format = ImageFormat.Tiff;
                break;
            case "gif":
                format = ImageFormat.Gif;
                break;
            default:
                format = ImageFormat.Jpeg;
                break;
        }

        var pic = image.GetDrawingImage();
        // Position and size of the image on the page, taken from its transformation matrix.
        var x = matrix[Parser.Matrix.I31];
        var y = matrix[Parser.Matrix.I32];
        var width = matrix[Parser.Matrix.I11];
        var height = matrix[Parser.Matrix.I22];
        if (x < <some value> && y < <some value>)
        {
            return; // ignore these images
        }

        pic.Save(<path and name>, format);
    }

