可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I am trying to extract all the images from a pdf using itextsharp but can't seem to overcome this one hurdle.
The error occures on the line System.Drawing.Image ImgPDF = System.Drawing.Image.FromStream(MS);
giving an error of "Parameter is not valid".
I think it works when the image is a bitmap but not of any other format.
I have this following code - sorry for the length;
private void Form1_Load(object sender, EventArgs e) { FileStream fs = File.OpenRead(@"reader.pdf"); byte[] data = new byte[fs.Length]; fs.Read(data, 0, (int)fs.Length); List<System.Drawing.Image> ImgList = new List<System.Drawing.Image>(); iTextSharp.text.pdf.RandomAccessFileOrArray RAFObj = null; iTextSharp.text.pdf.PdfReader PDFReaderObj = null; iTextSharp.text.pdf.PdfObject PDFObj = null; iTextSharp.text.pdf.PdfStream PDFStremObj = null; try { RAFObj = new iTextSharp.text.pdf.RandomAccessFileOrArray(data); PDFReaderObj = new iTextSharp.text.pdf.PdfReader(RAFObj, null); for (int i = 0; i <= PDFReaderObj.XrefSize - 1; i++) { PDFObj = PDFReaderObj.GetPdfObject(i); if ((PDFObj != null) && PDFObj.IsStream()) { PDFStremObj = (iTextSharp.text.pdf.PdfStream)PDFObj; iTextSharp.text.pdf.PdfObject subtype = PDFStremObj.Get(iTextSharp.text.pdf.PdfName.SUBTYPE); if ((subtype != null) && subtype.ToString() == iTextSharp.text.pdf.PdfName.IMAGE.ToString()) { byte[] bytes = iTextSharp.text.pdf.PdfReader.GetStreamBytesRaw((iTextSharp.text.pdf.PRStream)PDFStremObj); if ((bytes != null)) { try { System.IO.MemoryStream MS = new System.IO.MemoryStream(bytes); MS.Position = 0; System.Drawing.Image ImgPDF = System.Drawing.Image.FromStream(MS); ImgList.Add(ImgPDF); } catch (Exception) { } } } } } PDFReaderObj.Close(); } catch (Exception ex) { throw new Exception(ex.Message); } } //Form1_Load
回答1:
I have used this library in the past with no problems. It should be exactly what you're after.
http://www.winnovative-software.com/PdfImgExtractor.aspx
回答2:
Resolved...
Even I got the same exception of "Parameter is not valid" and after so much of work with the help of the link provided by der_chirurg (http://kuujinbo.info/iTextSharp/CCITTFaxDecodeExtract.aspx ) I resolved it and following is the code:
using System.Drawing; using System.Drawing.Imaging; using System.IO; using iTextSharp.text.pdf.parser; using Dotnet = System.Drawing.Image; using iTextSharp.text.pdf; namespace PDF_Parsing { partial class PDF_ImgExtraction { string imgPath; private void ExtractImage(string pdfFile) { PdfReader pdfReader = new PdfReader(files[fileIndex]); for (int pageNumber = 1; pageNumber <= pdfReader.NumberOfPages; pageNumber++) { PdfReader pdf = new PdfReader(pdfFile); PdfDictionary pg = pdf.GetPageN(pageNumber); PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES)); PdfDictionary xobj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT)); foreach (PdfName name in xobj.Keys) { PdfObject obj = xobj.Get(name); if (obj.IsIndirect()) { PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj); string width = tg.Get(PdfName.WIDTH).ToString(); string height = tg.Get(PdfName.HEIGHT).ToString(); ImageRenderInfo imgRI = ImageRenderInfo.CreateForXObject(new Matrix(float.Parse(width), float.Parse(height)), (PRIndirectReference)obj, tg); RenderImage(imgRI); } } } } private void RenderImage(ImageRenderInfo renderInfo) { PdfImageObject image = renderInfo.GetImage(); using (Dotnet dotnetImg = image.GetDrawingImage()) { if (dotnetImg != null) { using (MemoryStream ms = new MemoryStream()) { dotnetImg.Save(ms, ImageFormat.Tiff); Bitmap d = new Bitmap(dotnetImg); d.Save(imgPath); } } } } } }
回答3:
You need to check the stream's /Filter to see what image format a given image uses. It may be a standard image format:
- DCTDecode (jpeg)
- JPXDecode (jpeg 2000)
- JBIG2Decode (jbig is a B&W only format)
- CCITTFaxDecode (fax format, PDF supports group 3 and 4)
Other than that, you'll need to get the raw bytes (as you are), and build an image using the image stream's width, height, bits per component, number of color components (could be CMYK, indexed, RGB, or Something Weird), and a few others, as defined in section 8.9 of the ISO PDF SPECIFICATION (available for free).
So in some cases your code will work, but in others, it'll fail with the exception you mentioned.
PS: When you have an exception, PLEASE include the stack trace every single time. Pretty please with sugar on top?
回答4:
In newer version of iTextSharp, the 1st parameter of ImageRenderInfo.CreateForXObject
is not Matrix
anymore but GraphicsState
. @der_chirurg's approach should work. I tested myself with the information from the following link and it worked beautifully:
http://www.thevalvepage.com/swmonkey/2014/11/26/extract-images-from-pdf-files-using-itextsharp/
回答5:
To extract all Images on all Pages, it is not necessary to implement different filters. iTextSharp has an Image Renderer, which saves all Images in their original image type.
Just do the following found here: http://kuujinbo.info/iTextSharp/CCITTFaxDecodeExtract.aspx You don't need to implement HttpHandler...
回答6:
I added library on github which, extract images in PDF and compress them.
Could be useful, when you are going to start play with very powerful library ITextSharp.
Here the link: https://github.com/rock-walker/PdfCompression
回答7:
This works for me and I think it's a simple solution:
Write a custom RenderListener and implement its RenderImage method, something like this
public void RenderImage(ImageRenderInfo info) { PdfImageObject image = info.GetImage(); Parser.Matrix matrix = info.GetImageCTM(); var fileType = image.GetFileType(); ImageFormat format; switch (fileType) {//you may add more types here case "jpg": case "jpeg": format = ImageFormat.Jpeg; break; case "pnt": format = ImageFormat.Png; break; case "bmp": format = ImageFormat.Bmp; break; case "tiff": format = ImageFormat.Tiff; break; case "gif": format = ImageFormat.Gif; break; default: format = ImageFormat.Jpeg; break; } var pic = image.GetDrawingImage(); var x = matrix[Parser.Matrix.I31]; var y = matrix[Parser.Matrix.I32]; var width = matrix[Parser.Matrix.I11]; var height = matrix[Parser.Matrix.I22]; if (x < <some value> && y < <some value>) { return;//ignore these images } pic.Save(<path and name>, format); }