问题
I am looking to simply extract all images from a pdf. I found some code that looks like it is exactly what I need
Private Sub getAllImages(ByVal dict As pdf.PdfDictionary, ByVal images As List(Of Byte()), ByVal doc As pdf.PdfReader)
Dim res As pdf.PdfDictionary = CType(pdf.PdfReader.GetPdfObject(dict.Get(pdf.PdfName.RESOURCES)), pdf.PdfDictionary)
Dim xobj As pdf.PdfDictionary = CType(pdf.PdfReader.GetPdfObject(res.Get(pdf.PdfName.XOBJECT)), pdf.PdfDictionary)
If xobj IsNot Nothing Then
For Each name As pdf.PdfName In xobj.Keys
Dim obj As pdf.PdfObject = xobj.Get(name)
If (obj.IsIndirect) Then
Dim tg As pdf.PdfDictionary = CType(pdf.PdfReader.GetPdfObject(obj), pdf.PdfDictionary)
Dim subtype As pdf.PdfName = CType(pdf.PdfReader.GetPdfObject(tg.Get(pdf.PdfName.SUBTYPE)), pdf.PdfName)
If pdf.PdfName.IMAGE.Equals(subtype) Then
Dim xrefIdx As Integer = CType(obj, pdf.PRIndirectReference).Number
Dim pdfObj As pdf.PdfObject = doc.GetPdfObject(xrefIdx)
Dim str As pdf.PdfStream = CType(pdfObj, pdf.PdfStream)
Dim bytes As Byte() = pdf.PdfReader.GetStreamBytesRaw(CType(str, pdf.PRStream))
Dim filter As String = tg.Get(pdf.PdfName.FILTER).ToString
Dim width As String = tg.Get(pdf.PdfName.WIDTH).ToString
Dim height As String = tg.Get(pdf.PdfName.HEIGHT).ToString
Dim bpp As String = tg.Get(pdf.PdfName.BITSPERCOMPONENT).ToString
If filter = "/FlateDecode" Then
bytes = pdf.PdfReader.FlateDecode(bytes, True)
Dim pixelFormat As System.Drawing.Imaging.PixelFormat
Select Case Integer.Parse(bpp)
Case 1
pixelFormat = Drawing.Imaging.PixelFormat.Format1bppIndexed
Case 24
pixelFormat = Drawing.Imaging.PixelFormat.Format24bppRgb
Case Else
Throw New Exception("Unknown pixel format " + bpp)
End Select
Dim bmp As New System.Drawing.Bitmap(Int32.Parse(width), Int32.Parse(height), pixelFormat)
Dim bmd As System.Drawing.Imaging.BitmapData = bmp.LockBits(New System.Drawing.Rectangle(0, 0, Int32.Parse(width), Int32.Parse(height)), System.Drawing.Imaging.ImageLockMode.WriteOnly, pixelFormat)
Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length)
bmp.UnlockBits(bmd)
Using ms As New MemoryStream
bmp.Save(ms, System.Drawing.Imaging.ImageFormat.Png)
bytes = ms.GetBuffer
End Using
End If
images.Add(bytes)
ElseIf pdf.PdfName.FORM.Equals(subtype) Or pdf.PdfName.GROUP.Equals(subtype) Then
getAllImages(tg, images, doc)
End If
End If
Next
End If End Sub
Now my issue is simply, how can I call this, I do not know what to set the dict variable to or the images list??
So in essance if I have a PDF located at C:\temp\test.pdf that contains images, how do I call this?
Dim x As New FileStream("C:\image\test.pdf", FileMode.Open)
Dim reader As New iTextSharp.text.pdf.PdfReader(x)
getAllImages(?????, ?????? ,reader)
回答1:
The way this person wrote this method can seem weird if you don't understand the internals of PDFs and/or iTextSharp. The method takes three parameters, the first is a PdfDictionary
which you obtain by calling GetPageN(Integer)
on each of your pages. The second is a generic list which you need to init on your own before calling this. This method is intended to be called in a loop for each page in a PDF and each call will append images to this list. The last parameter you understand already.
So here's the code to call this method:
''//Source file to read images from
Dim InputFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "FileWithImages.pdf")
''//List to dump images into
Dim Images As New List(Of Byte())
''//Main PDF reader
Dim Reader As New PdfReader(InputFile)
''//Total number of pages in the PDF
Dim PageCount = Reader.NumberOfPages
''//Loop through each page (first page is one, not zero)
For I = 1 To PageCount
getAllImages(Reader.GetPageN(I), Images, Reader)
Next
VERY, VERY IMPORTANT - iTextSharp is NOT a PDF renderer, it is a PDF composer. What this means is that it knows it has image-like objects but it doesn't necessarily know much about them. To say it another way, iTextSharp knows that a given byte array represents something that the PDF standard says is an image but it doesn't know or care if its a JPEG, TIFF, BMP or something else. All iTextSharp cares about is that this object has a few standard properties it can manipulate like X,Y and effective width and height. PDF renderers will handle the job of converting the bytes to an actual image. In this can, you are the PDF renderer so its your job to figure out how to process the byte array as an image.
Specifically, you'll see in that method that there's a line that reads:
If filter = "/FlateDecode" Then
This is often written as a select case
or switch
statement to process the various values of filter
. The method you are referencing only handles FlateDecode
which is pretty common although there are actually 10 standard filters such as CCITTFaxDecode
, JBIG2Decode
and DCTDecode
(PDF Spec 7.4 - Filters). You should modify the method to include a catch of some sort (an Else
or Default
case) so that you are at least aware of images you aren't set up to process.
Additionally, within the /FlatDecode
section you'll see this line:
Select Case Integer.Parse(bpp)
This is reading an attribute associated with the image object that tells the renderer how many bits should be used for each color when parsing. Once again, you are the PDF renderer in this case so its up to you to figure out what to do. The code that you referenced only accounts for monochrome (1 bpp) or truecolor (24 bpp) images but others should definitely be accounted for, especially 8 bpp.
So summing this up, hopefully the code works for you as is, but don't be surprised if it complains a lot and/or misses images. Extracting images can actually be very frustrating at times. If you do run into problems start a new question here referencing this one and hopefully we can help you more!
来源:https://stackoverflow.com/questions/9237867/itextsharp-in-net-extract-image-code-i-cant-get-this-to-work