Extract Embedded XML from PDF with iTextSharp (C#)

Question

I need to extract XML data embedded in bankruptcy court files with C#. In a PDF reader the file looks like a typical court document. In Notepad the XML is buried in the text. I've tried extracting the text with this and another code snippet using SimpleTextExtractionStrategy. The first produces a file with no identifiable text from the PDF, and the second outputs symbols. I also tried accessing it as AcroFields and as an XfaForm. Based on the Watch window, it doesn't seem to be either of those.

Stepping through the code in Visual Studio, the XML shows up under PdfReader >> Catalog >> Keys >> Raw >> Non-Public Members >> dictionary in the Watch window. I have no idea how to get to it, though. Since it's listed with other PdfName entries in Watch, I thought I might be able to access it via PdfReader.Catalog.GetAsDict, but it doesn't display as a PdfName. The provider of these files has a Java app that seems to just read the text. I'm not sure whether I need a different extraction strategy or need to directly access the catalog item containing the XML. I've never worked programmatically with PDF files or iTextSharp, so I'm struggling. Any code suggestions?

Answer 1:

It would help if you could share a PDF with an embedded XML. When I first read your question, I assumed that the XML would have been added as a document-level attachment (stored in EmbeddedFiles) or as an attachment annotation (stored in an Annot added to a page dictionary).

Reading what is written on uscourts.gov, it looks as if the XML is actually an XMP stream. That would mean that you can find it in the Metadata entry of the Catalog (or maybe in a page dictionary).
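If the XML really were an XMP stream in the catalog, something like the following minimal sketch could dump it. It assumes iTextSharp 5.x, takes the PDF path as the first command-line argument, and assumes the XMP packet is UTF-8 encoded (the usual case, but not guaranteed). The class name XmpDump is made up for illustration.

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;

class XmpDump
{
    static void Main(string[] args)
    {
        // args[0]: path to the PDF to inspect (hypothetical input)
        PdfReader reader = new PdfReader(args[0]);
        try
        {
            // Metadata returns the bytes of the catalog's /Metadata stream,
            // or null when the catalog has no such stream.
            byte[] xmp = reader.Metadata;
            if (xmp == null)
                Console.WriteLine("No catalog-level XMP metadata found.");
            else
                // Assumption: the XMP packet is UTF-8 encoded.
                Console.WriteLine(Encoding.UTF8.GetString(xmp));
        }
        finally
        {
            reader.Close();
        }
    }
}
```

This only covers a Metadata stream on the catalog; a Metadata entry on a page dictionary would have to be looked up separately.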

If you cannot share the file, you will have to help yourself. You can do this by downloading iText RUPS. It is a free tool to look inside a PDF.

Browse the tree structure and look for Metadata, look for EmbeddedFiles, look for Annots. If you don't tell us how the XML is embedded, nobody will be able to help you.
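If you prefer to do the same inspection from code rather than in RUPS, a rough sketch along these lines lists every catalog key and reports whether the three usual places exist. It assumes iTextSharp 5.x and a PDF path passed as the first argument. The class name CatalogInspector is hypothetical.

```csharp
using System;
using iTextSharp.text.pdf;

class CatalogInspector
{
    static void Main(string[] args)
    {
        // args[0]: path to the PDF to inspect (hypothetical input)
        PdfReader reader = new PdfReader(args[0]);
        try
        {
            PdfDictionary catalog = reader.Catalog;

            // List every catalog entry; custom, non-standard keys show up here too.
            foreach (PdfName key in catalog.Keys)
            {
                // Resolve indirect references so the real object type is reported.
                PdfObject value = PdfReader.GetPdfObject(catalog.Get(key));
                Console.WriteLine("{0} -> {1}", key,
                    value == null ? "null" : value.GetType().Name);
            }

            // XMP metadata lives under Catalog > Metadata.
            Console.WriteLine("Metadata: {0}", catalog.Contains(PdfName.METADATA));

            // Document-level attachments live under Catalog > Names > EmbeddedFiles.
            PdfDictionary names = catalog.GetAsDict(PdfName.NAMES);
            Console.WriteLine("EmbeddedFiles: {0}",
                names != null && names.Contains(PdfName.EMBEDDEDFILES));

            // Attachment annotations live in the Annots array of a page dictionary.
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                PdfArray annots = reader.GetPageN(i).GetAsArray(PdfName.ANNOTS);
                if (annots != null)
                    Console.WriteLine("Page {0}: {1} annotation(s)", i, annots.Size);
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

An entry that prints as PdfString rather than as a stream or dictionary is a hint that data was stored directly as a string value.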

See my answer to the following question for an example: How to delete attachment of PDF using itext (look at how I use RUPS to look at the Catalog > Names > EmbeddedFiles).

Extra notes: the code you've tried so far is about extracting text from a page, NOT about extracting an XML file that is embedded inside a PDF.

Update:

Now that you've shared a file, I've used RUPS to find the XML file. Take a look at the following screenshot:

[Screenshot: RUPS view of the document catalog]

Do you see what happened here? Somebody added a custom entry named /USCTbankruptcynotice with a String as value straight to the catalog. That is so wrong: it is such a bad idea to store a file inside a string. Why didn't that developer store that file as a stream? I feel so sad for the person who employs such a developer.

This being said, this is how you can extract the XML:

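// "reader" is an already opened PdfReader for the court PDF
PdfDictionary catalog = reader.Catalog;
// Custom, non-standard catalog entry that holds the XML as a string
PdfName name = new PdfName("USCTbankruptcynotice");
PdfString USCTbankruptcynotice = catalog.GetAsString(name);
// ToUnicodeString() decodes the bytes (UTF-16 with BOM, otherwise PDFDocEncoding)
string xml = USCTbankruptcynotice.ToUnicodeString();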

This is written from memory. Please update my answer if you need to apply small corrections.
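For completeness, here is a self-contained sketch of the same idea that opens the file, handles a missing entry, and writes the XML to disk. It assumes iTextSharp 5.x; the input and output paths come from the command line, and the names NoticeXmlExtractor and ExtractNoticeXml are invented for this example. Only the key name USCTbankruptcynotice comes from the file discussed above.

```csharp
using System;
using System.IO;
using iTextSharp.text.pdf;

class NoticeXmlExtractor
{
    // Returns the XML stored under the custom catalog key,
    // or null if the key is absent or its value is not a string.
    static string ExtractNoticeXml(string pdfPath)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            PdfName key = new PdfName("USCTbankruptcynotice");
            // GetAsString resolves indirect references and returns null
            // when the entry is missing or is not a PdfString.
            PdfString value = reader.Catalog.GetAsString(key);
            return value == null ? null : value.ToUnicodeString();
        }
        finally
        {
            reader.Close();
        }
    }

    static void Main(string[] args)
    {
        // args[0]: input PDF, args[1]: output XML file (both hypothetical)
        string xml = ExtractNoticeXml(args[0]);
        if (xml == null)
            Console.WriteLine("No /USCTbankruptcynotice string found in the catalog.");
        else
            File.WriteAllText(args[1], xml);
    }
}
```

ToUnicodeString() is used here instead of ToString() so that a string stored as UTF-16 is decoded properly; if the XML turns out to use a different encoding, the raw bytes from GetBytes() may be the better starting point.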

Source: https://stackoverflow.com/questions/28304006/extract-embedded-xml-from-pdf-with-itextsharp-c

Tags

pdf

itextsharp