Question
I am using iText to recreate the Tag Tree feature of Acrobat.
So far I have managed to get the tag structure.
The final thing I am trying to figure out is how to get & decode the "Marked Content" for a tag from the content stream.
Edit: added purpose
The intent of this question is to figure out how to access the content streams, with a mcid, and decode the content.
Edit 2: Add iText RUPS reference
The image below shows how far I have got in the tree. The red line points to an MCID; I am trying to get its content.
Edit 3: Add current code that builds a tree
private void manipulate(PdfDictionary element, ItemCollection items)
{
    if (element == null)
    {
        return;
    }
    ICollection<PdfName> val = element.KeySet();
    PdfObject tagName = element.Get(PdfName.S);
    PdfObject elementType = element.Get(PdfName.Type);
    // Prefer the structure type (/S); fall back to /Type, which may also be absent.
    string tn = "";
    if (tagName != null)
    {
        tn = ((PdfName)tagName).GetValue();
    }
    else if (elementType != null)
    {
        tn = ((PdfName)elementType).GetValue();
    }
    TreeViewItem tvI = new TreeViewItem() { Header = tn, IsExpanded = true };
    items.Add(tvI);
    PdfArray kids = element.GetAsArray(PdfName.K);
    if (kids == null)
    {
        return;
    }
    for (int i = 0; i < kids.Size(); i++)
    {
        // Code change required here to detect an MCID and get its content:
        // this line returns null when the child is an MCID.
        PdfDictionary child = kids.GetAsDictionary(i);
        manipulate(child, tvI.Items);
    }
}
Edit 4: Reason for this is to recreate the "Tag Tree" feature of Acrobat.
Answer 1:
Based on the tags you added to the question, I see that you are using iText 7. iText 7 has a class named TaggedPdfReaderTool. This class can be used to convert Tagged PDF files to XML:
FileOutputStream outXml = new FileOutputStream("pdf_content.xml");
TaggedPdfReaderTool tool = new TaggedPdfReaderTool(document);
tool.setRootTag("root");
tool.convertToXml(outXml);
outXml.close();
The XML will have the same structure as the "tag structure" you were already able to extract. The content inside the XML tags will correspond to the content that is marked as "part of a tag" in the PDF content stream.
Important message to other readers: the screenshot in the question clearly shows that the PDF is tagged. If you try this code snippet on a PDF that isn't tagged, you won't be able to convert the content to XML.
Update: lower level approach
You can also examine all the parts of the structure tree like this: process(document.getStructTreeRoot());
Where the process() method looks like this:
public static void process(IPdfStructElem elem) {
    if (elem == null) return;
    System.out.println(elem.getRole());
    System.out.println(elem.getClass().getName());
    if (elem instanceof PdfStructElem) {
        processStructElem((PdfStructElem) elem);
    }
    if (elem.getKids() == null) return;
    for (IPdfStructElem structElem : elem.getKids()) {
        process(structElem);
    }
}

public static void processStructElem(PdfStructElem elem) {
    PdfDictionary page = elem.getPdfObject().getAsDictionary(PdfName.Pg);
    if (page == null) return;
    PdfStream contents = page.getAsStream(PdfName.Contents);
    if (contents != null) {
        System.out.println(new String(contents.getBytes()));
    }
    PdfArray array = page.getAsArray(PdfName.Contents);
    System.out.println(array);
}
Note that the /Contents of a page can refer to a single stream, or to an array of streams. In this short snippet, I ignored all /Contents stored in an array of streams.
This is an example of the content that was revealed when executing this on a tagged PDF we use for tests:
EMC
/Artifact BMC
q
0.01961 0.33333 0.52941 rg
36 432.34 184.23 27.98 re
f
Q
EMC
/Span <</MCID 13>> BDC
q
BT
/F2 12 Tf
42 442.65 Td
1 1 1 rg
(The Library)Tj
ET
Q
EMC
/Artifact BMC
q
0.01961 0.33333 0.52941 rg
36 399.11 184.23 27.98 re
f
Q
EMC
/Span <</MCID 14>> BDC
q
BT
/F2 12 Tf
42 409.42 Td
1 1 1 rg
(The Company)Tj
ET
Q
EMC
/Span <</MCID 15>> BDC
q
BT
/F1 20 Tf
227.73 472.71 Td
(The Library)Tj
ET
Q
EMC
/Span <</MCID 16>> BDC
q
BT
/F2 12 Tf
229.23 440.45 Td
(iText is a software developer toolkit that allows users to integrate PDF)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 17>> BDC
q
BT
/F2 12 Tf
229.23 424.46 Td
(functionalities within their applications, processes or products.)Tj
ET
Q
EMC
/Artifact BMC
q
0.01961 0.33333 0.52941 rg
605.03 262.75 191.73 235.31 re
f
Q
EMC
/Span <</MCID 18>> BDC
q
BT
/F1 16 Tf
676.45 482.5 Td
0.97647 0.76078 0.15294 rg
(What?)Tj
ET
Q
EMC
/Span <</MCID 19>> BDC
q
BT
/F2 12 Tf
607.94 453.08 Td
1 1 1 rg
(iText is a software developer toolkit)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 20>> BDC
q
BT
/F2 12 Tf
611.61 437.09 Td
1 1 1 rg
(that allows users to integrate PDF)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 21>> BDC
q
BT
/F2 12 Tf
634.95 421.11 Td
1 1 1 rg
(functionalities within their)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 22>> BDC
q
BT
/F2 12 Tf
669.96 405.12 Td
1 1 1 rg
(applications)Tj
ET
Q
EMC
/Span <</MCID 23>> BDC
q
BT
/F1 16 Tf
679.12 381.5 Td
0.97647 0.76078 0.15294 rg
(How?)Tj
ET
Q
EMC
/Span <</MCID 24>> BDC
q
BT
/F2 12 Tf
613.94 352.08 Td
1 1 1 rg
(By providing you with the tools to)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 25>> BDC
q
BT
/F2 12 Tf
607.59 336.09 Td
1 1 1 rg
(create and manipulate a pdf in your)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 26>> BDC
q
BT
/F2 12 Tf
668.96 320.11 Td
1 1 1 rg
(source code)Tj
ET
Q
EMC
/Span <</MCID 27>> BDC
q
BT
/F1 16 Tf
672.44 296.49 Td
0.97647 0.76078 0.15294 rg
(Really?)Tj
ET
Q
EMC
/Span <</MCID 28>> BDC
q
BT
/F2 12 Tf
673.64 267.06 Td
1 1 1 rg
(Yes really!)Tj
ET
Q
EMC
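Everything that is not between BMC/EMC or BDC/EMC operators is not tagged. You are looking for the content that is marked with an MCID, that is, the sequences that open with a property list such as <</MCID 13>> followed by BDC and close with EMC.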
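To make the relation between an MCID and its content concrete, here is a minimal sketch in plain Java that scans a content stream like the one above and collects the string operands of Tj per MCID. It is not iText's parser: the class name McidTextSketch and the method extract are made up for illustration, and the regular expressions assume non-nested marked content, literal (...) strings shown with Tj only (no TJ arrays, no hex strings), and no "EMC" inside a string. The strings in a content stream are character codes in the font's encoding, so they are only readable as-is for simple encodings such as the one in the test file above. A real implementation should rely on a proper content stream parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: maps each MCID to the text shown inside its BDC ... EMC sequence. */
final class McidTextSketch {
    // Matches "<</MCID n>> BDC ... EMC" (assumed non-nested); captures n and the body.
    private static final Pattern MARKED = Pattern.compile(
            "<<\\s*/MCID\\s+(\\d+)\\s*>>\\s*BDC(.*?)EMC", Pattern.DOTALL);
    // Matches a literal string operand of Tj, e.g. "(The Library)Tj".
    private static final Pattern SHOW_TEXT = Pattern.compile(
            "\\(((?:\\\\.|[^\\\\()])*)\\)\\s*Tj");

    static Map<Integer, String> extract(String contentStream) {
        Map<Integer, String> textByMcid = new LinkedHashMap<>();
        Matcher marked = MARKED.matcher(contentStream);
        while (marked.find()) {
            int mcid = Integer.parseInt(marked.group(1));
            StringBuilder text = new StringBuilder();
            Matcher show = SHOW_TEXT.matcher(marked.group(2));
            while (show.find()) {
                // Undo the escapes \( \) and \\; other escapes are left untouched.
                text.append(show.group(1).replaceAll("\\\\([()\\\\])", "$1"));
            }
            // The same MCID may appear more than once; append in stream order.
            textByMcid.merge(mcid, text.toString(), String::concat);
        }
        return textByMcid;
    }

    public static void main(String[] args) {
        String sample = "/Artifact BMC\nq\n36 432.34 184.23 27.98 re\nf\nQ\nEMC\n"
                + "/Span <</MCID 13>> BDC\nq\nBT\n/F2 12 Tf\n(The Library)Tj\nET\nQ\nEMC\n"
                + "/Span <</MCID 14>> BDC\nq\nBT\n/F2 12 Tf\n(The Company)Tj\nET\nQ\nEMC\n";
        extract(sample).forEach((mcid, text) -> System.out.println(mcid + " -> " + text));
    }
}
```

The /Artifact BMC ... EMC sequences carry no MCID, so the sketch skips them, which matches the fact that artifacts are not part of the structure tree.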
In a comment, I explain that it's better to use a different approach. It is better to parse the content streams of every page (only once) and map all objects you encounter with the elements in the structure tree.
With your approach, you have to parse the content stream of a page over and over again for every structure element. That requires much more processing.
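The "parse every page only once" idea can be sketched as a small cache keyed by page. The sketch below is plain Java, not iText API: PageMcidIndex, pageParser, textFor and parseCount are hypothetical names, and pageParser stands in for whatever code turns one page's content stream(s) into an MCID-to-content map. While walking the structure tree, each marked-content kid is then resolved with a lookup instead of a new parse.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

/** Illustrative only: caches the MCID-to-text map of each page so a page is parsed once. */
final class PageMcidIndex {
    // Hypothetical hook: parses the content of the given page into MCID -> text.
    private final IntFunction<Map<Integer, String>> pageParser;
    private final Map<Integer, Map<Integer, String>> byPage = new HashMap<>();
    private int parseCount = 0;

    PageMcidIndex(IntFunction<Map<Integer, String>> pageParser) {
        this.pageParser = pageParser;
    }

    /** Returns the text for (page, MCID), parsing the page only on first access. */
    String textFor(int pageNumber, int mcid) {
        Map<Integer, String> textByMcid = byPage.get(pageNumber);
        if (textByMcid == null) {
            parseCount++;
            textByMcid = pageParser.apply(pageNumber);
            byPage.put(pageNumber, textByMcid);
        }
        return textByMcid.get(mcid);
    }

    /** Number of pages parsed so far; only here to make the saving visible. */
    int parseCount() {
        return parseCount;
    }

    public static void main(String[] args) {
        // Stand-in parser that returns fixed data instead of reading a PDF.
        PageMcidIndex index = new PageMcidIndex(page -> {
            Map<Integer, String> m = new HashMap<>();
            m.put(13, "The Library");
            m.put(14, "The Company");
            return m;
        });
        System.out.println(index.textFor(1, 13));
        System.out.println(index.textFor(1, 14));
        System.out.println("pages parsed: " + index.parseCount());
    }
}
```

Both lookups in main hit page 1, so the stand-in parser runs a single time; with the per-element approach it would run once per structure element.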
Source: https://stackoverflow.com/questions/45157963/get-marked-content-using-the-mcid-content