Get marked content using the MCID content

Submitted by 时光怂恿深爱的人放手 on 2020-02-08 06:18:11

Question


I am using iText to recreate the Tag Tree feature of Acrobat.

So far I have managed to get the tag structure.

The final thing I am trying to figure out is how to get & decode the "Marked Content" for a tag from the content stream.

Edit: added purpose

The intent of this question is to figure out how to access the content streams, with a mcid, and decode the content.

Edit 2: Add iText RUPS reference

The image below shows how far I have got in the tree. The red line points to an MCID; I am trying to get its content.

Edit 3: Add current code that builds a tree

private void manipulate(PdfDictionary element, ItemCollection items)
{
    if (element == null)
    {
        return;
    }

    ICollection<PdfName> val = element.KeySet();
    PdfObject tagName = element.Get(PdfName.S);
    PdfObject elementType = element.Get(PdfName.Type);

    string tn = "";

    if (tagName != null)
    {
        tn = ((PdfName)tagName).GetValue();
    }
    else
    {
        tn = ((PdfName)elementType).GetValue();
    }

    TreeViewItem tvI = new TreeViewItem() { Header = tn, IsExpanded = true };
    items.Add(tvI);

    PdfArray kids = element.GetAsArray(PdfName.K);
    if (kids == null)
    {
        return;
    }
    for (int i = 0; i < kids.Size(); i++)
    {
        // Code change required here to detect an MCID and get its content:
        // this line returns null when the child is an MCID.
        PdfDictionary child = kids.GetAsDictionary(i);
        manipulate(child, tvI.Items);
    }
}

Edit 4: Reason for this is to recreate the "Tag Tree" feature of Acrobat.


Answer 1:


Based on the tags you added to the question, I see that you are using iText 7. iText 7 has a class named TaggedPdfReaderTool. This class can be used to convert Tagged PDF files to XML:

FileOutputStream outXml = new FileOutputStream("pdf_content.xml");
TaggedPdfReaderTool tool = new TaggedPdfReaderTool(document);
tool.setRootTag("root");
tool.convertToXml(outXml);
outXml.close();

The XML will have the same structure as the "tag structure" you were already able to extract. The content inside the XML tags will correspond to the content that is marked as "part of a tag" in the PDF content stream.

Important message to other readers: the screenshot in the question clearly shows that the PDF is tagged. If you try this code snippet on a PDF that isn't tagged, you won't be able to convert the content to XML.

Update: lower level approach

You can also examine all the parts of the structure tree like this: process(document.getStructTreeRoot());

Where the process() method looks like this:

public static void process(IPdfStructElem elem) {
    if (elem == null) return;
    System.out.println(elem.getRole());
    System.out.println(elem.getClass().getName());
    if (elem instanceof PdfStructElem) {
        processStructElem((PdfStructElem) elem);
    }
    if (elem.getKids() == null) return;
    for (IPdfStructElem structElem : elem.getKids()) {
        process(structElem);
    }
}

public static void processStructElem(PdfStructElem elem) {
    PdfDictionary page = elem.getPdfObject().getAsDictionary(PdfName.Pg);
    if (page == null) return;
    PdfStream contents = page.getAsStream(PdfName.Contents);
    if (contents != null) {
        System.out.println(new String(contents.getBytes()));
    }
    PdfArray array = page.getAsArray(PdfName.Contents);
    System.out.println(array);
}

Note that the /Contents of a page can refer to a single stream, or to an array of streams. In this short snippet, I ignored all /Contents stored in an array of streams.
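To cover the array case as well, one option is to join the decoded streams before parsing them. The PDF specification treats a /Contents array as if its streams were concatenated in order, and a split may only fall between lexical tokens, so putting whitespace between the parts is safe. The following is a minimal sketch using only the JDK; ContentStreamJoiner and join are hypothetical names, and it assumes you have already obtained the decoded bytes of each stream (for example with getBytes(), as in the snippet above).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of iText.
final class ContentStreamJoiner {

    // Joins the decoded content streams of one page, in order.
    // A newline is written after each part so that the last token of one
    // stream can never fuse with the first token of the next.
    static byte[] join(List<byte[]> decodedStreams) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] part : decodedStreams) {
            out.write(part, 0, part.length);
            out.write('\n');
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Two fragments standing in for two entries of a /Contents array.
        byte[] first = "/Span <</MCID 13>> BDC\nBT".getBytes(StandardCharsets.ISO_8859_1);
        byte[] second = "(The Library)Tj\nET\nEMC".getBytes(StandardCharsets.ISO_8859_1);
        byte[] joined = join(Arrays.asList(first, second));
        System.out.println(new String(joined, StandardCharsets.ISO_8859_1));
    }
}
```

Note that in this example the marked-content sequence opens in the first stream and closes in the second, which is why the streams have to be looked at as one unit rather than one by one.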

This is an example of the content that was revealed when executing this on a tagged PDF we use for tests:

EMC
/Artifact BMC
q
0.01961 0.33333 0.52941 rg
36 432.34 184.23 27.98 re
f
Q
EMC
/Span <</MCID 13>> BDC
q
BT
/F2 12 Tf
42 442.65 Td
1 1 1 rg
(The Library)Tj
ET
Q
EMC
/Artifact BMC
q
0.01961 0.33333 0.52941 rg
36 399.11 184.23 27.98 re
f
Q
EMC
/Span <</MCID 14>> BDC
q
BT
/F2 12 Tf
42 409.42 Td
1 1 1 rg
(The Company)Tj
ET
Q
EMC
/Span <</MCID 15>> BDC
q
BT
/F1 20 Tf
227.73 472.71 Td
(The Library)Tj
ET
Q
EMC
/Span <</MCID 16>> BDC
q
BT
/F2 12 Tf
229.23 440.45 Td
(iText is a software developer toolkit that allows users to integrate PDF)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 17>> BDC
q
BT
/F2 12 Tf
229.23 424.46 Td
(functionalities within their applications, processes or products.)Tj
ET
Q
EMC
/Artifact BMC
q
0.01961 0.33333 0.52941 rg
605.03 262.75 191.73 235.31 re
f
Q
EMC
/Span <</MCID 18>> BDC
q
BT
/F1 16 Tf
676.45 482.5 Td
0.97647 0.76078 0.15294 rg
(What?)Tj
ET
Q
EMC
/Span <</MCID 19>> BDC
q
BT
/F2 12 Tf
607.94 453.08 Td
1 1 1 rg
(iText is a software developer toolkit)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 20>> BDC
q
BT
/F2 12 Tf
611.61 437.09 Td
1 1 1 rg
(that allows users to integrate PDF)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 21>> BDC
q
BT
/F2 12 Tf
634.95 421.11 Td
1 1 1 rg
(functionalities within their)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 22>> BDC
q
BT
/F2 12 Tf
669.96 405.12 Td
1 1 1 rg
(applications)Tj
ET
Q
EMC
/Span <</MCID 23>> BDC
q
BT
/F1 16 Tf
679.12 381.5 Td
0.97647 0.76078 0.15294 rg
(How?)Tj
ET
Q
EMC
/Span <</MCID 24>> BDC
q
BT
/F2 12 Tf
613.94 352.08 Td
1 1 1 rg
(By providing you with the tools to)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 25>> BDC
q
BT
/F2 12 Tf
607.59 336.09 Td
1 1 1 rg
(create and manipulate a pdf in your)Tj
( )Tj
ET
Q
EMC
/Span <</MCID 26>> BDC
q
BT
/F2 12 Tf
668.96 320.11 Td
1 1 1 rg
(source code)Tj
ET
Q
EMC
/Span <</MCID 27>> BDC
q
BT
/F1 16 Tf
672.44 296.49 Td
0.97647 0.76078 0.15294 rg
(Really?)Tj
ET
Q
EMC
/Span <</MCID 28>> BDC
q
BT
/F2 12 Tf
673.64 267.06 Td
1 1 1 rg
(Yes really!)Tj
ET
Q
EMC

Everything that is not between BMC/EMC or BDC/EMC operators is not tagged. You are looking for the content that is marked with an MCID.

In a comment, I explain that it's better to use a different approach. It is better to parse the content streams of every page (only once) and map all objects you encounter with the elements in the structure tree.

With your approach, you have to parse the content stream of a page over and over again for every structure element. That requires much more processing.
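The "parse once, then look up by MCID" idea can be sketched as follows. This is only an illustration under strong assumptions, not how iText does it: McidTextSketch and extract are hypothetical names, the input is a content stream that has already been decoded to a string, and only literal strings shown with Tj are picked up (no TJ arrays, hex strings, nested parentheses, or MCIDs defined in a /Properties resource). The strings are also treated as readable text, which holds for the sample above but depends on the font encoding in general. In a real application you would presumably leave the tokenizing to iText's own content stream parser (PdfCanvasProcessor) and only keep the mapping step.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText.
final class McidTextSketch {

    // One alternative per token we care about:
    //   group 1: BDC with an inline property list containing /MCID n
    //   group 2: any other BDC or BMC (sequence without a known MCID)
    //   group 3: EMC
    //   group 4: a literal string followed by Tj
    private static final Pattern TOKEN = Pattern.compile(
          "<<[^>]*?/MCID\\s+(\\d+)[^>]*?>>\\s*BDC"
        + "|\\b(BDC|BMC)\\b"
        + "|\\b(EMC)\\b"
        + "|\\(((?:\\\\.|[^\\\\()])*)\\)\\s*Tj");

    // Walks the content stream once and returns MCID -> text shown inside it.
    static Map<Integer, String> extract(String content) {
        Map<Integer, StringBuilder> buffers = new LinkedHashMap<>();
        Deque<Integer> open = new ArrayDeque<>(); // -1 marks a sequence without MCID
        Matcher m = TOKEN.matcher(content);
        while (m.find()) {
            if (m.group(1) != null) {
                int mcid = Integer.parseInt(m.group(1));
                open.push(mcid);
                buffers.putIfAbsent(mcid, new StringBuilder());
            } else if (m.group(2) != null) {
                open.push(-1);
            } else if (m.group(3) != null) {
                if (!open.isEmpty()) {
                    open.pop();
                }
            } else {
                // Attribute the text to the innermost enclosing sequence that has an MCID.
                for (int id : open) {
                    if (id >= 0) {
                        buffers.get(id).append(unescape(m.group(4)));
                        break;
                    }
                }
            }
        }
        Map<Integer, String> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, StringBuilder> e : buffers.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }

    // Handles only the escapes \( \) and \\ ; octal and other escapes are left as-is.
    private static String unescape(String raw) {
        return raw.replaceAll("\\\\([()\\\\])", "$1");
    }

    public static void main(String[] args) {
        String content =
              "/Artifact BMC\nq\n36 432.34 184.23 27.98 re\nf\nQ\nEMC\n"
            + "/Span <</MCID 13>> BDC\nq\nBT\n/F2 12 Tf\n42 442.65 Td\n"
            + "1 1 1 rg\n(The Library)Tj\nET\nQ\nEMC\n";
        System.out.println(extract(content));
    }
}
```

With a map like this built once per page, the tree-building code only needs the page and the MCID of a kid to look up its text. For the fragment in main, MCID 13 maps to "The Library", while the /Artifact rectangle contributes nothing.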



Source: https://stackoverflow.com/questions/45157963/get-marked-content-using-the-mcid-content
