PDF find out if text is underlined or a table cell

遇见更好的自我 2020-12-30 18:45

I have been playing around with PDFBox and its PDFTextStripperByArea class.

I was able to extract information if the text is bold or italic, b

5 answers
  •  既然无缘
    2020-12-30 19:16

    Here is what I have found out so far:

    PDFBox uses a resource file to bind PDF operators (instructions) to the classes that process them.

    If we take a look at the PDFTextStripper.properties resource file under:

    pdfbox\src\main\resources\org\apache\pdfbox\resources\

    we can see that for instance the BT operator is bound to the org.apache.pdfbox.util.operator.BeginText class and so on.
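
    To make the binding idea concrete, here is a minimal sketch of an operator-to-class lookup driven by a properties file. It does not use PDFBox itself: the class OperatorBindingSketch and its methods are hypothetical, and the sample text contains only the BT entry mentioned above. An operator without an entry, such as the path operator `re` here, simply has no processor and is skipped.

    ```java
    import java.io.IOException;
    import java.io.StringReader;
    import java.io.UncheckedIOException;
    import java.util.Properties;

    class OperatorBindingSketch {
        // Illustrative stand-in for the resource file; only the BT entry comes from the answer.
        static final String SAMPLE = "BT=org.apache.pdfbox.util.operator.BeginText\n";

        // Parses "operator=class name" lines into a lookup table.
        static Properties load(String text) {
            Properties bindings = new Properties();
            try {
                bindings.load(new StringReader(text));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return bindings;
        }

        // Returns the bound class name, or null when the operator has no binding.
        static String processorFor(Properties bindings, String operator) {
            return bindings.getProperty(operator);
        }

        public static void main(String[] args) {
            Properties bindings = load(SAMPLE);
            for (String op : new String[] {"BT", "re"}) {
                String cls = processorFor(bindings, op);
                System.out.println(op + " -> " + (cls == null ? "(ignored)" : cls));
            }
        }
    }
    ```

    With this sample, BT resolves to the BeginText class name, while `re` resolves to nothing and would be ignored.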

    The PDFTextStripper under

    pdfbox\src\main\java\org\apache\pdfbox\util\

    takes this into account and uses these classes to process the PDF.

    BUT all graphical objects are ignored, so there is no information about underlines or table structure!

    Now if we take a look at the PageDrawer.properties resource file, we can see that it binds almost all available operators. It is used by the PageDrawer class under

    pdfbox\src\main\java\org\apache\pdfbox\pdfviewer\

    The "trick" is now to find out which graphical operators represent underlines and tables, and to use them in combination with PDFTextStripper.

    This would mean reading the PDF file specification, which is currently way too much work.

    If someone knows which operators are responsible for drawing underlines and table lines, please let me know.
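
    As a starting point: in the PDF content-stream syntax, `re` appends a rectangle to the current path, `m` and `l` begin and extend a line segment, and `S` and `f` stroke and fill the path. As far as I know there is no dedicated underline operator for page text, so an underline usually shows up as a thin filled rectangle or a stroked line just below the baseline. The sketch below is only a geometric heuristic under that assumption. It does not call PDFBox; the class name, the method and all thresholds are hypothetical, and the rectangles would have to be collected from the path operators separately.

    ```java
    import java.awt.geom.Rectangle2D;

    class UnderlineHeuristicSketch {
        // Hypothetical thresholds, as fractions of the text height; tune per document.
        static final double MAX_THICKNESS = 0.15;
        static final double MAX_GAP = 0.30;
        static final double MIN_OVERLAP = 0.80;

        // text: box of a text run whose y is the baseline (PDF user space, y grows upward).
        // rule: a rectangle painted by re + f, or the bounds of a stroked m/l segment.
        static boolean looksLikeUnderline(Rectangle2D text, Rectangle2D rule) {
            double h = text.getHeight();
            boolean thin = rule.getHeight() <= MAX_THICKNESS * h;
            // Distance from the baseline down to the top edge of the rule.
            double gap = text.getMinY() - rule.getMaxY();
            boolean justBelow = gap >= 0 && gap <= MAX_GAP * h;
            // Horizontal extent shared by the text and the rule.
            double overlap = Math.min(text.getMaxX(), rule.getMaxX())
                           - Math.max(text.getMinX(), rule.getMinX());
            boolean covers = overlap >= MIN_OVERLAP * text.getWidth();
            return thin && justBelow && covers;
        }

        public static void main(String[] args) {
            Rectangle2D text = new Rectangle2D.Double(100, 500, 80, 12);
            Rectangle2D underline = new Rectangle2D.Double(100, 498, 80, 0.6);
            Rectangle2D tableBorder = new Rectangle2D.Double(50, 480, 400, 0.5);
            System.out.println(looksLikeUnderline(text, underline));
            System.out.println(looksLikeUnderline(text, tableBorder));
        }
    }
    ```

    Long thin rules that fail this test, horizontal or vertical, are the likely candidates for table lines; deciding whether a text run sits inside a cell would need the rules on all four sides.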
