Detect Bold, Italic and Strike Through text using PDFBox with VB.NET

萝らか妹 提交于 2019-11-28 06:14:43

问题


Is there a way to preserve the text formatting when extracting a PDF with PDFBox?

I have a program that parses a PDF document for information. When a new version of the PDF is released the authors use bold or italic text to indicate new information and Strike through or underlined to indicated omitted text. Using the base Stripper class in PDFbox returns all the text but the formatting is removed so I have no way of telling if the text is new or omitted. I'm currently using the project example code below:

    Dim doc As PDDocument = Nothing

    Try
        doc = PDDocument.load(RFPFilePath)
        Dim stripper As New PDFTextStripper()

        stripper.setAddMoreFormatting(True)
        stripper.setSortByPosition(True)
        rtxt_DocumentViewer.Text = stripper.getText(doc)

    Finally
        If doc IsNot Nothing Then
            doc.close()
        End If
    End Try

I have my parsing code working well if I simply copy and paste the PDF text into a richtextbox which preservers the formatting. I was thinking of doing this programatically by opening the PDF, select all, Copy, close the document then paste it in my richtextbox but that seems clunky.


回答1:


As the OP mentioned in a comment that a Java example would do and I've yet only used PDFBox with Java, this answer features a Java example. Furthermore, this example has been developed and tested with PDFBox version 1.8.11 only.

A customized text stripper

As already mentioned in a comment,

The bold and italic effects in the OP's sample document are generated by using a different font (containing bold or italic versions of the letters) to draw the text. The underline and strike-through effects in the sample document are generated by drawing a rectangle under / through the text line which has the width of the text line and a very small height. To extract these information, therefore, one has to extend the PDFTextStripper to somehow react to font changes and rectangles nearby text.

This is an example class extending the PDFTextStripper just like that:

public class PDFStyledTextStripper extends PDFTextStripper
{
    public PDFStyledTextStripper() throws IOException
    {
        super();
        registerOperatorProcessor("re", new AppendRectangleToPath());
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        for (TextPosition textPosition : textPositions)
        {
            Set<String> style = determineStyle(textPosition);
            if (!style.equals(currentStyle))
            {
                output.write(style.toString());
                currentStyle = style;
            }
            output.write(textPosition.getCharacter());
        }
    }

    Set<String> determineStyle(TextPosition textPosition)
    {
        Set<String> result = new HashSet<>();

        if (textPosition.getFont().getBaseFont().toLowerCase().contains("bold"))
            result.add("Bold");

        if (textPosition.getFont().getBaseFont().toLowerCase().contains("italic"))
            result.add("Italic");

        if (rectangles.stream().anyMatch(r -> r.underlines(textPosition)))
            result.add("Underline");

        if (rectangles.stream().anyMatch(r -> r.strikesThrough(textPosition)))
            result.add("StrikeThrough");

        return result;
    }

    class AppendRectangleToPath extends OperatorProcessor
    {
        public void process(PDFOperator operator, List<COSBase> arguments)
        {
            COSNumber x = (COSNumber) arguments.get(0);
            COSNumber y = (COSNumber) arguments.get(1);
            COSNumber w = (COSNumber) arguments.get(2);
            COSNumber h = (COSNumber) arguments.get(3);

            double x1 = x.doubleValue();
            double y1 = y.doubleValue();

            // create a pair of coordinates for the transformation
            double x2 = w.doubleValue() + x1;
            double y2 = h.doubleValue() + y1;

            Point2D p0 = transformedPoint(x1, y1);
            Point2D p1 = transformedPoint(x2, y1);
            Point2D p2 = transformedPoint(x2, y2);
            Point2D p3 = transformedPoint(x1, y2);

            rectangles.add(new TransformedRectangle(p0, p1, p2, p3));
        }

        Point2D.Double transformedPoint(double x, double y)
        {
            double[] position = {x,y}; 
            getGraphicsState().getCurrentTransformationMatrix().createAffineTransform().transform(
                    position, 0, position, 0, 1);
            return new Point2D.Double(position[0],position[1]);
        }
    }

    static class TransformedRectangle
    {
        public TransformedRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3)
        {
            this.p0 = p0;
            this.p1 = p1;
            this.p2 = p2;
            this.p3 = p3;
        }

        boolean strikesThrough(TextPosition textPosition)
        {
            Matrix matrix = textPosition.getTextPos();
            // TODO: This is a very simplistic implementation only working for horizontal text without page rotation
            // and horizontal rectangular strikeThroughs with p0 at the left bottom and p2 at the right top

            // Check if rectangle horizontally matches (at least) the text
            if (p0.getX() > matrix.getXPosition() || p2.getX() < matrix.getXPosition() + textPosition.getWidth() - textPosition.getFontSizeInPt() / 10.0)
                return false;
            // Check whether rectangle vertically is at the right height to underline
            double vertDiff = p0.getY() - matrix.getYPosition();
            if (vertDiff < 0 || vertDiff > textPosition.getFont().getFontDescriptor().getAscent() * textPosition.getFontSizeInPt() / 1000.0)
                return false;
            // Check whether rectangle is small enough to be a line
            return Math.abs(p2.getY() - p0.getY()) < 2;
        }

        boolean underlines(TextPosition textPosition)
        {
            Matrix matrix = textPosition.getTextPos();
            // TODO: This is a very simplistic implementation only working for horizontal text without page rotation
            // and horizontal rectangular underlines with p0 at the left bottom and p2 at the right top

            // Check if rectangle horizontally matches (at least) the text
            if (p0.getX() > matrix.getXPosition() || p2.getX() < matrix.getXPosition() + textPosition.getWidth() - textPosition.getFontSizeInPt() / 10.0)
                return false;
            // Check whether rectangle vertically is at the right height to underline
            double vertDiff = p0.getY() - matrix.getYPosition();
            if (vertDiff > 0 || vertDiff < textPosition.getFont().getFontDescriptor().getDescent() * textPosition.getFontSizeInPt() / 500.0)
                return false;
            // Check whether rectangle is small enough to be a line
            return Math.abs(p2.getY() - p0.getY()) < 2;
        }

        final Point2D p0, p1, p2, p3;
    }

    final List<TransformedRectangle> rectangles = new ArrayList<>();
    Set<String> currentStyle = Collections.singleton("Undefined");
}

(PDFStyledTextStripper.java)

In addition to what the PDFTextStripper does, this class also

  • collects rectangles from the content (defined using the re instruction) using an instance of the AppendRectangleToPath operator processor inner class,
  • checks text for the style variants from the sample document in determineStyle, and
  • whenever the style changes, adds the new style to the result in writeString.

Beware: This merely is a proof of concept! In particular

  • the implementations of the tests in TransformedRectangle.underlines(TextPosition) and TransformedRectangle#strikesThrough(TextPosition) are very simplistic and only work for horizontal text without page rotation and horizontal rectangular strikeThroughs and underlines with p0 at the left bottom and p2 at the right top;
  • all rectangles are collected, not checking whether they actually are filled with a visible color;
  • the tests for "bold" and "italic" merely inspect the name of the used font which may not suffice in general.

A test output

Using the PDFStyledTextStripper like this

String extractStyled(PDDocument document) throws IOException
{
    PDFTextStripper stripper = new PDFStyledTextStripper();
    stripper.setSortByPosition(true);
    return stripper.getText(document);
}

(from ExtractText.java, called from the test method testExtractStyledFromExampleDocument)

one gets the result

[]This is an example of plain text 

[Bold]This is an example of bold text 
[] 
[Underline]This is an example of underlined text[] 

[Italic]This is an example of italic text  
[] 
[StrikeThrough]This is an example of strike through text[]  

[Italic, Bold]This is an example of bold, italic text 

for the OP's sample document


PS The code of the PDFStyledTextStripper meanwhile has been slightly changed to also work for a sample document shared in a github issue, in particular the code of its inner class TransformedRectangle, cf. here.



来源:https://stackoverflow.com/questions/39962563/detect-bold-italic-and-strike-through-text-using-pdfbox-with-vb-net

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!