How to find table border lines in a PDF using PDFBox?

旧时难觅i 2021-01-06 08:00

I am trying to find table border lines in a PDF. I used the PrintTextLocations class of PDFBox to assemble words. Now I am looking to find the coordinates of different li

1 answer
  • 2021-01-06 08:15

    In the 1.8.* versions, the parsing capabilities of PDFBox were implemented in a not very generic way. In particular, the OperatorProcessor implementations were tightly coupled to specific parser classes; for example, the implementations dealing with path drawing operations assumed they were interacting with a PageDrawer instance.

    Thus, unless one wanted to copy & paste all those OperatorProcessor classes with minute changes, one had to derive from such a specific parser class.

    In your case, therefore, we will also derive our parser from PageDrawer, since we are interested in path drawing operations:

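    // Imports as laid out in the PDFBox 1.8.x packages
    import java.awt.Image;
    import java.awt.geom.AffineTransform;
    import java.awt.geom.GeneralPath;
    import java.awt.geom.PathIterator;
    import java.io.IOException;

    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.cos.COSStream;
    import org.apache.pdfbox.pdfviewer.PageDrawer;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDResources;
    import org.apache.pdfbox.pdmodel.common.PDRectangle;
    import org.apache.pdfbox.util.Matrix;
    import org.apache.pdfbox.util.TextPosition;

    public class PrintPaths extends PageDrawer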
    {
        //
        // constructor
        //
        public PrintPaths() throws IOException
        {
            super();
        }
    
        //
        // method overrides for mere path observation
        //
        // ignore text
        @Override
        protected void processTextPosition(TextPosition text) { }
    
        // ignore bitmaps
        @Override
        public void drawImage(Image awtImage, AffineTransform at) { }
    
        // ignore shadings
        @Override
        public void shFill(COSName shadingName) throws IOException { }
    
        @Override
        public void processStream(PDPage aPage, PDResources resources, COSStream cosStream) throws IOException
        {
            PDRectangle cropBox = aPage.findCropBox();
            this.pageSize = cropBox.createDimension();
            super.processStream(aPage, resources, cosStream);
        }
    
        @Override
        public void fillPath(int windingRule) throws IOException
        {
            printPath();
            System.out.printf("Fill; windingrule: %s\n\n", windingRule);
            getLinePath().reset();
        }
    
        @Override
        public void strokePath() throws IOException
        {
            printPath();
            System.out.printf("Stroke; unscaled width: %s\n\n", getGraphicsState().getLineWidth());
            getLinePath().reset();
        }
    
        void printPath()
        {
            GeneralPath path = getLinePath();
            PathIterator pathIterator = path.getPathIterator(null);
    
            double x = 0, y = 0;
            double coords[] = new double[6];
            while (!pathIterator.isDone()) {
                switch (pathIterator.currentSegment(coords)) {
                case PathIterator.SEG_MOVETO:
                    System.out.printf("Move to (%s %s)\n", coords[0], fixY(coords[1]));
                    x = coords[0];
                    y = coords[1];
                    break;
                case PathIterator.SEG_LINETO:
                    double width = getEffectiveWidth(coords[0] - x, coords[1] - y);
                    System.out.printf("Line to (%s %s), scaled width %s\n", coords[0], fixY(coords[1]), width);
                    x = coords[0];
                    y = coords[1];
                    break;
                case PathIterator.SEG_QUADTO:
                    System.out.printf("Quad along (%s %s) and (%s %s)\n", coords[0], fixY(coords[1]), coords[2], fixY(coords[3]));
                    x = coords[2];
                    y = coords[3];
                    break;
                case PathIterator.SEG_CUBICTO:
                    System.out.printf("Cubic along (%s %s), (%s %s), and (%s %s)\n", coords[0], fixY(coords[1]), coords[2], fixY(coords[3]), coords[4], fixY(coords[5]));
                    x = coords[4];
                    y = coords[5];
                    break;
                case PathIterator.SEG_CLOSE:
                    System.out.println("Close path");
                    break;
                }
                pathIterator.next();
            }
        }
    
        double getEffectiveWidth(double dirX, double dirY)
        {
            if (dirX == 0 && dirY == 0)
                return 0;
            Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
            double widthX = dirY;
            double widthY = -dirX;
            double widthXTransformed = widthX * ctm.getValue(0, 0) + widthY * ctm.getValue(1, 0);
            double widthYTransformed = widthX * ctm.getValue(0, 1) + widthY * ctm.getValue(1, 1);
            double factor = Math.sqrt((widthXTransformed*widthXTransformed + widthYTransformed*widthYTransformed) / (widthX*widthX + widthY*widthY));
            return getGraphicsState().getLineWidth() * factor;
        }
    }
    

    (PrintPaths.java)

    As we do not want to actually draw the page but merely extract the paths which would be drawn, we have to strip down the PageDrawer like this.

    This sample parser prints the path drawing operations to show how to do it. Obviously you can instead collect them for automated processing...
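    If you would rather collect the straight segments than print them, the loop in printPath can be turned into something like the following. This is only a sketch: SegmentCollector and collectLines are names invented for this illustration and are not part of PDFBox. It uses nothing but java.awt.geom, skips curve segments, and returns the coordinates as they are in the path (no fixY applied).

```java
import java.awt.geom.Line2D;
import java.awt.geom.Path2D;
import java.awt.geom.PathIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a PDFBox class): gathers the straight
// segments of a path instead of printing them.
final class SegmentCollector {
    private SegmentCollector() { }

    static List<Line2D.Double> collectLines(Path2D path) {
        List<Line2D.Double> lines = new ArrayList<>();
        double[] coords = new double[6];
        double x = 0, y = 0;           // current point
        double startX = 0, startY = 0; // start of the current subpath

        for (PathIterator it = path.getPathIterator(null); !it.isDone(); it.next()) {
            switch (it.currentSegment(coords)) {
            case PathIterator.SEG_MOVETO:
                x = startX = coords[0];
                y = startY = coords[1];
                break;
            case PathIterator.SEG_LINETO:
                lines.add(new Line2D.Double(x, y, coords[0], coords[1]));
                x = coords[0];
                y = coords[1];
                break;
            case PathIterator.SEG_QUADTO:
                // curves are not border line candidates here; only track the end point
                x = coords[2];
                y = coords[3];
                break;
            case PathIterator.SEG_CUBICTO:
                x = coords[4];
                y = coords[5];
                break;
            case PathIterator.SEG_CLOSE:
                // closing a subpath implicitly draws a line back to its start
                if (x != startX || y != startY) {
                    lines.add(new Line2D.Double(x, y, startX, startY));
                }
                x = startX;
                y = startY;
                break;
            default:
                break;
            }
        }
        return lines;
    }
}
```

    In strokePath and fillPath you could then call collectLines(getLinePath()) before resetting the path and store the result in a list of your own, ideally together with the effective line width, which is needed to interpret the segments.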

    You can use the parser like this:

    PDDocument document = PDDocument.load(resource);
    List<?> allPages = document.getDocumentCatalog().getAllPages();
    int i = 7; // page 8
    
    System.out.println("\n\nPage " + (i+1));
    PrintPaths printPaths = new PrintPaths();
    
    PDPage page = (PDPage) allPages.get(i);
    PDStream contents = page.getContents();
    if (contents != null)
    {
        printPaths.processStream(page, page.findResources(), contents.getStream());
    }
    

    (ExtractPaths.java)

    The output is:

    Page 8
    Move to (35.92070007324219 724.6490478515625)
    Line to (574.72998046875 724.6490478515625), scaled width 0.5981000089123845
    Stroke; unscaled width: 5.981
    
    Move to (35.92070007324219 694.4660034179688)
    Line to (574.72998046875 694.4660034179688), scaled width 0.5981000089123845
    Stroke; unscaled width: 5.981
    
    Move to (292.2610168457031 468.677001953125)
    Line to (292.8590087890625 468.677001953125), scaled width 512.9430076434463
    Stroke; unscaled width: 5129.43
    
    Move to (348.9360046386719 468.677001953125)
    Line to (349.53399658203125 468.677001953125), scaled width 512.9430076434463
    Stroke; unscaled width: 5129.43
    
    Move to (405.6090087890625 468.677001953125)
    Line to (406.2070007324219 468.677001953125), scaled width 512.9430076434463
    Stroke; unscaled width: 5129.43
    
    Move to (462.281982421875 468.677001953125)
    Line to (462.8799743652344 468.677001953125), scaled width 512.9430076434463
    Stroke; unscaled width: 5129.43
    
    Move to (518.9549560546875 468.677001953125)
    Line to (519.553955078125 468.677001953125), scaled width 512.9430076434463
    Stroke; unscaled width: 5129.43
    
    Move to (35.92070007324219 725.447998046875)
    Line to (574.72998046875 725.447998046875), scaled width 0.5981000089123845
    Stroke; unscaled width: 5.981
    
    Move to (35.92070007324219 212.5050048828125)
    Line to (574.72998046875 212.5050048828125), scaled width 0.5981000089123845
    Stroke; unscaled width: 5.981
    

    Quite peculiar: the vertical lines are actually drawn as very short (ca. 0.6 units), very thick (ca. 513 units) horizontal lines...
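    Because of this, the direction of a segment alone does not tell you whether it is a horizontal or a vertical border. One way to cope is to look at the area the stroke covers rather than at the segment itself. The following is a rough sketch under some assumptions: BorderLines, strokedBounds, and classify are made-up names, only axis-parallel segments are handled, butt line caps are assumed (the PDF default), and the aspect ratio threshold is an arbitrary choice you would have to tune.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper (not a PDFBox class): judges a stroked segment
// by the rectangle it covers, not by the direction of the segment.
final class BorderLines {
    enum Orientation { HORIZONTAL, VERTICAL, OTHER }

    private BorderLines() { }

    // Rectangle covered by stroking an axis-parallel segment with the
    // given effective (scaled) width, assuming butt line caps.
    static Rectangle2D.Double strokedBounds(double x1, double y1,
                                            double x2, double y2, double width) {
        double half = width / 2;
        if (y1 == y2) { // horizontal segment, the width extends in y direction
            return new Rectangle2D.Double(Math.min(x1, x2), y1 - half,
                                          Math.abs(x2 - x1), width);
        }
        if (x1 == x2) { // vertical segment, the width extends in x direction
            return new Rectangle2D.Double(x1 - half, Math.min(y1, y2),
                                          width, Math.abs(y2 - y1));
        }
        throw new IllegalArgumentException("segment is not axis-parallel");
    }

    // A box counts as a line if it is at least minAspectRatio times
    // longer in one direction than in the other.
    static Orientation classify(Rectangle2D box, double minAspectRatio) {
        double w = box.getWidth();
        double h = box.getHeight();
        if (w <= 0 && h <= 0) {
            return Orientation.OTHER; // nothing is painted
        }
        if (w >= minAspectRatio * h) {
            return Orientation.HORIZONTAL;
        }
        if (h >= minAspectRatio * w) {
            return Orientation.VERTICAL;
        }
        return Orientation.OTHER;
    }
}
```

    Fed with the third stroke from the output above (a horizontal segment about 0.6 units long with a scaled width of about 513), strokedBounds yields a box about 0.6 wide and 513 high, which classify reports as VERTICAL for any reasonable threshold, while the first stroke comes out as HORIZONTAL.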
