Could someone give me an example of how to extract coordinates for a 'word' using PDFBox

[愿得一人] 2020-12-11 13:53

Could someone give me an example of how to extract coordinates for a 'word' with PDFBox?

I am using this link to extract positions of individual characters: https:/

2 answers
  • 2020-12-11 14:43

    You can extract the coordinates of words by collecting all the TextPosition objects building a word and combining their bounding boxes.

    Implementing this along the lines of the two tutorials you referenced, you can extend PDFTextStripper like this:

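    import java.awt.geom.Rectangle2D;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;

    public class GetWordLocationAndSize extends PDFTextStripper {
        public GetWordLocationAndSize() throws IOException {
        }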
    
        @Override
        protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
            String wordSeparator = getWordSeparator();
            List<TextPosition> word = new ArrayList<>();
            for (TextPosition text : textPositions) {
                String thisChar = text.getUnicode();
                if (thisChar != null) {
                    if (thisChar.length() >= 1) {
                        if (!thisChar.equals(wordSeparator)) {
                            word.add(text);
                        } else if (!word.isEmpty()) {
                            printWord(word);
                            word.clear();
                        }
                    }
                }
            }
            if (!word.isEmpty()) {
                printWord(word);
                word.clear();
            }
        }
    
        void printWord(List<TextPosition> word) {
            Rectangle2D boundingBox = null;
            StringBuilder builder = new StringBuilder();
            for (TextPosition text : word) {
                Rectangle2D box = new Rectangle2D.Float(text.getXDirAdj(), text.getYDirAdj(), text.getWidthDirAdj(), text.getHeightDir());
                if (boundingBox == null)
                    boundingBox = box;
                else
                    boundingBox.add(box);
                builder.append(text.getUnicode());
            }
            System.out.println(builder.toString() + " [(X=" + boundingBox.getX() + ",Y=" + boundingBox.getY()
                     + ") height=" + boundingBox.getHeight() + " width=" + boundingBox.getWidth() + "]");
        }
    }
    

    (ExtractWordCoordinates inner class)

    and run it like this:

    PDDocument document = PDDocument.load(resource);
    PDFTextStripper stripper = new GetWordLocationAndSize();
    stripper.setSortByPosition( true );
    stripper.setStartPage( 0 );
    stripper.setEndPage( document.getNumberOfPages() );
    
    Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
    stripper.writeText(document, dummy);
    

    (ExtractWordCoordinates test testExtractWordsForGoodJuJu)

    Applied to the apache.pdf example used in the tutorials, this prints:

    2017-8-6 [(X=26.004425048828125,Y=22.00372314453125) height=5.833024024963379 width=36.31868362426758]
    Welcome [(X=226.44479370117188,Y=22.00372314453125) height=5.833024024963379 width=36.5999755859375]
    to [(X=265.5881652832031,Y=22.00372314453125) height=5.833024024963379 width=8.032623291015625]
    The [(X=276.1641845703125,Y=22.00372314453125) height=5.833024024963379 width=14.881439208984375]
    Apache [(X=293.5890197753906,Y=22.00372314453125) height=5.833024024963379 width=29.848846435546875]
    Software [(X=325.98126220703125,Y=22.00372314453125) height=5.833024024963379 width=35.271636962890625]
    Foundation! [(X=363.7962951660156,Y=22.00372314453125) height=5.833024024963379 width=47.871429443359375]
    Custom [(X=334.0334777832031,Y=157.6195068359375) height=4.546705722808838 width=25.03936767578125]
    Search [(X=360.8929138183594,Y=157.6195068359375) height=4.546705722808838 width=22.702728271484375]
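
To try the box-combining step of printWord in isolation, here is a minimal sketch that needs only java.awt.geom and no PDFBox. The class name WordBoxUnion, the helper union, and the glyph rectangles are made up for illustration; it assumes every glyph box is already given as x, y, width, height in one coordinate system.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

class WordBoxUnion {

    // Returns the smallest rectangle containing all glyph boxes, or null for an empty list.
    static Rectangle2D union(List<Rectangle2D> glyphBoxes) {
        Rectangle2D result = null;
        for (Rectangle2D box : glyphBoxes) {
            if (result == null) {
                // Copy the first box so that add() below never modifies the caller's rectangle.
                result = (Rectangle2D) box.clone();
            } else {
                result.add(box);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical glyph boxes (x, y, width, height), not taken from a real PDF.
        List<Rectangle2D> glyphs = Arrays.asList(
                new Rectangle2D.Float(10f, 20f, 5f, 8f),
                new Rectangle2D.Float(15f, 20f, 6f, 8f),
                new Rectangle2D.Float(21f, 18f, 4f, 10f));
        Rectangle2D word = union(glyphs);
        // prints "X=10.0 Y=18.0 width=15.0 height=10.0"
        System.out.println("X=" + word.getX() + " Y=" + word.getY()
                + " width=" + word.getWidth() + " height=" + word.getHeight());
    }
}
```

The third glyph is taller than the other two, so the combined box grows upwards to Y=18 while its width spans all three glyphs.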
    
  • 2020-12-11 14:43

    You can create a CustomPDFTextStripper that extends PDFTextStripper and overrides protected void writeString(String text, List<TextPosition> textPositions). In the overridden method, split textPositions at the word separator to get a List<TextPosition> for each word. Then join the characters of each word and compute its bounding box.

    The full example below also draws the resulting bounding boxes onto a rendered image of the page.

    package com.example;
    
    import lombok.Value;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.rendering.PDFRenderer;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;
    import org.junit.Ignore;
    import org.junit.Test;
    
    import javax.imageio.ImageIO;
    import java.awt.*;
    import java.awt.image.BufferedImage;
    import java.io.*;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Collectors;
    
    public class PdfBoxTest {
    
        private static final String BASE_DIR_PATH = "C:\\Users\\Milan\\50330484";
        private static final String INPUT_FILE_PATH = "input.pdf";
        private static final String OUTPUT_IMAGE_PATH = "output.jpg";
        private static final String OUTPUT_BBOX_IMAGE_PATH = "output-bbox.jpg";
    
        private static final float FROM_72_TO_300_DPI = 300.0f / 72.0f;
    
        @Test
        public void run() throws Exception {
            pdfToImage();
            drawBoundingBoxes();
        }
    
        @Ignore
        @Test
        public void pdfToImage() throws IOException {
            // try-with-resources closes the document even if rendering fails
            try (PDDocument document = PDDocument.load(new File(BASE_DIR_PATH, INPUT_FILE_PATH))) {
                PDFRenderer renderer = new PDFRenderer(document);
                BufferedImage image = renderer.renderImageWithDPI(0, 300);
                ImageIO.write(image, "JPEG", new File(BASE_DIR_PATH, OUTPUT_IMAGE_PATH));
            }
        }
    
        @Ignore
        @Test
        public void drawBoundingBoxes() throws IOException {
    
            List<WordWithBBox> words;
            // Close the document as soon as the words have been extracted
            try (PDDocument document = PDDocument.load(new File(BASE_DIR_PATH, INPUT_FILE_PATH))) {
                words = getWords(document);
            }

            draw(words);
        }
    
        private List<WordWithBBox> getWords(PDDocument document) throws IOException {
    
            CustomPDFTextStripper customPDFTextStripper = new CustomPDFTextStripper();
            customPDFTextStripper.setSortByPosition(true);
            customPDFTextStripper.setStartPage(0);
            customPDFTextStripper.setEndPage(1);
    
            Writer writer = new OutputStreamWriter(new ByteArrayOutputStream());
            customPDFTextStripper.writeText(document, writer);
    
            List<WordWithBBox> words = customPDFTextStripper.getWords();
    
            return words;
        }
    
        private void draw(List<WordWithBBox> words) throws IOException {
    
            BufferedImage bufferedImage = ImageIO.read(new File(BASE_DIR_PATH, OUTPUT_IMAGE_PATH));
    
            Graphics2D graphics = bufferedImage.createGraphics();
    
            graphics.setColor(Color.GREEN);
    
            List<Rectangle> rectangles = words.stream()
                    .map(word -> new Rectangle(word.getX(), word.getY(), word.getWidth(), word.getHeight()))
                    .collect(Collectors.toList());
            rectangles.forEach(graphics::draw);
    
            graphics.dispose();
    
            ImageIO.write(bufferedImage, "JPEG", new File(BASE_DIR_PATH, OUTPUT_BBOX_IMAGE_PATH));
        }
    
        private class CustomPDFTextStripper extends PDFTextStripper {
    
            private final List<WordWithBBox> words;
    
            public CustomPDFTextStripper() throws IOException {
                this.words = new ArrayList<>();
            }
    
            public List<WordWithBBox> getWords() {
                return new ArrayList<>(words);
            }
    
            @Override
            protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
    
                String wordSeparator = getWordSeparator();
                List<TextPosition> wordTextPositions = new ArrayList<>();
    
                for (TextPosition textPosition : textPositions) {
                    String str = textPosition.getUnicode();
                    if (wordSeparator.equals(str)) {
                        if (!wordTextPositions.isEmpty()) {
                            this.words.add(createWord(wordTextPositions));
                            wordTextPositions.clear();
                        }
                    } else {
                        wordTextPositions.add(textPosition);
                    }
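                }

                // Flush the last word: the text chunk does not have to end with a separator
                if (!wordTextPositions.isEmpty()) {
                    this.words.add(createWord(wordTextPositions));
                }

                super.writeString(text, textPositions);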
            }
    
            private WordWithBBox createWord(List<TextPosition> wordTextPositions) {
    
                String word = wordTextPositions.stream()
                        .map(TextPosition::getUnicode)
                        .collect(Collectors.joining());
    
                int minX = Integer.MAX_VALUE;
                int minY = Integer.MAX_VALUE;
                int maxX = Integer.MIN_VALUE;
                int maxY = Integer.MIN_VALUE;
    
                for (TextPosition wordTextPosition : wordTextPositions) {
    
                    minX = Math.min(minX, from72To300Dpi(wordTextPosition.getXDirAdj()));
                    minY = Math.min(minY, from72To300Dpi(wordTextPosition.getYDirAdj() - wordTextPosition.getHeightDir()));
                    maxX = Math.max(maxX, from72To300Dpi(wordTextPosition.getXDirAdj() + wordTextPosition.getWidthDirAdj()));
                    maxY = Math.max(maxY, from72To300Dpi(wordTextPosition.getYDirAdj()));
                }
    
                return new WordWithBBox(word, minX, minY, maxX - minX, maxY - minY);
            }
        }
    
        private int from72To300Dpi(float f) {
            return Math.round(f * FROM_72_TO_300_DPI);
        }
    
        @Value
        private class WordWithBBox {
            private final String word;
            private final int x;
            private final int y;
            private final int width;
            private final int height;
        }
    }
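
The factor 300/72 in from72To300Dpi reflects that the text positions are in PDF units (72 per inch by default) while the page image was rendered at 300 DPI. Here is a small sketch of that conversion on its own; the class DpiScaling, the helpers toPixels and toPixelBox, and the sample box are hypothetical and only mirror the arithmetic in createWord.

```java
class DpiScaling {

    // 72 PDF units per inch by default, image rendered at 300 pixels per inch.
    static final float FROM_72_TO_300_DPI = 300.0f / 72.0f;

    // Scales one coordinate from PDF units to pixels of the 300 DPI image.
    static int toPixels(float points) {
        return Math.round(points * FROM_72_TO_300_DPI);
    }

    // Converts a box given by its corners (top-left origin) into {x, y, width, height} in pixels.
    static int[] toPixelBox(float minX, float minY, float maxX, float maxY) {
        int x = toPixels(minX);
        int y = toPixels(minY);
        return new int[] { x, y, toPixels(maxX) - x, toPixels(maxY) - y };
    }

    public static void main(String[] args) {
        // Hypothetical word box: 36 x 12 units with its top-left corner at (72, 144).
        int[] box = toPixelBox(72f, 144f, 108f, 156f);
        // prints "[300, 600, 150, 50]"
        System.out.println(java.util.Arrays.toString(box));
    }
}
```

Rounding each corner first and subtracting afterwards, as createWord does, keeps neighbouring boxes aligned on the same pixel grid.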
    

    Note:

    If you are interested in other options, you can also look at Poppler.

    Convert the PDF to an image:

    pdftoppm -r 300 -jpeg input.pdf output
    

    Generate an XHTML file containing bounding box information for each word in the file:

    pdftotext -r 300 -bbox input.pdf
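
If you want to read those word boxes back into Java, a rough sketch could look like the one below. It assumes the generated file contains elements of the form <word xMin="…" yMin="…" xMax="…" yMax="…">text</word> with the attributes in exactly that order; check this against the output of your Poppler version. The class BboxWordParser, its parse method, and the sample fragment are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BboxWordParser {

    // Assumed shape of one word element; the attribute order is assumed to be fixed.
    private static final Pattern WORD = Pattern.compile(
            "<word xMin=\"([^\"]+)\" yMin=\"([^\"]+)\" xMax=\"([^\"]+)\" yMax=\"([^\"]+)\">([^<]*)</word>");

    // Returns one entry per word: the text followed by xMin, yMin, xMax, yMax as strings.
    static List<String[]> parse(String xhtml) {
        List<String[]> words = new ArrayList<>();
        Matcher m = WORD.matcher(xhtml);
        while (m.find()) {
            words.add(new String[] { m.group(5), m.group(1), m.group(2), m.group(3), m.group(4) });
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical fragment; the numbers are invented, not real pdftotext output.
        String sample = "<word xMin=\"56.8\" yMin=\"57.2\" xMax=\"118.3\" yMax=\"70.5\">Hello</word>\n"
                + "<word xMin=\"121.0\" yMin=\"57.2\" xMax=\"160.4\" yMax=\"70.5\">World</word>";
        for (String[] w : parse(sample)) {
            System.out.println(w[0] + " [" + w[1] + ", " + w[2] + ", " + w[3] + ", " + w[4] + "]");
        }
    }
}
```

The regular expression keeps the sketch short, but it does not decode entities such as &amp;amp; in the word text; a real XML parser would be the more robust choice for production use.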
    