How to extract text from a PDF file with Apache PDFBox

不知归路 2020-12-08 05:02

I would like to extract text from a given PDF file with Apache PDFBox.

I wrote this code:

PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;

5 Answers
  •  [愿得一人]
    2020-12-08 05:25

    I ran your code and it worked correctly, so the problem is probably the file path you passed in. I put my PDF at the root of the C: drive and hard-coded the path. Here is my code:

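    // PDFBox 2.0.x (e.g. 2.0.8): PDFParser requires an org.apache.pdfbox.io.RandomAccessRead,
    // so the file is wrapped in org.apache.pdfbox.io.RandomAccessFile.
    // (PDFBox 1.8.x instead accepted new PDFParser(new FileInputStream(file)).)
    import java.io.File;
    import java.io.IOException;
    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.io.RandomAccessFile;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PDFReader {
        public static void main(String[] args) throws IOException {
            File file = new File("C:/my.pdf"); // hard-coded path; adjust to your file
            PDFParser parser = new PDFParser(new RandomAccessFile(file, "r"));
            parser.parse(); // parse the raw PDF structure
            // try-with-resources closes the COSDocument when extraction is done
            try (COSDocument cosDoc = parser.getDocument()) {
                PDFTextStripper pdfStripper = new PDFTextStripper();
                PDDocument pdDoc = new PDDocument(cosDoc);
                pdfStripper.setStartPage(1); // page numbers are 1-based
                pdfStripper.setEndPage(5);   // end page is inclusive
                String parsedText = pdfStripper.getText(pdDoc);
                System.out.println(parsedText);
            }
        }
    }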
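
Since a wrong file path is the most likely cause, it may help to check the file before handing it to PDFBox. The following is only a minimal sketch that uses the JDK alone (no PDFBox calls). The class name `PdfPathCheck` and the method `looksLikePdf` are hypothetical names chosen for illustration, and the check is deliberately strict: it assumes the `%PDF-` header sits at the very first byte of the file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox: a pre-flight check for the input path.
public class PdfPathCheck {

    /** Returns true if the path is a readable regular file whose first bytes are "%PDF-". */
    public static boolean looksLikePdf(Path path) throws IOException {
        if (!Files.isRegularFile(path) || !Files.isReadable(path)) {
            return false; // missing file, a directory, or no read permission
        }
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        byte[] head = new byte[magic.length];
        try (InputStream in = Files.newInputStream(path)) {
            int total = 0;
            // read() may return fewer bytes than requested, so loop until the
            // header buffer is full or the stream ends
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
            return total == magic.length && Arrays.equals(head, magic);
        }
    }

    public static void main(String[] args) throws IOException {
        // Falls back to the hard-coded path from the answer if no argument is given
        Path path = Paths.get(args.length > 0 ? args[0] : "C:/my.pdf");
        String verdict = looksLikePdf(path)
                ? "looks like a PDF"
                : "missing, unreadable, or not a PDF";
        System.out.println(path.toAbsolutePath() + " -> " + verdict);
    }
}
```

Printing the absolute path makes it easy to spot a relative path that resolves somewhere unexpected. If the check fails, the problem is with the path or the file itself rather than with the extraction code.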
    
