How to extract text from a PDF file with Apache PDFBox

不知归路 2020-12-08 05:02

I would like to extract text from a given PDF file with Apache PDFBox.

I wrote this code:

PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;


        
5 answers
  • 2020-12-08 05:14

    Using PDFBox 2.0.7, this is how I get the text of a PDF:

    static String getText(File pdfFile) throws IOException {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument doc = PDDocument.load(pdfFile)) {
            return new PDFTextStripper().getText(doc);
        }
    }
    

    Call it like this:

    try {
        String text = getText(new File("/home/me/test.pdf"));
        System.out.println("Text in PDF: " + text);
    } catch (IOException e) {
        e.printStackTrace();
    }
    

    Since user oivemaria asked in the comments:

    You can use PDFBox in your application by adding it to your dependencies in build.gradle:

    dependencies {
      // 'compile' was removed in Gradle 7; 'implementation' is its replacement
      implementation group: 'org.apache.pdfbox', name: 'pdfbox', version: '2.0.7'
    }
    

    Here's more on dependency management using Gradle.


    If you want to keep the PDF's format in the parsed text, give PDFLayoutTextStripper a try.

  • 2020-12-08 05:22

    Maven dependency:

        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox</artifactId>
            <version>2.0.9</version>
        </dependency>
    

    Then the function to get the PDF text as a String:

    private static String readPDF(File pdf) throws InvalidPasswordException, IOException {
        // try-with-resources closes the document when the method returns
        try (PDDocument document = PDDocument.load(pdf)) {
            if (document.isEncrypted()) {
                return null; // encrypted documents are not handled here
            }

            PDFTextStripper tStripper = new PDFTextStripper();
            String pdfFileInText = tStripper.getText(document);

            // split on line breaks and normalize every line ending to "\n"
            String[] lines = pdfFileInText.split("\\r?\\n");
            StringBuilder sb = new StringBuilder();
            for (String line : lines) {
                sb.append(line).append("\n");
            }
            return sb.toString();
        }
    }
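
The line-splitting step above does not depend on PDFBox, so it can be tried on its own. The following is a minimal sketch, assuming the text has already been extracted into a String. The class name `ExtractedTextLines` and the method `nonBlankLines` are hypothetical names introduced for illustration; dropping blank lines is an added assumption, not something the answer above does.

```java
import java.util.ArrayList;
import java.util.List;

final class ExtractedTextLines {

    // Splits text on "\n" or "\r\n", removes trailing whitespace from each
    // line, and drops lines that are empty after that.
    static List<String> nonBlankLines(String text) {
        List<String> result = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            String cleaned = line.replaceAll("\\s+$", "");
            if (!cleaned.isEmpty()) {
                result.add(cleaned);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for text returned by PDFTextStripper.getText(...)
        String extracted = "Title\r\n\r\n  body \n\n";
        for (String line : nonBlankLines(extracted)) {
            System.out.println("[" + line + "]");
        }
    }
}
```

Returning a List instead of a rebuilt String leaves it to the caller to decide how to join the lines again.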
    
  • 2020-12-08 05:25

    I ran your code and it worked properly. The problem may be the file path you passed in. I put my PDF on the C: drive and hard-coded the path. Here is my code:

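    // PDFBox 2.0.8: PDFParser requires an org.apache.pdfbox.io.RandomAccessRead,
    // so the file is opened with org.apache.pdfbox.io.RandomAccessFile
    // (not java.io.RandomAccessFile) instead of a FileInputStream.
    import java.io.File;
    import java.io.IOException;
    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.io.RandomAccessFile;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PDFReader {
        public static void main(String[] args) throws IOException {
            File file = new File("C:/my.pdf");
            PDFParser parser = new PDFParser(new RandomAccessFile(file, "r"));
            parser.parse();
            // closing the COSDocument releases the underlying file
            try (COSDocument cosDoc = parser.getDocument()) {
                PDFTextStripper pdfStripper = new PDFTextStripper();
                PDDocument pdDoc = new PDDocument(cosDoc);
                // extract pages 1 to 5 (1-based, inclusive)
                pdfStripper.setStartPage(1);
                pdfStripper.setEndPage(5);
                String parsedText = pdfStripper.getText(pdDoc);
                System.out.println(parsedText);
            }
        }
    }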
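
Since the answer above suspects the file path, it may help to check the path before handing it to PDFBox. The following is a minimal sketch using only the standard library. The class name `PdfPathCheck` and the method `describeProblem` are hypothetical, and the path "C:/my.pdf" is simply the one used in the answer above.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

final class PdfPathCheck {

    // Returns an empty Optional when the path looks usable, otherwise a
    // short description of what is wrong with it.
    static Optional<String> describeProblem(File file) {
        Path path = file.toPath();
        if (!Files.exists(path)) {
            return Optional.of("file does not exist: " + file.getAbsolutePath());
        }
        if (!Files.isRegularFile(path)) {
            return Optional.of("not a regular file: " + file.getAbsolutePath());
        }
        if (!Files.isReadable(path)) {
            return Optional.of("file is not readable: " + file.getAbsolutePath());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        File file = new File("C:/my.pdf");
        System.out.println(describeProblem(file).orElse("path looks fine"));
    }
}
```

A check like this only rules out path mistakes; a file that exists and is readable can still fail to parse if it is not a valid PDF.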
    
  • 2020-12-08 05:25

    This works for extracting data from a PDF file that has text content, using PDFBox 2.0.6:

    import java.io.File;
    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PDFTextExtractor {
        public static void main(String[] args) {
            // Arguments: file path, page number (1-based), start text, end text
            System.out.println(readParaFromPDF("C:\\sample1.pdf", 3, "Enter Start Text Here", "Enter Ending Text Here"));
        }

        public static String readParaFromPDF(String pdfPath, int pageNo, String startIdentifier, String endIdentifier) {
            // try-with-resources closes the document when extraction is done
            try (PDDocument document = PDDocument.load(new File(pdfPath))) {
                if (document.isEncrypted()) {
                    return "";
                }
                PDFTextStripper stripper = new PDFTextStripper();
                // restrict extraction to the single requested page
                stripper.setStartPage(pageNo);
                stripper.setEndPage(pageNo);
                String pageText = stripper.getText(document);
                int startIndex = pageText.indexOf(startIdentifier);
                int endIndex = pageText.indexOf(endIdentifier);
                // the end marker itself is appended to the result
                return pageText.substring(startIndex, endIndex) + endIdentifier;
            } catch (IOException | IndexOutOfBoundsException e) {
                // unreadable file, or a marker was not found on the page
                return "No ParaGraph Found";
            }
        }
    }
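
The marker search in the answer above looks for the end text from the beginning of the page, so an end marker that also occurs before the start marker makes the lookup fail. The following is a minimal sketch of a variant that searches for the end marker only after the start marker, using plain Java strings. The class name `MarkerExtractor` and the method `between` are hypothetical names, and the sample text stands in for what PDFTextStripper would return.

```java
import java.util.Optional;

final class MarkerExtractor {

    // Returns the text from the first startMarker up to and including the
    // first endMarker that follows it, or an empty Optional if either
    // marker is missing.
    static Optional<String> between(String text, String startMarker, String endMarker) {
        int start = text.indexOf(startMarker);
        if (start < 0) {
            return Optional.empty();
        }
        // Begin the search for the end marker after the start marker.
        int end = text.indexOf(endMarker, start + startMarker.length());
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(text.substring(start, end + endMarker.length()));
    }

    public static void main(String[] args) {
        String page = "Header\nSTART some paragraph END\nFooter";
        System.out.println(between(page, "START", "END").orElse("No paragraph found"));
    }
}
```

Returning an Optional lets the caller tell "marker not found" apart from a paragraph that happens to contain the fallback message.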
    
  • 2020-12-08 05:28

    PDFBox 2.0.3 also has a command-line tool.

    1. Download the pdfbox-app jar file
    2. java -jar pdfbox-app-2.0.3.jar ExtractText [OPTIONS] <inputfile> [output-text-file]
    Options:
      -password  <password>        : Password to decrypt document
      -encoding  <output encoding> : UTF-8 (default) or ISO-8859-1, UTF-16BE, UTF-16LE, etc.
      -console                     : Send text to console instead of file
      -html                        : Output in HTML format instead of raw text
      -sort                        : Sort the text before writing
      -ignoreBeads                 : Disables the separation by beads
      -debug                       : Enables debug output about the time consumption of every stage
      -startPage <number>          : The first page to start extraction(1 based)
      -endPage <number>            : The last page to extract(inclusive)
      <inputfile>                  : The PDF document to use
      [output-text-file]           : The file to write the text to
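
The command line above can also be assembled from Java, for example to run the tool as a separate process. The following is a minimal sketch that only builds the argument list, using the `-console`, `-startPage` and `-endPage` options listed above. The class name `ExtractTextCommand`, the method `build`, and the file name "input.pdf" are hypothetical, and the jar is assumed to sit in the working directory.

```java
import java.util.ArrayList;
import java.util.List;

final class ExtractTextCommand {

    // Builds: java -jar <jar> ExtractText -console -startPage N -endPage M <input>
    static List<String> build(String jarPath, String inputPdf, int startPage, int endPage) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("ExtractText");
        cmd.add("-console");                    // write text to stdout, not a file
        cmd.add("-startPage");
        cmd.add(String.valueOf(startPage));     // 1-based
        cmd.add("-endPage");
        cmd.add(String.valueOf(endPage));       // inclusive
        cmd.add(inputPdf);                      // options come before the input file
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("pdfbox-app-2.0.3.jar", "input.pdf", 1, 2);
        // Only printed here; to run it, pass the list to
        // new ProcessBuilder(cmd).inheritIO().start()
        System.out.println(String.join(" ", cmd));
    }
}
```

Keeping the arguments as a list, rather than one concatenated string, avoids quoting problems when a path contains spaces.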
    