How to get raw text from a PDF file using Java

谎友^ 2020-12-02 08:31

I have some PDF files. Using PDFBox, I converted them to text and stored the results in text files. Now I want to remove the following from the text files:

  1. Hyperlinks
  2. Al
5 answers
  •  情书的邮戳
    2020-12-02 09:09

    You can use iText for this.

    // Imports: iText 5.x classes, plus IOException for the second example
    
    import java.io.IOException;
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfTextExtractor;
    

    For example:

    try {
        // INPUTFILE is a String holding the path to the PDF, defined elsewhere.
        PdfReader reader = new PdfReader(INPUTFILE);
        int n = reader.getNumberOfPages(); // total page count; valid page numbers are 1..n
        // Extract the text of one particular page (here page 2).
        String str = PdfTextExtractor.getTextFromPage(reader, 2);
        System.out.println(str);
        reader.close();
    } catch (Exception e) {
        System.out.println(e);
    }
    

    Another example:

    try {
        PdfReader reader = new PdfReader("c:/temp/test.pdf");
        System.out.println("This PDF has " + reader.getNumberOfPages() + " pages.");
        // Extract the text of page 2 (pages are numbered from 1).
        String page = PdfTextExtractor.getTextFromPage(reader, 2);
        System.out.println("Page Content:\n\n" + page + "\n\n");
        System.out.println("Is this document tampered: " + reader.isTampered());
        System.out.println("Is this document encrypted: " + reader.isEncrypted());
        reader.close(); // release the underlying file
    } catch (IOException e) {
        e.printStackTrace();
    }
    

    The examples above only extract the text. Removing hyperlinks, bullets, headings and numbers requires additional processing of the extracted text.
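
    One possible way to do that extra processing is with regular expressions on the extracted text. The sketch below is only an illustration under several assumptions: the class name `ExtractedTextCleaner` and its `clean` method are hypothetical, "hyperlinks" is taken to mean URLs starting with `http://`, `https://` or `www.`, and "numbers" is taken to mean list numbering such as `1.` or `(2)` at the start of a line. Headings are not handled, because plain text carries no font information and they cannot be detected reliably by pattern alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText or PDFBox.
public class ExtractedTextCleaner {

    // Assumption: a hyperlink is anything starting with http://, https:// or www.
    // and running up to the next whitespace character.
    private static final Pattern URL =
            Pattern.compile("(?i)\\b(?:https?://|www\\.)\\S+");

    // Assumption: list markers are bullet glyphs, '*' or '-', or numbering
    // such as "1.", "2)" or "(3)", at the start of a line and followed by a space.
    private static final Pattern LIST_MARKER =
            Pattern.compile("^\\s*(?:[\\u2022\\u25AA\\u25E6\\u00B7*\\-]+|\\(?\\d+[.)])\\s+");

    public static String clean(String text) {
        List<String> kept = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String s = URL.matcher(line).replaceAll("");      // drop URLs
            s = LIST_MARKER.matcher(s).replaceFirst("");      // drop leading bullet or number
            s = s.replaceAll("\\s{2,}", " ").trim();          // tidy leftover whitespace
            if (!s.isEmpty()) {
                kept.add(s);                                  // skip lines that became empty
            }
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        String sample = "\u2022 See https://example.com for details\n"
                + "1. First item\n"
                + "(2) Second item\n"
                + "Plain line";
        System.out.println(clean(sample));
    }
}
```

    The string returned by `PdfTextExtractor.getTextFromPage` (or the contents of the text files produced with PDFBox) could be passed to `clean`. The patterns will likely need tuning for real documents, for example when a URL is wrapped across two lines.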
