How to get raw text from pdf file using java

后端未结

关注

 5  2019

谎友^ 2020-12-02 08:31

I have some pdf files, Using pdfbox i have converted them into text and stored into text files, Now from the text files i want to remove

Hyperlinks
Al

5条回答

一整个雨季 (楼主)

2020-12-02 09:19

Using pdfbox we can achive this

Example :

public static void main(String args[]) {

    PDFParser parser = null;
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    PDFTextStripper pdfStripper;

    String parsedText;
    String fileName = "E:\\Files\\Small Files\\PDF\\JDBC.pdf";
    File file = new File(fileName);
    try {
        parser = new PDFParser(new FileInputStream(file));
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        parsedText = pdfStripper.getText(pdDoc);
        System.out.println(parsedText.replaceAll("[^A-Za-z0-9. ]+", ""));
    } catch (Exception e) {
        e.printStackTrace();
        try {
            if (cosDoc != null)
                cosDoc.close();
            if (pdDoc != null)
                pdDoc.close();
        } catch (Exception e1) {
            e1.printStackTrace();
        }

    }
}

0 讨论(0)

查看其它5个回答