How to get raw text from a PDF file using Java

谎友^ 2020-12-02 08:31

I have some PDF files. Using PDFBox, I converted them to text and stored the output in text files. Now I want to remove the following from the text files:

  1. Hyperlinks
  2. Al
5 answers
  •  一整个雨季
    2020-12-02 09:19

    Using PDFBox, we can achieve this.

    Example:

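    import java.io.File;
    import java.io.FileInputStream;
    import org.apache.pdfbox.cos.COSDocument;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    // Targets the PDFBox 1.x API, where PDFParser accepts an InputStream.
    public static void main(String args[]) {

        PDDocument pdDoc = null;
        COSDocument cosDoc = null;

        String fileName = "E:\\Files\\Small Files\\PDF\\JDBC.pdf";
        File file = new File(fileName);
        try {
            PDFParser parser = new PDFParser(new FileInputStream(file));
            parser.parse();
            cosDoc = parser.getDocument();
            PDFTextStripper pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            String parsedText = pdfStripper.getText(pdDoc);
            // Keep only letters, digits, dots, and spaces (this also drops line breaks).
            System.out.println(parsedText.replaceAll("[^A-Za-z0-9. ]+", ""));
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Release the document whether or not extraction succeeded.
            try {
                if (pdDoc != null)
                    pdDoc.close();
                else if (cosDoc != null)
                    cosDoc.close();
            } catch (Exception e1) {
                e1.printStackTrace();
            }
        }
    }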
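
The question asks specifically about removing hyperlinks, while the regex in the answer above is blunt: it deletes every character that is not a letter, digit, dot, or space, which leaves fragments such as `httpswww.example.com` behind. A minimal sketch of a more targeted step, using only `java.util.regex`, is shown below. The class name `TextCleaner`, the method `removeHyperlinks`, and the URL pattern itself are assumptions made for illustration and are not part of PDFBox. The pattern treats anything starting with `http://`, `https://`, `ftp://`, or `www.` as a link and removes it up to the next whitespace, so punctuation attached to the end of a link is removed too.

```java
import java.util.regex.Pattern;

public class TextCleaner {

    // Assumed definition of a "hyperlink": a scheme or "www." prefix followed
    // by non-whitespace characters. Adjust this to match the real data.
    private static final Pattern URL =
            Pattern.compile("(?i)\\b(?:https?://|ftp://|www\\.)\\S+");

    /** Removes URL-like tokens, then collapses the spaces and tabs left behind. */
    public static String removeHyperlinks(String text) {
        String withoutLinks = URL.matcher(text).replaceAll("");
        return withoutLinks.replaceAll("[ \\t]{2,}", " ").trim();
    }

    public static void main(String[] args) {
        String sample = "See https://pdfbox.apache.org/ for docs, or www.example.com today.";
        // Prints: See for docs, or today.
        System.out.println(removeHyperlinks(sample));
    }
}
```

In the answer's code this could be applied to `parsedText` before printing or writing it to a file, in place of (or ahead of) the broad character filter.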
    
