PDF text extraction using iText

不要未来只要你来 2020-12-17 19:21

We are doing research in information extraction, and we would like to use iText.

We are in the process of exploring iText. According to the literature we have revie

2 answers

猫巷女王i (original poster)

2020-12-17 19:44

As Theodore said, you can extract text from a PDF, and as Chris pointed out, this only works as long as the content is actually text (not outlines or bitmaps).
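One rough way to spot pages that fall under that caveat is to check whether extraction returned anything at all. The sketch below is only a heuristic and uses plain Java; `EmptyPageCheck` and `pagesWithoutText` are hypothetical names, and the input is assumed to be a list holding the text you already extracted for each page, in page order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EmptyPageCheck {

    /**
     * Returns the 1-based numbers of pages whose extracted text is blank.
     * Assumption: pageTexts.get(0) holds the text extracted from page 1.
     */
    public static List<Integer> pagesWithoutText(List<String> pageTexts) {
        List<Integer> blankPages = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i);
            // A null or whitespace-only result suggests the page holds
            // bitmaps or outlines rather than real text.
            if (text == null || text.trim().isEmpty()) {
                blankPages.add(i + 1); // PDF page numbers start at 1
            }
        }
        return blankPages;
    }

    public static void main(String[] args) {
        List<String> extracted = Arrays.asList("Preface\nSome text", "   ", "");
        System.out.println(pagesWithoutText(extracted)); // prints [2, 3]
    }
}
```

A blank result does not prove the page is a scan (it could simply be an empty page), so treat the returned page numbers as candidates to inspect by hand.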

The best thing to do is buy Bruno Lowagie's book iText in Action. Chapter 15 of the second edition covers extracting text.

You can also look at his site for examples: http://itextpdf.com/examples/iia.php?id=279

You can parse the PDF to create a plain text file. Here is a code example:

/*
 * This class is part of the book "iText in Action - 2nd Edition"
 * written by Bruno Lowagie (ISBN: 9781935182610)
 * For more info, go to: http://itextpdf.com/examples/
 * This example only works with the AGPL version of iText.
 */

package part4.chapter15;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class ExtractPageContent {

    /** The original PDF that will be parsed. */
    public static final String PREFACE = "resources/pdfs/preface.pdf";
    /** The resulting text file. */
    public static final String RESULT = "results/part4/chapter15/preface.txt";

    /**
     * Parses a PDF to a plain text file.
     * @param pdf the original PDF
     * @param txt the resulting text
     * @throws IOException
     */
    public void parsePdf(String pdf, String txt) throws IOException {
        PdfReader reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        PrintWriter out = new PrintWriter(new FileOutputStream(txt));
        TextExtractionStrategy strategy;
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
            out.println(strategy.getResultantText());
        }
        reader.close();
        out.flush();
        out.close();
    }

    /**
     * Main method.
     * @param    args    no arguments needed
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        new ExtractPageContent().parsePdf(PREFACE, RESULT);
    }
}

Note the license stated in the header comment: this example only works with the AGPL version of iText.

The other examples on that site show how to leave out parts of the text or how to extract only specific parts of the PDF.
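To give an idea of what extracting only part of a page can look like, here is a minimal sketch, assuming iText 5.x (the same `com.itextpdf.text.pdf.parser` package as above) is on the classpath. The class name `ExtractPageRegion`, the helper names, and the rectangle coordinates are my own assumptions for illustration, not taken from the book; adjust the rectangle to the area you need.

```java
import java.io.IOException;

import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.FilteredTextRenderListener;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.RegionTextRenderFilter;
import com.itextpdf.text.pdf.parser.RenderFilter;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class ExtractPageRegion {

    /** Builds a strategy that only keeps text located inside the given rectangle. */
    static TextExtractionStrategy regionStrategy(Rectangle area) {
        RenderFilter filter = new RegionTextRenderFilter(area);
        return new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
    }

    /** Extracts the text inside the rectangle from one page (page numbers start at 1). */
    static String extractRegion(String pdf, int page, Rectangle area) throws IOException {
        PdfReader reader = new PdfReader(pdf);
        try {
            // A fresh strategy per call, so text from earlier pages is not carried over.
            return PdfTextExtractor.getTextFromPage(reader, page, regionStrategy(area));
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed coordinates in PDF user space: lower-left (70, 80), upper-right (490, 580).
        Rectangle body = new Rectangle(70, 80, 490, 580);
        System.out.println(extractRegion("resources/pdfs/preface.pdf", 1, body));
    }
}
```

Text outside the rectangle (page headers and footers, for example) is dropped, which is one way to leave out parts of the text. As with the example above, this relies on the AGPL version of iText.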

Hope it helps.
