Extract text from a large pdf with Tika

Question

I am trying to extract text from a large PDF, but I only get the first pages. I need all of the text to be passed to a String variable.

This is the code:

import java.io.File;

import org.apache.tika.Tika;

public class ParsePDF {
    public static void main(String[] args) {
        try {
            File file = new File("C:/vlarge.pdf");

            String content = new Tika().parseToString(file);

            System.out.println("The Content: " + content);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Answer 1:

From the Javadocs:

To avoid unpredictable excess memory use, the returned string contains only up to getMaxStringLength() first characters extracted from the input document. Use the setMaxStringLength(int) method to adjust this limitation.

Calling setMaxStringLength(-1) will disable this limit.
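
Applied to the code in the question, that could look like the sketch below. The class name ParseWholePdf and the helper methods newUnlimitedTika and extractAll are made up for illustration; only setMaxStringLength, getMaxStringLength and parseToString come from Tika. It assumes Tika and its PDF parser are on the classpath, and that the full text fits in memory, since removing the limit means the whole document ends up in one String.

```java
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class ParseWholePdf {

    // Hypothetical helper: builds a Tika facade with the string length limit disabled.
    static Tika newUnlimitedTika() {
        Tika tika = new Tika();
        tika.setMaxStringLength(-1); // -1 disables the limit, per the Javadoc quoted above
        return tika;
    }

    // Hypothetical helper: returns all text Tika can extract from the file.
    static String extractAll(File file) throws IOException, TikaException {
        return newUnlimitedTika().parseToString(file);
    }

    public static void main(String[] args) throws Exception {
        // The path from the question is used when no argument is given.
        File file = new File(args.length > 0 ? args[0] : "C:/vlarge.pdf");
        String content = extractAll(file);
        System.out.println("Extracted " + content.length() + " characters");
    }
}
```

If the document is too large to hold in memory comfortably, setting a larger but finite limit with setMaxStringLength(int) may be the safer choice.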

Answer 2:

Try calling the Apache Tika parser API directly. It also works for large PDFs.

Sample:

        InputStream input = new FileInputStream("sample.pdf");
        ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
        Metadata metadata = new Metadata();
        new PDFParser().parse(input, handler, metadata, new ParseContext());
        String plainText = handler.toString();
        System.out.println(plainText);
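
For completeness, here is a self-contained variant of the sample with its imports and a try-with-resources block so the stream gets closed. The class name PdfTextExtractor, the constant NO_WRITE_LIMIT and the method extractText are illustrative names, not part of Tika. It passes -1 as the write limit, which BodyContentHandler documents as "no limit"; Integer.MAX_VALUE as in the sample should behave the same in practice. It assumes the Tika PDF parser module is on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class PdfTextExtractor {

    // -1 tells BodyContentHandler not to apply any write limit.
    static final int NO_WRITE_LIMIT = -1;

    // Hypothetical helper: parses the PDF at the given path and returns its body text.
    static String extractText(Path pdf) throws IOException, SAXException, TikaException {
        BodyContentHandler handler = new BodyContentHandler(NO_WRITE_LIMIT);
        // try-with-resources closes the stream even if parsing fails
        try (InputStream input = Files.newInputStream(pdf)) {
            new PDFParser().parse(input, handler, new Metadata(), new ParseContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // "sample.pdf" is the file name used in the sample above.
        Path pdf = Paths.get(args.length > 0 ? args[0] : "sample.pdf");
        String plainText = extractText(pdf);
        System.out.println("Extracted " + plainText.length() + " characters");
    }
}
```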

Source: https://stackoverflow.com/questions/19074191/extract-text-from-a-large-pdf-with-tika

Tags

java

pdf

extract

apache-tika
