问题
I try to extract text from a large pdf, but i only get the first pages, i need all text to will be passed to a string variable.
This is the code
public class ParsePDF {
public static void main(String args[]) throws Exception {
try {
File file = new File("C:/vlarge.pdf");
String content = new Tika().parseToString(file);
System.out.println("The Content: " + content);
}
catch (Exception e) {
e.printStackTrace();
}
}
}
回答1:
From the Javadocs:
To avoid unpredictable excess memory use, the returned string contains only up to getMaxStringLength() first characters extracted from the input document. Use the setMaxStringLength(int) method to adjust this limitation.
Calling setMaxStringLength(-1)
will disable this limit.
回答2:
Try the apache api TIKA. Its working for large PDF's also.
Sample :
InputStream input = new FileInputStream("sample.pdf");
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
new PDFParser().parse(input, handler, metadata, new ParseContext());
String plainText = handler.toString();
System.out.println(plainText);
来源:https://stackoverflow.com/questions/19074191/extract-text-from-a-large-pdf-with-tika