I am using PdfBox in Java to extract text from PDF files. Some of the input files provided are not valid and PDFTextStripper halts on these files. Is there a clean way to ch
In general you can do it like this: a PDF file of any version should start with %PDF- and end with a %%EOF marker, so you can check for both as shown below.
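import java.nio.charset.StandardCharsets;

public static boolean is_pdf(byte[] data) {
    // Validate the input before touching it; "%PDF-" plus "%%EOF" needs at least 10 bytes.
    if (data == null || data.length < 10) {
        return false;
    }
    // Header check: the data must start with "%PDF-".
    if (data[0] != 0x25 || // %
        data[1] != 0x50 || // P
        data[2] != 0x44 || // D
        data[3] != 0x46 || // F
        data[4] != 0x2D) { // -
        return false;
    }
    // Trailer check: look at the last 7 bytes so a trailing "\n" or "\r\n" after %%EOF is tolerated.
    // ISO-8859-1 maps every byte to exactly one char, so byte offsets and char offsets agree.
    String tail = new String(data, data.length - 7, 7, StandardCharsets.ISO_8859_1);
    return tail.contains("%%EOF");
}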
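
If the files are large, you may not want to load each one fully into a byte array just to look at the first and last few bytes. Here is a rough sketch of the same header/trailer check that reads only the start and end of the file. The class name PdfSniffer, the method name looksLikePdf and the 1024-byte tail window are my own choices for illustration, not anything from PDFBox. Keep in mind this is only a heuristic: a file that passes can still be corrupt inside, so it is worth wrapping the PDFTextStripper call in a try/catch as well.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

final class PdfSniffer {
    // Assumption: how many trailing bytes to scan for %%EOF (illustrative value).
    private static final int TAIL_WINDOW = 1024;

    // Returns true if the file starts with "%PDF-" and has "%%EOF" near its end.
    static boolean looksLikePdf(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long length = raf.length();
            if (length < 10) {
                return false; // too short to hold "%PDF-" plus "%%EOF"
            }

            // Read the first 5 bytes and compare them with the PDF header.
            byte[] head = new byte[5];
            raf.readFully(head);
            if (!"%PDF-".equals(new String(head, StandardCharsets.ISO_8859_1))) {
                return false;
            }

            // Read at most TAIL_WINDOW bytes from the end and search them for the marker.
            int tailSize = (int) Math.min(TAIL_WINDOW, length);
            byte[] tail = new byte[tailSize];
            raf.seek(length - tailSize);
            raf.readFully(tail);
            return new String(tail, StandardCharsets.ISO_8859_1).contains("%%EOF");
        }
    }
}
```

You would call it as PdfSniffer.looksLikePdf(Paths.get("input.pdf")) before handing the file to PDFBox, and skip or log the files for which it returns false. The wider tail window is a deliberate trade-off: it accepts files with some trailing whitespace or junk after %%EOF, at the cost of being a bit more permissive than the 7-byte check above.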