Extracting Hebrew text from a PDF using Apache PDFBox does not return all characters
Question

The code below extracts Hebrew text from http://www.language-brain.com/journal/docs/Gvion_Friedmann_LanguageBrain7_frigvi.pdf, but the output is missing the Hebrew character "ן". All other text seems to be extracted correctly. Any ideas?

    public class TestPDFUtil {

        @Test
        public void testHebrewPDF() throws Exception {
            String url = "http://www.language-brain.com/journal/docs/Gvion_Friedmann_LanguageBrain7_frigvi.pdf";
            String text = PDFUtil.readPDF(url);
            System.out.println(text);
            Assert.assertTrue(text.indexOf("זיכרון