Text extracted by PDFBox does not contain international (non-English) characters

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-24 02:22:24

Question


I'm using Apache PDFBox to extract text from several PDF files. The files are in Polish and contain Polish characters. Unfortunately, when I print the extracted text, I get ? (question marks) in place of those characters.


Answer 1:


Assuming your extracted text is stored in a String s, you are presumably printing it like this:

System.out.println(s);

I suggest printing through a PrintStream that encodes its output as UTF-8, so the Polish characters come out properly:

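// Wrap System.out so that characters are encoded as UTF-8 (autoflush off).
// This constructor throws the checked java.io.UnsupportedEncodingException.
java.io.PrintStream p = new java.io.PrintStream(System.out, false, "UTF-8");
p.println(s);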

This should work, and ? should no longer appear in the printed text, provided the console or terminal you print to displays UTF-8.
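To see where the question marks come from, here is a minimal sketch. It assumes the cause is the charset of the output stream and not the extraction itself. The class EncodingDemo and the helpers printWith and roundTrip are hypothetical names made up for this illustration, and the sample word stands in for real PDFBox output. The sketch prints into an in-memory buffer, so the result does not depend on your console settings.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public class EncodingDemo {

    // Hypothetical helper: prints s through a PrintStream that uses the named
    // charset and returns the bytes that reached the underlying stream.
    public static byte[] printWith(String s, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream p = new PrintStream(sink, false, charsetName);
            p.print(s);
            p.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    // Hypothetical helper: decodes the printed bytes with the same charset,
    // which is what a console configured for that charset would display.
    public static String roundTrip(String s, String charsetName) {
        return new String(printWith(s, charsetName), Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Sample Polish text (z, a, then four non-ASCII letters), written with
        // Unicode escapes so the source file encoding does not matter.
        String s = "za\u017C\u00F3\u0142\u0107";

        // A charset without Polish letters replaces each of them with '?'.
        System.out.println(roundTrip(s, "US-ASCII"));         // za????

        // UTF-8 can represent every character, so nothing is lost.
        System.out.println(roundTrip(s, "UTF-8").equals(s));  // true
    }
}
```

If the text survives the UTF-8 round trip here but still shows ? on screen, the remaining suspect is the console encoding, not the String that PDFBox returned.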



Source: https://stackoverflow.com/questions/11496395/text-extracted-by-pdfbox-does-not-contain-international-non-english-characters
