How to get the Unicode of characters from a PDF using Java and PDFBox

Submitted by 与世无争的帅哥 on 2019-12-02 00:23:11

Question


I am using Apache PDFBox and Java to parse PDFs and extract all the information from them. Text extraction works fine for English only. For other languages I get only some special characters. For example, extracting the Arabic character ش gives the string :"? when printed. It works fine when I change the "Region and language" setting of my computer from English to Arabic. So I think extracting the Unicode of the characters will solve this problem. Please help me get the Unicode of the characters from the PDF, or suggest another way to solve this problem.


Answer 1:


Try changing the Java system locale. From your Java program, this should be equivalent to changing the OS setting.
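A related option that does not depend on the machine's locale is to choose the output encoding explicitly. The sketch below is an assumption about the likely cause, not a confirmed diagnosis: it supposes the extracted String is already correct and that the "?" appears only when the text is encoded for a console whose charset has no Arabic letters. The class and method names (Utf8OutputSketch, utf8) are made up for illustration, and the string literal stands in for whatever PDFBox returns.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

class Utf8OutputSketch {

    /** Wraps a stream so that text is always encoded as UTF-8, whatever the OS locale is. */
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform is required to support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from the PDF (assumption: it is the Arabic letter sheen).
        String extracted = "\u0634";

        // Encoding with a charset that lacks the character substitutes '?'.
        byte[] ascii = extracted.getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII));

        // Encoding as UTF-8 keeps the character; the terminal must also display UTF-8.
        utf8(System.out).println(extracted);
    }
}
```

If the second line still shows up garbled, the terminal itself is probably not decoding UTF-8; writing the same text to a file through a UTF-8 stream and opening it in an editor is a quick way to check.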




Answer 2:


http://grepcode.com/file/repo1.maven.org/maven2/org.apache.pdfbox/pdfbox/1.6.0/org/apache/pdfbox/util/PDFText2HTML.java

The private method String escape(String chars) in that class converts characters to Unicode.
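As a rough illustration of the same idea, here is a minimal standalone sketch. It is not the PDFBox source: CodePointSketch and both method names are hypothetical, and the input is assumed to be a String that has already been extracted. One method lists each code point in U+XXXX notation, the other replaces everything outside printable ASCII with a decimal numeric character reference.

```java
class CodePointSketch {

    /** Lists every code point of the text in U+XXXX notation, separated by spaces. */
    static String toUnicodeNotation(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            // Advance by 2 chars for characters outside the Basic Multilingual Plane.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    /** Replaces characters outside printable ASCII with references such as &#1588; */
    static String toNumericEntities(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp < 32 || cp > 126) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String extracted = "a\u0634"; // stand-in for text extracted from the PDF
        System.out.println(toUnicodeNotation(extracted)); // U+0061 U+0634
        System.out.println(toNumericEntities(extracted)); // a&#1588;
    }
}
```

Both outputs are plain ASCII, so they print the same way regardless of the console encoding. The sketch does not escape &, < or >, so the second method is not a complete HTML escaper.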



Source: https://stackoverflow.com/questions/12577092/how-to-get-unicode-of-the-characters-from-pdf-using-java-and-pdfbox
