HtmlCleaner: handling Spanish characters

Submitted by 孤者浪人 on 2019-12-06 05:30:00

Question


I am using the HtmlCleaner library to parse/convert HTML files in Java.

It does not seem to handle Spanish characters such as 'ÁáÉéÍíÑñÓóÚúÜü'.

Is there any property which I can set in HtmlCleaner for handling this or any other solution? Here's the code I'm using to invoke it:

CleanerProperties props = new CleanerProperties();
props.setRecognizeUnicodeChars(true);
java.io.File file = new java.io.File("C:\\example.html");
TagNode tagNode = new HtmlCleaner(props).clean(file);

Answer 1:


Unless you specify one, HtmlCleaner reads the file using the JVM's default character set. On Windows this will usually be Cp1252 (windows-1252), not UTF-8, which is probably where it's going wrong.
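
To make the failure mode concrete, here is a minimal sketch using only the JDK (no HtmlCleaner involved). It assumes the HTML file was saved as UTF-8 and shows what happens when those bytes are decoded as windows-1252. The class and method names are hypothetical, chosen for illustration only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    // Encodes `text` as UTF-8, then decodes those bytes with `assumed`,
    // mimicking a reader that guesses the file's charset.
    static String decodeUtf8BytesAs(String text, Charset assumed) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, assumed);
    }

    public static void main(String[] args) {
        String enye = "\u00f1"; // n with tilde; bytes C3 B1 in UTF-8
        Charset cp1252 = Charset.forName("windows-1252");

        // Wrong charset: each byte becomes its own character,
        // giving "\u00c3\u00b1" (A with tilde, then a plus-minus sign).
        String garbled = decodeUtf8BytesAs(enye, cp1252);
        System.out.println(garbled.equals("\u00c3\u00b1")); // prints true

        // Matching charset: the original character survives.
        String intact = decodeUtf8BytesAs(enye, StandardCharsets.UTF_8);
        System.out.println(intact.equals(enye)); // prints true

        // The charset the JVM falls back to when none is given.
        System.out.println(Charset.defaultCharset());
    }
}
```

If the garbled form matches what you see in the cleaned output, a charset mismatch is the likely cause.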

You can either

  • specify -Dfile.encoding=UTF-8 on the JVM command line
  • use the HtmlCleaner.clean() overload that accepts a character set

    TagNode tagNode = new HtmlCleaner(props).clean(file, "UTF-8");
    

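    (if you've got Google Guava in the project you can use Charsets.UTF_8.name() for the constant; on Java 7 and later, StandardCharsets.UTF_8.name() does the same without the extra dependency)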

  • use the HtmlCleaner.clean() overload that accepts a Reader (for example an InputStreamReader) which you've already constructed with the correct character set.
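
For the last option, the following is a rough sketch of building a reader with an explicit charset, assuming the file is saved as UTF-8. It uses only the JDK so it runs without HtmlCleaner on the classpath; the HtmlCleaner call is left as a comment. The class and method names (Utf8ReaderDemo, openUtf8, roundTrip) are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8ReaderDemo {
    // Opens the file with an explicit UTF-8 decoder, so the result
    // does not depend on the JVM's default charset.
    static Reader openUtf8(Path htmlFile) {
        try {
            return Files.newBufferedReader(htmlFile, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes `html` to a temporary file as UTF-8 and reads it back
    // through openUtf8, to check that the characters survive.
    static String roundTrip(String html) {
        try {
            Path tmp = Files.createTempFile("example", ".html");
            try {
                Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
                try (Reader reader = openUtf8(tmp)) {
                    // With HtmlCleaner available, the reader would be handed over here instead:
                    // TagNode tagNode = new HtmlCleaner(props).clean(reader);
                    StringBuilder sb = new StringBuilder();
                    char[] buf = new char[4096];
                    int n;
                    while ((n = reader.read(buf)) != -1) {
                        sb.append(buf, 0, n);
                    }
                    return sb.toString();
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<p>\u00d1and\u00fa, caf\u00e9</p>";
        System.out.println(roundTrip(html).equals(html)); // prints true
    }
}
```

The design choice is to decode the bytes yourself and give the parser characters, so the -Dfile.encoding setting no longer matters for this file.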



Answer 2:


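You can change UTF-8 to UTF-16 if the file is actually saved in that encoding.

Both encodings can represent every Unicode character, so what matters is that the character set passed to clean() matches the one the file was written in.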



Source: https://stackoverflow.com/questions/10299651/htmlcleaner-handle-spanish-characters
