Question
I am using the HtmlCleaner library to parse and convert HTML files in Java.
It seems that it is not able to handle Spanish characters like 'ÁáÉéÍíÑñÓóÚúÜü'.
Is there any property I can set in HtmlCleaner to handle this, or any other solution? Here's the code I'm using to invoke it:
CleanerProperties props = new CleanerProperties();
props.setRecognizeUnicodeChars(true);
java.io.File file = new java.io.File("C:\\example.html");
TagNode tagNode = new HtmlCleaner(props).clean(file);
Answer 1:
HtmlCleaner uses the JVM's default character set unless you specify one. On Windows this will typically be Cp1252, not UTF-8, which is probably where it's going wrong.
You can either:
- specify -Dfile.encoding=UTF-8 on your JVM start line
- use the HtmlCleaner.clean() overload that accepts a character set:
  TagNode tagNode = new HtmlCleaner(props).clean(file, "UTF-8");
  (if you've got Google Guava in the project you can use Charsets.UTF_8.name() for the constant, since this overload takes the charset name as a String)
- use the HtmlCleaner.clean() overload that accepts an InputStreamReader which you've already constructed with the correct character set.
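To see why the charset given to the reader matters, here is a minimal JDK-only sketch. It assumes the file's bytes are UTF-8 and uses a ByteArrayInputStream as a stand-in for the file on disk; the class and method names (CharsetReaderDemo, readAll) are made up for illustration and are not part of HtmlCleaner.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetReaderDemo {
    // Drains the stream through an InputStreamReader built with an explicit charset.
    static String readAll(InputStream in, Charset charset) {
        try (Reader reader = new InputStreamReader(in, charset)) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes for N-tilde and A-acute (upper and lower case),
        // so the demo does not depend on the encoding of this source file.
        String html = "<p>\u00D1\u00F1\u00C1\u00E1</p>";
        // Assumption: these bytes stand in for a UTF-8 encoded file on disk.
        byte[] utf8Bytes = html.getBytes(StandardCharsets.UTF_8);

        String asUtf8 = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        String asCp1252 = readAll(new ByteArrayInputStream(utf8Bytes), Charset.forName("windows-1252"));

        System.out.println(asUtf8.equals(html));   // true: charset matches the bytes
        System.out.println(asCp1252.equals(html)); // false: accented characters are garbled

        // With HtmlCleaner on the classpath, a Reader built the same way would be
        // handed to the cleaner (not run here):
        // new HtmlCleaner(props).clean(new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8));
    }
}
```

The same bytes decode correctly only when the reader's charset matches the encoding the file was saved in, which is what the second and third options above make explicit.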
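Answer 2:
You can change UTF-8 to UTF-16 if the file is actually saved as UTF-16.
Both encodings can represent every Unicode character, so what matters is that the charset you pass matches the one the file was written in.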
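As a quick check of how UTF-8 and UTF-16 compare for these characters, here is a small JDK-only sketch. The class and method names (EncodingRoundTrip, transcode) are hypothetical and only serve the illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingRoundTrip {
    // Encodes text to bytes with one charset, then decodes those bytes with another.
    static String transcode(String text, Charset writeAs, Charset readAs) {
        return new String(text.getBytes(writeAs), readAs);
    }

    public static void main(String[] args) {
        // Unicode escapes for A-acute, N-tilde and U-diaeresis in both cases.
        String spanish = "\u00C1\u00E1\u00D1\u00F1\u00DC\u00FC";

        // Both encodings round-trip the text when read back with the same charset.
        System.out.println(transcode(spanish, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(spanish));   // true
        System.out.println(transcode(spanish, StandardCharsets.UTF_16, StandardCharsets.UTF_16).equals(spanish)); // true

        // Reading UTF-8 bytes as UTF-16 does not recover the text.
        System.out.println(transcode(spanish, StandardCharsets.UTF_8, StandardCharsets.UTF_16).equals(spanish));  // false
    }
}
```

So switching to UTF-16 only helps when the HTML file itself was saved as UTF-16.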
Source: https://stackoverflow.com/questions/10299651/htmlcleaner-handle-spanish-characters