Jsoup having problems with special HTML symbols, &lsquo; &mdash; etc

Submitted by 做~自己de王妃 on 2019-12-06 03:32:05

Question


I have some HTML (as a String) that I am putting through Jsoup just so I can add something to all href and src attributes, and that works fine. However, I'm noticing that for some special HTML characters, Jsoup is converting them from, say, &ldquo; to the actual character “. I output the value before and after and I see that change.

Before:

THIS &mdash; IS A &ldquo;TEST&rdquo;. 5 &gt; 4. trademark: &#153;

After:

THIS — IS A “TEST”. 5 &gt; 4. trademark: ?

What the heck is going on? I was specifically converting those special characters to their HTML entities before any Jsoup stuff to avoid this. The quotes changed to the actual quote characters, the greater-than stayed the same, and the trademark changed into a question mark. Aaaaaaa.

FYI, my Jsoup code is doing:

Document document = Jsoup.parse(fileHtmlStr);
//some stuff
String modifiedFileHtmlStr = document.html();

Thanks for any help!


Answer 1:


The code below produces output similar to the input markup. It changes the escape mode so that more characters are written as named entities, and it sets the output charset to ASCII so that the TM sign is escaped for systems which don't support Unicode.

The output:

<p>THIS &mdash; IS A &ldquo;TEST&rdquor;&period; 5 &gt; 4&period; trademark&colon; &#x99;</p>

The code:

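import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;

public class EscapeDemo {
    public static void main(String[] args) {
        Document doc = Jsoup.parse(
            "<p>THIS &mdash; IS A &ldquo;TEST&rdquo;. 5 &gt; 4. trademark: &#153;</p>");

        Document.OutputSettings settings = doc.outputSettings();

        settings.prettyPrint(false);                        // do not re-indent the markup
        settings.escapeMode(Entities.EscapeMode.extended);  // use the extended set of named entities
        settings.charset("ASCII");                          // escape characters that ASCII cannot represent

        // doc.html() serializes the whole document, including the wrapper elements Jsoup adds
        String modifiedFileHtmlStr = doc.html();

        System.out.println(modifiedFileHtmlStr);
    }
}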
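
Why the ASCII charset setting matters can be illustrated without Jsoup. The following is a minimal sketch using only the Java standard library. The class CharsetEscapeSketch and its two methods are hypothetical names made up for this illustration, not Jsoup API, and Jsoup's real escaping logic is more involved. It shows that encoding a character into a charset that cannot represent it yields ?, which is one common way a TM sign ends up as a question mark, and that writing a numeric character reference instead keeps the text pure ASCII.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Jsoup.
class CharsetEscapeSketch {

    // Replaces every code point the target charset cannot encode with a
    // hexadecimal numeric character reference such as &#x2122;.
    static String escapeUnencodable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            String ch = new String(Character.toChars(codePoint));
            if (encoder.canEncode(ch)) {
                out.append(ch);
            } else {
                out.append("&#x").append(Integer.toHexString(codePoint)).append(';');
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    // Encodes the text to bytes and decodes it again, roughly what happens
    // when text is written out in that charset and read back.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String text = "trademark: \u2122"; // \u2122 is the TM sign

        // ASCII has no TM sign, so the encoder substitutes '?'
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII));         // prints "trademark: ?"

        // Escaping first keeps the information in a pure-ASCII form
        System.out.println(escapeUnencodable(text, StandardCharsets.US_ASCII)); // prints "trademark: &#x2122;"
    }
}
```

Whether this is what happened in the question depends on how the string was printed or saved, which the question does not say.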


Source: https://stackoverflow.com/questions/18919731/jsoup-having-problems-with-special-html-symbols-lsquo-mdash-etc
