Jsoup unescapes special characters

Submitted by 折月煮酒 on 2019-12-17 21:03:20

Question


I'm using Jsoup to remove all the images from an HTML page. I receive the page in an HTTP response, which also specifies the content charset.

The problem is that Jsoup unescapes some special characters.

For example, for the input:

<html><head></head><body><p>isn&rsquo;t</p></body></html>

After running

String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>";
Document doc = Jsoup.parse(check);
System.out.println(doc.outerHtml());

I get:

<html><head></head><body><p>isn’t</p></body></html>

I want to avoid changing the HTML in any way other than removing the images.

By using the command:

doc.outputSettings().prettyPrint(false).charset("ASCII").escapeMode(EscapeMode.extended);

I do get the correct output, but I'm sure there are cases where that charset won't work. I just want to use the charset specified in the HTTP header, and I'm afraid this setting will change my document in ways I can't predict. Is there a cleaner way to remove the images without inadvertently changing anything else?

Thank you!


Answer 1:


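Here is a workaround that uses no charset other than the one specified in the HTTP header. It masks every entity reference before parsing, so Jsoup has nothing to unescape, and restores the references in the serialized output.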

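import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;

// Mask each entity reference (&name;) as **name; before parsing.
String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>".replaceAll("&([^;]+?);", "**$1;");

Document doc = Jsoup.parse(check);

doc.outputSettings().prettyPrint(false).escapeMode(EscapeMode.extended);

// Turn the **name; placeholders back into &name; in the serialized HTML.
System.out.println(doc.outerHtml().replaceAll("\\*\\*([^;]+?);", "&$1;"));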

OUTPUT

<html><head></head><body><p>isn&rsquo;t</p></body></html>
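The mask-and-restore step can also be isolated into a small helper. The following is only a sketch under a few assumptions that are not part of the answer: the class name EntityMask, the marker string __ENTITY__ (which must not occur in the input HTML), and a stricter pattern that matches only well-formed named, decimal and hexadecimal references instead of everything between & and ;.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup.
class EntityMask {
    // Body of a named (rsquo), decimal (#8217) or hexadecimal (#x2019) reference.
    private static final String BODY = "(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*)";

    // Assumed marker; choose any ASCII text that cannot appear in the page.
    private static final String MARK = "__ENTITY__";

    private static final Pattern ENTITY = Pattern.compile("&" + BODY + ";");
    private static final Pattern MASKED = Pattern.compile(Pattern.quote(MARK) + BODY + ";");

    // Replaces "&name;" with "__ENTITY__name;" so an HTML parser sees plain text.
    static String mask(String html) {
        return ENTITY.matcher(html).replaceAll(Matcher.quoteReplacement(MARK) + "$1;");
    }

    // Reverses mask() on the serialized output.
    static String unmask(String html) {
        return MASKED.matcher(html).replaceAll("&$1;");
    }

    public static void main(String[] args) {
        String html = "<p>isn&rsquo;t &#8217; &#x2019; a & b;</p>";
        String masked = mask(html);
        // Parse `masked`, remove the images and serialize the document here.
        System.out.println(masked);
        System.out.println(unmask(masked).equals(html));
    }
}
```

A bare ampersand such as the one in "a & b;" is left untouched by this pattern, so only real references are masked. The marker stays pure ASCII so that it survives serialization whatever output charset is in effect.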

DISCUSSION

I wish there was a solution in Jsoup's API - @dlv

Using Jsoup's API would require you to write a custom NodeVisitor, which would mean (re)inventing code that already exists inside Jsoup. The custom NodeVisitor would emit an HTML escape sequence instead of the Unicode character.

Another option would be to write a custom character encoder. The default UTF-8 encoder can encode the character that &rsquo; stands for (’, U+2019), which is why Jsoup writes the character itself instead of preserving the original escape sequence in the final HTML.
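The encoder point can be checked with the standard java.nio API alone. This is a minimal sketch, not Jsoup's own code; the class name EncoderCheck is made up for illustration. It shows that UTF-8 can represent U+2019 directly while US-ASCII cannot, which fits the question's observation that switching the output charset to ASCII brought the escape back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Jsoup.
class EncoderCheck {
    // True when the charset can represent the character directly.
    static boolean canEncode(Charset charset, char c) {
        return charset.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char rsquo = '\u2019'; // the character that &rsquo; stands for
        System.out.println("UTF-8:    " + canEncode(StandardCharsets.UTF_8, rsquo));
        System.out.println("US-ASCII: " + canEncode(StandardCharsets.US_ASCII, rsquo));
    }
}
```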

Either option is a significant coding effort. Ultimately, an enhancement could be added to Jsoup to let us choose how characters are written in the final HTML: as a hexadecimal escape (&#xAB;), a decimal escape (&#151;), the original escape sequence (&rsquo;), or the encoded character itself (which is what happens in your post).
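To make the numeric styles concrete, here is a rough sketch of what such a choice could look like when applied to the text of a single node. NumericEscaper and its Style enum are invented names and do not exist in Jsoup. The sketch covers only the hexadecimal and decimal forms; recovering the original named sequence would need extra bookkeeping that is not shown.

```java
// Hypothetical helper, not part of Jsoup.
class NumericEscaper {
    enum Style { DECIMAL, HEX }

    // Replaces every non-ASCII code point with a numeric character reference.
    // Intended for text content only; markup characters are not touched.
    static String escapeNonAscii(String text, Style style) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else if (style == Style.HEX) {
                out.append("&#x").append(Integer.toHexString(cp).toUpperCase()).append(';');
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("isn\u2019t", Style.DECIMAL)); // isn&#8217;t
        System.out.println(escapeNonAscii("isn\u2019t", Style.HEX));     // isn&#x2019;t
    }
}
```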



Source: https://stackoverflow.com/questions/34368908/jsoup-unescapes-special-characters
