jsoup - stop jsoup from turning quotes into &quot;

Posted by 若如初见 on 2020-01-16 04:12:36

Question


When I parse local HTML files, jsoup changes the quotes inside an anchor element's attribute to &quot;, obscuring my HTML.

Let's assume I want to change the value "one" to "two" in the following piece of HTML:

<div class="pg2-txt1">
  <a class="foo" appareantly_a_javascript_statement='{"targetId":"pg1-magn1", "ordinal":1}'>one</a>
</div>

What I get is:

<div class="pg2-txt1">
  <a class="foo" appareantly_a_javascript_statement="{&quot;targetId&quot;:&quot;pg1-magn1&quot;, &quot;ordinal&quot;:1}">two</a>
</div>

The quotes inside the anchor element are needed. My code looks like this now:

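import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
File input = new File("D:/javatest/page02.html");
Document doc = Jsoup.parse(input, "UTF-8");
Element div = doc.select("div.pg2-txt1").first(); // anchor element is only identifiable by its parent <div> class
div.child(0).text("two"); // the actual anchor element; replaces its text "one"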

I tried

doc.outputSettings().prettyPrint(false);

with no success.

Can I achieve this with jsoup? Or do I have to use a different parser, and if so, what would that look like?

Thank you very much in advance.


Answer 1:


According to the HTML spec, jsoup is behaving correctly:

By default, SGML requires that all attribute values be delimited using either double quotation marks (ASCII decimal 34) or single quotation marks (ASCII decimal 39). Single quote marks can be included within the attribute value when the value is delimited by double quote marks, and vice versa. Authors may also use numeric character references to represent double quotes (&#34;) and single quotes (&#39;). For double quotes authors can also use the character entity reference &quot;.

Note the last sentence!

Basically, that means the other software that needs literal double quotes in the appareantly_a_javascript_statement attribute is parsing its value incompletely: a complete parser would decode &quot; back to " before using the value.
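
To make that equivalence concrete, here is a minimal sketch that parses both forms of the snippet and compares the decoded attribute values. It uses the JDK's built-in XML parser as a stand-in for "a complete parser", which only works because these particular fragments happen to be well-formed XML; the class name QuoteEquivalence and the helper attributeValue are made up for this illustration and are not part of jsoup.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class QuoteEquivalence {

    // Hypothetical helper: parses a well-formed fragment and returns the
    // decoded value of the named attribute on its first <a> element.
    static String attributeValue(String markup, String attributeName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
        Element anchor = (Element) doc.getElementsByTagName("a").item(0);
        return anchor.getAttribute(attributeName);
    }

    public static void main(String[] args) throws Exception {
        String name = "appareantly_a_javascript_statement";

        // Form 1: single-quoted attribute with literal double quotes inside.
        String original = "<div class=\"pg2-txt1\"><a class=\"foo\" " + name
                + "='{\"targetId\":\"pg1-magn1\", \"ordinal\":1}'>one</a></div>";

        // Form 2: double-quoted attribute with &quot; references, as jsoup writes it.
        String escaped = "<div class=\"pg2-txt1\"><a class=\"foo\" " + name
                + "=\"{&quot;targetId&quot;:&quot;pg1-magn1&quot;, &quot;ordinal&quot;:1}\">one</a></div>";

        String fromOriginal = attributeValue(original, name);
        String fromEscaped = attributeValue(escaped, name);

        System.out.println(fromOriginal);                      // the decoded value
        System.out.println(fromOriginal.equals(fromEscaped));  // whether both forms agree
    }
}
```

Both calls should yield the same decoded string, which is the point of the spec passage: only a consumer that reads the raw markup text sees a difference.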

I see two solutions:

1) Modify the function that interprets the appareantly_a_javascript_statement value

I can't help you there, since I don't know where that is done.

2) Change the Jsoup output via regular expressions.

This is pretty hacky...

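String html = doc.outerHtml();
// Switch attribute values that start with { from double-quote to single-quote delimiters.
html = html.replaceAll("(=\"\\{)([^\"]+)(\")", "='{$2'");
// Each pass turns the last remaining &quot; of every single-quoted value back into
// a literal double quote; repeat until a pass no longer changes anything.
boolean changed;
do {
    int oldLength = html.length();
    html = html.replaceAll("(=')([^']+)(&quot;)([^']+)(')", "$1$2\"$4$5");
    changed = html.length() != oldLength;
} while (changed);
System.out.print(html);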
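
As a variation on the same hack, the rewrite can also be done in a single pass with a replacement callback instead of a loop. The sketch below is not what the answer above does: it swaps in Matcher.replaceAll with a function argument (available from Java 9), and the class name AttributeQuoteFixer and method preferSingleQuotes are hypothetical. It is just as regex-based and therefore just as fragile: it assumes the serialized HTML contains ="..." only as attribute syntax, not inside text or scripts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeQuoteFixer {

    // Matches ="value" where the value holds no raw double or single quote,
    // so re-delimiting it with single quotes cannot break the markup.
    private static final Pattern DOUBLE_QUOTED_VALUE = Pattern.compile("=\"([^\"']*)\"");

    // Hypothetical helper: rewrites every double-quoted attribute value that
    // contains &quot; as a single-quoted value with literal double quotes.
    static String preferSingleQuotes(String html) {
        Matcher matcher = DOUBLE_QUOTED_VALUE.matcher(html);
        return matcher.replaceAll(match -> {
            String value = match.group(1);
            if (!value.contains("&quot;")) {
                // Nothing to unescape: keep the attribute exactly as it was.
                return Matcher.quoteReplacement(match.group());
            }
            String rewritten = "='" + value.replace("&quot;", "\"") + "'";
            // quoteReplacement keeps any $ or \ in the value from being
            // interpreted as a group reference.
            return Matcher.quoteReplacement(rewritten);
        });
    }

    public static void main(String[] args) {
        String html = "<a class=\"foo\" appareantly_a_javascript_statement="
                + "\"{&quot;targetId&quot;:&quot;pg1-magn1&quot;, &quot;ordinal&quot;:1}\">two</a>";
        System.out.println(preferSingleQuotes(html));
    }
}
```

In real use the input would be doc.outerHtml() from the code above. Attributes without &quot;, such as class="foo", are left untouched, and values that already contain a single quote are skipped rather than risk producing broken markup.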


Source: https://stackoverflow.com/questions/24145426/jsoup-stop-jsoup-from-making-quotes-into-amp
