Keep XML entities in output (jSoup)

陌路散爱 submitted on 2019-12-24 01:45:28

Question


I'm using jsoup to do some XML processing. The problem is that it replaces XML numeric entities, e.g. &#187;, with HTML named entities: &raquo;

How can I keep the original (XML) entities?

Groovy script:

import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Entities
import org.jsoup.parser.Parser

String HTML_STRING = '''
    <html>
    <div></div>
    <div>Some text &#187;</div>
    </html>
  '''

Document doc = Jsoup.parse(new ByteArrayInputStream(HTML_STRING.getBytes("UTF-8")), "UTF-8", "", Parser.xmlParser())
doc.outputSettings().charset("UTF-8")
doc.outputSettings().escapeMode(Entities.EscapeMode.base)


println doc.toString()

Result:

<html> 
 <div></div> 
 <div>
  Some text &raquo;
 </div> 
</html>

If I use Entities.EscapeMode.xhtml the result is:

<html> 
 <div></div> 
 <div>
  Some text »
 </div> 
</html>

Thanks.


Answer 1:


You want to use a combination of EscapeMode.xhtml (which is the default if you use the XML parser, not the HTML parser), and ascii as the output character set.

The default output charset is UTF-8, and jsoup will prefer to not use entity escapes if the output charset supports the character directly (because why waste CPU and bandwidth with unnecessary escapes).

If you change the output charset to ascii using Document.OutputSettings.charset("ascii") you'll get the output you want.

You also probably want to set the output syntax to XML if you are working with XML, as otherwise the HTML parser will try to make the output conform to HTML and can munge your XML DOM tree.
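
As a rough sketch of the settings described above (not taken from the original answer), the script below serializes the same document twice, once with a UTF-8 output charset and once with ascii. The @Grab line and its version number are assumptions added so the script is self-contained; drop it if jsoup is already on your classpath. Depending on the jsoup version, the escaped character may be written as a decimal (&#187;) or hexadecimal (&#xbb;) numeric reference; both are equivalent in XML.

```groovy
// Assumption: jsoup is fetched with Grape; the version is illustrative only.
@Grab('org.jsoup:jsoup:1.15.3')
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Entities
import org.jsoup.parser.Parser

String xml = '<root><div>Some text &#187;</div></root>'

// Parse with the XML parser so no HTML tree fix-ups are applied.
Document doc = Jsoup.parse(xml, '', Parser.xmlParser())

// outputSettings() returns the document's live settings object.
Document.OutputSettings settings = doc.outputSettings()
settings.escapeMode(Entities.EscapeMode.xhtml)      // only XML-safe named entities
settings.syntax(Document.OutputSettings.Syntax.xml) // serialize as XML, not HTML

// UTF-8 can represent U+00BB, so jsoup writes the character itself.
settings.charset('UTF-8')
String utf8Out = doc.toString()

// ASCII cannot represent U+00BB, so jsoup falls back to a numeric reference.
settings.charset('ascii')
String asciiOut = doc.toString()

println utf8Out
println asciiOut
```

The only difference between the two outputs should be how » is written, which is the behaviour the answer relies on.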

(Source: author of jsoup)



Source: https://stackoverflow.com/questions/19656463/keep-xml-entities-in-output-jsoup
