Use jsoup to parse XML - prevent jsoup from “cleaning” tags

Use jsoup to parse XML - prevent jsoup from “cleaning” <link> tags

问题

In most case, I have no problem with using jsoup to parse XML. However, if there are <link> tags in the XML document, jsoup will change <link>some text here</link> to <link />some text here. This makes it impossible to extract text inside the <link> tag using CSS selector.

So how to prevent jsoup from "cleaning" <link> tags?

回答1:

In jsoup 1.6.2 I have added an XML parser mode, which parses the input as-is, without applying the HTML5 parse rules (contents of element, document structure, etc). This mode will keep text in a <link> tag, and allow multiples of it, etc.

Here's an example:

String xml = "<link>One</link><link>Two</link>";
Document xmlDoc = Jsoup.parse(xml, "", Parser.xmlParser());

Elements links = xmlDoc.select("link");
System.out.println("Link text 1: " + links.get(0).text());
System.out.println("Link text 2: " + links.get(1).text());

Returns:

Link text 1: One
Link text 2: Two

回答2:

Do not store any text inside <link> element - it's invalid. If you need extra information, keep it inside HTML5 data-* attributes. I'm sure jsoup won't touch it.

<link rel="..." data-city="Warsaw" />

回答3:

There can be a workaround for this. Before passing XML to jsoup. Transform XML file to replace all with some dummy tag say and do what you want to do.

来源：https://stackoverflow.com/questions/6722307/use-jsoup-to-parse-xml-prevent-jsoup-from-cleaning-link-tags

标签

java

xml-parsing

jsoup

link-tag

易学教程内所有资源均来自网络或用户发布的内容，如有违反法律规定的内容欢迎反馈！
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!