How to preserve case in jsoup parsing?

Submitted by 家住魔仙堡 on 2021-01-27 07:21:41

Question


I am using jsoup to parse some HTML content. After parsing, camel-cased attribute names are converted to lowercase; for example, <svg viewBox='XXXX'> becomes <svg viewbox='XXXX'>.

Can someone suggest how I can preserve case when parsing HTML content with jsoup 1.8.1?


Answer 1:


I just released jsoup 1.10.1, which includes support for preserving tag and/or attribute case. You control it with ParseSettings. By default, the HTML parser continues to normalize tag and attribute names to lowercase, and the XML parser preserves them. You can specify these settings when you create the parser.

To use the XML parser (which preserves case by default):

Document doc = Jsoup.parse(xml, baseUrl, Parser.xmlParser());

To use the HTML parser and set it to preserve case:

Parser parser = Parser.htmlParser();
parser.settings(new ParseSettings(true, true)); // tag, attribute preserve case
Document doc = parser.parseInput(html, baseUrl);
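For anyone who wants to try both options end to end, here is a minimal sketch, assuming jsoup 1.10.1 or later is on the classpath. The class name PreserveCaseDemo, its method names, and the sample markup are made up for illustration; ParseSettings.preserveCase is the predefined equivalent of new ParseSettings(true, true).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.ParseSettings;
import org.jsoup.parser.Parser;

// Hypothetical demo class; requires jsoup 1.10.1+ on the classpath.
public class PreserveCaseDemo {

    // HTML parser configured to keep tag and attribute case as written.
    public static Document parseHtmlPreservingCase(String html, String baseUrl) {
        Parser parser = Parser.htmlParser();
        parser.settings(ParseSettings.preserveCase);
        return parser.parseInput(html, baseUrl);
    }

    // XML parser, which preserves case by default.
    public static Document parseAsXml(String xml, String baseUrl) {
        return Jsoup.parse(xml, baseUrl, Parser.xmlParser());
    }

    public static void main(String[] args) {
        // Sample markup chosen for illustration only.
        String markup = "<svg viewBox=\"0 0 10 10\"></svg>";

        Element fromHtml = parseHtmlPreservingCase(markup, "").select("svg").first();
        Element fromXml = parseAsXml(markup, "").select("svg").first();

        // Print each element so the attribute spelling can be inspected.
        System.out.println(fromHtml.outerHtml());
        System.out.println(fromXml.outerHtml());
    }
}
```

Note that the two parsers differ in more than case handling: the HTML parser wraps the input in html, head, and body elements, while the XML parser leaves the input structure as it is.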



Answer 2:


It can be quite difficult to preserve the case of attribute names when parsing a document. As of jsoup 1.8.2, the line responsible for converting all attribute names to lowercase is TokeniserState.java#649, and there is no hook for inserting custom code.

The most you can do is download the sources, modify that line, and build your own copy of the library.

You should also consider whether leaving attribute names unconverted would introduce strange behaviour. It might cause problems with Document.getElementsByAttribute or other dependent functions.
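One way to check that concern is to run the lookups you depend on against a document parsed with case preserved. The sketch below assumes jsoup 1.10.1 or later (it relies on ParseSettings, which is not available in 1.8.x); the class name AttributeLookupCheck, its method, and the sample markup are hypothetical. It looks the attribute up by its exact-case name and also through Attributes.getIgnoreCase, which does not depend on how the name was spelled in the source.

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.ParseSettings;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

// Hypothetical check class; assumes jsoup 1.10.1+ on the classpath.
public class AttributeLookupCheck {

    // Parse with the HTML parser while keeping tag and attribute case.
    public static Document parsePreservingCase(String html) {
        Parser parser = Parser.htmlParser();
        parser.settings(ParseSettings.preserveCase);
        return parser.parseInput(html, "");
    }

    public static void main(String[] args) {
        // Sample markup chosen for illustration only.
        Document doc = parsePreservingCase("<svg viewBox=\"0 0 10 10\"></svg>");

        // Lookup by the exact-case attribute name.
        Elements matches = doc.getElementsByAttribute("viewBox");
        System.out.println("elements with viewBox: " + matches.size());

        // Case-insensitive lookup on the element's attribute collection.
        Element svg = doc.select("svg").first();
        System.out.println("getIgnoreCase: " + svg.attributes().getIgnoreCase("viewbox"));
    }
}
```

If your code passes lowercase names to other lookup methods, it is worth adding those calls to the same check before relying on them.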



Source: https://stackoverflow.com/questions/31400712/how-to-preserve-case-in-jsoup-parsing
