Handling special entities like &nbsp;, &pound; in HtmlCleaner

筅森魡賤 submitted on 2020-01-06 04:33:29

Question


I am using the HtmlCleaner library for HTML content extraction. It works fairly well, but with a few limitations.

It does not handle special characters such as &pound; or quotes. For example, for the URL http://www.basicelegancefurnishings.co.uk/alaska-3-and-2-seater-sofa-setspan-classukmadespan-p-280.html, when I give an XPath to the price, it returns "&pound;" instead of £.

Is there a property that can be set in HtmlCleaner to handle this, or is there another solution?

Thanks

Jitendra


Answer 1:


No, I don't believe HtmlCleaner can do this. However, you can use Apache Commons StringEscapeUtils to "unescape" the html, like this:

StringEscapeUtils.unescapeHtml("&pound;679.00");

will produce £679.00.
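
To make the idea of "unescaping" concrete, here is a minimal standard-library sketch of what such a call does conceptually. It is not how StringEscapeUtils is implemented; the class name EntitySketch and its tiny entity table are hypothetical and cover only a handful of names, whereas a real library handles the full HTML entity set.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: decodes a few named entities plus numeric character references. */
final class EntitySketch {
    // Hypothetical, deliberately tiny table; real libraries cover far more names.
    private static final Map<String, String> NAMED = Map.of(
            "pound", "\u00A3",
            "nbsp", "\u00A0",
            "amp", "&",
            "quot", "\"",
            "lt", "<",
            "gt", ">");

    // Digit counts are capped so every parsed value is a valid Unicode code point.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]{1,6}|#[xX][0-9a-fA-F]{1,5}|[A-Za-z][A-Za-z0-9]*);");

    static String unescape(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                // Hexadecimal reference such as &#xA3;
                int codePoint = Integer.parseInt(body.substring(2), 16);
                replacement = new String(Character.toChars(codePoint));
            } else if (body.startsWith("#")) {
                // Decimal reference such as &#163;
                int codePoint = Integer.parseInt(body.substring(1));
                replacement = new String(Character.toChars(codePoint));
            } else {
                // Named entity; unknown names are left exactly as they were.
                replacement = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("&pound;679.00"));
    }
}
```

The input is scanned in a single pass, so a doubly escaped string such as "&amp;pound;" is decoded only one level, to "&pound;".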

Instead of HtmlCleaner, I would recommend you try JSoup.




Answer 2:


I am using HtmlCleaner 2.2, and org.htmlcleaner.CleanerProperties - setTransSpecialEntitiesToNCR(true) works for me. However, I still have to use string.replace() to turn the remaining non-breaking spaces into ordinary spaces before the HTML content I get is completely right.
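
The extra replacement step mentioned above could look roughly like the following standard-library sketch. It assumes the non-breaking spaces may arrive as the numeric reference &#160;, as the named entity &nbsp;, or as the raw character U+00A0; which form you actually see depends on your HtmlCleaner settings, so this is not confirmed by the answer. The class and method names are hypothetical.

```java
/** Illustrative only: turns the common forms of a non-breaking space into a plain space. */
final class NbspSketch {
    static String normalizeSpaces(String text) {
        return text
                .replace("&#160;", " ")   // numeric character reference
                .replace("&nbsp;", " ")   // named entity
                .replace('\u00A0', ' ');  // raw non-breaking space character
    }

    public static void main(String[] args) {
        System.out.println(normalizeSpaces("&#163;679.00&#160;each"));
    }
}
```

Other entities, such as &#163; for the pound sign, are left untouched, so this step can run after whatever entity handling you already have.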




Answer 3:


This can now be done through org.htmlcleaner.CleanerProperties - setTransSpecialEntitiesToNCR(true).



Source: https://stackoverflow.com/questions/4315979/handling-special-entities-like-nbsp-pound-in-htmlcleaner
