How to strip HTML attributes except “src” and “alt” in Java

Submitted by 自作多情 on 2019-12-10 11:31:22

Question


How do I strip all attributes from HTML tags in a string, except "alt" and "src", using Java?

And further: how do I get the values of all "src" attributes in the string?

:)


Answer 1:


You can:

  • Implement a SAX parser;
  • Build a document with a DOM parser, walk it, prune it, and then convert it back to HTML; or
  • Use an identity transform in XSLT (assuming your HTML is in XHTML format or can be converted to that with, say, JTidy) with some additional cases to remove attributes you don't want.

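Whatever you do, don't try to do it with regular expressions: HTML is not a regular language, and regex-based approaches tend to break on nested or unusual markup.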
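As a rough illustration of the XSLT option, here is a minimal sketch using only the JDK's built-in `javax.xml.transform` API. It assumes the input is already well-formed XHTML (for example after a pass through a cleaner such as JTidy, which is not shown). The class name `AttributeStripXslt` and the method `strip` are made up for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, named for illustration only.
final class AttributeStripXslt {

    // Identity transform plus one overriding template that matches every
    // attribute other than src and alt and emits nothing for it.
    private static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='@*[local-name() != \"src\""
      + " and local-name() != \"alt\"]'/>"
      + "</xsl:stylesheet>";

    /** Returns the markup with all attributes except src and alt removed. */
    static String strip(String xhtml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(
            new StreamSource(new StringReader(xhtml)), new StreamResult(out));
        return out.toString();
    }
}
```

Calling `AttributeStripXslt.strip(...)` on a fragment such as `<p class="x"><img src="a.png" alt="A" width="10"/></p>` should leave only the `src` and `alt` attributes in place. The template with the predicate wins over the identity template because a pattern with a predicate has a higher default priority in XSLT 1.0.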




Answer 2:


OK, I solved this somehow.

I used the HTMLCleaner library to parse the input data into a valid format.

Then I used a DOM parser to iterate over everything and strip all disallowed tags and attributes.
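The answer does not include its code, so the following is only a sketch of what the DOM step might look like, not the author's implementation. It uses the JDK's own DOM parser and assumes the markup has already been cleaned into well-formed XHTML (the HTMLCleaner step is not shown). The class and method names are hypothetical. It also collects the `src` values, which covers the second part of the question.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for illustration only.
final class DomAttributeStripper {

    /** Parses already well-formed markup into a DOM tree. */
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
    }

    /**
     * Walks the tree, removes every attribute except src and alt, and
     * appends each src value to srcValues in document order.
     * With the default (non-namespace-aware) parser, xmlns declarations
     * count as attributes and are removed as well.
     */
    static List<String> strip(Node node, List<String> srcValues) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element element = (Element) node;
            NamedNodeMap attrs = element.getAttributes();
            // Iterate backwards: the map is live and shrinks on removal.
            for (int i = attrs.getLength() - 1; i >= 0; i--) {
                Node attr = attrs.item(i);
                String name = attr.getNodeName();
                if (name.equalsIgnoreCase("src")) {
                    srcValues.add(attr.getNodeValue());
                } else if (!name.equalsIgnoreCase("alt")) {
                    element.removeAttribute(name);
                }
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            strip(child, srcValues);
        }
        return srcValues;
    }

    /** Converts the pruned tree back to markup without an XML declaration. */
    static String serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

A typical call sequence would be `parse`, then `strip(doc, new ArrayList<>())` to prune the tree and obtain the `src` values, then `serialize` to get the markup back. Removing disallowed tags, which the answer also mentions, would need an extra check on the element name and is left out here.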

(and some minor ugly hacks;) )

This was kind of a lot of work.



Source: https://stackoverflow.com/questions/560605/how-to-strip-html-attributes-except-src-and-alt-in-java
