How to strip HTML attributes except “src” and “alt” in Java

Submitted by ﹥>﹥吖頭↗ on 2019-12-06 14:29:39

You can:

  • Implement a SAX parser;
  • Build a document with a DOM parser, walk it, prune it, and then convert it back to HTML; or
  • Use an identity transform in XSLT (assuming your HTML is in XHTML format or can be converted to that with, say, JTidy) with some additional cases to remove attributes you don't want.

Whatever you do, don't try to do it with regular expressions.
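
For the identity-transform option in the list above, here is a minimal sketch using only the JDK's built-in XSLT support. It assumes the input is already well-formed XHTML (for example after a JTidy pass) with lowercase attribute names. The class name `XsltAttributeFilter` and the method `strip` are made up for illustration. The first template copies every node and attribute unchanged; the second, more specific template matches any attribute not named `src` or `alt` and emits nothing for it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, not part of any library.
public class XsltAttributeFilter {

    // Identity transform plus one empty template that swallows unwanted attributes.
    private static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        // Identity template: copy every node and attribute as-is.
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        // The predicate gives this template a higher default priority, so it wins
        // for every attribute other than src and alt and outputs nothing.
        + "<xsl:template match=\"@*[local-name() != 'src' and local-name() != 'alt']\"/>"
        + "</xsl:stylesheet>";

    public static String strip(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalStateException("Could not transform input", e);
        }
    }
}
```

Calling `XsltAttributeFilter.strip("<p class=\"x\"><img src=\"a.png\" alt=\"A\" width=\"10\"/></p>")` should return markup in which only the `src` and `alt` attributes survive.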

OK, solved this somehow.

I used the HTMLCleaner library to parse the input data into a valid, well-formed format.

Then I used a DOM parser to iterate over everything and strip all disallowed tags and attributes.

(and some minor ugly hacks;) )

This was kind of a lot of work.
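
The attribute-stripping half of the DOM pass described in this answer could look roughly like the sketch below. It is not the code actually used here: it skips the HTMLCleaner step and assumes the input is already well-formed markup with no external DTD, it only removes attributes (disallowed tags are left alone), and the names `AttributeStripper`, `strip`, and `prune` are invented for illustration. It relies only on the JDK's `javax.xml` and `org.w3c.dom` APIs and needs Java 9 or later for `Set.of`.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Set;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
public class AttributeStripper {

    // Attributes that are allowed to survive.
    private static final Set<String> ALLOWED = Set.of("src", "alt");

    public static String strip(String wellFormedHtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(wellFormedHtml)));
            prune(doc.getDocumentElement());

            // Serialize the pruned tree back to a string.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Parse or serialization failure, e.g. input that is not well-formed.
            throw new IllegalStateException("Could not process input", e);
        }
    }

    // Recursively remove every attribute that is not on the allow list.
    private static void prune(Element element) {
        NamedNodeMap attributes = element.getAttributes();
        // Iterate backwards because the map shrinks as attributes are removed.
        for (int i = attributes.getLength() - 1; i >= 0; i--) {
            String name = attributes.item(i).getNodeName();
            if (!ALLOWED.contains(name.toLowerCase())) {
                element.removeAttribute(name);
            }
        }
        for (Node child = element.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (child instanceof Element) {
                prune((Element) child);
            }
        }
    }
}
```

Removing disallowed tags as well would mean extending `prune` to check each element name against a second allow list and to detach or unwrap the elements that fail it.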
