Question
How do I strip all attributes from HTML tags in a string, except "alt" and "src", using Java?
And further, how do I get the values of all "src" attributes in the string?
:)
Answer 1:
You can:
- Implement a SAX parser;
- Build a document with a DOM parser, walk it, prune it, and then convert it back to HTML; or
- Use an identity transform in XSLT (assuming your HTML is in XHTML format or can be converted to that with, say, JTidy) with some additional cases to remove attributes you don't want.
Whatever you do, don't try to do it with regular expressions.
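As a rough illustration of the XSLT option, here is a minimal sketch using only the JDK's built-in javax.xml.transform API. It assumes the input is already well-formed XHTML with no DOCTYPE (for example after a JTidy pass, which is not shown). The class name AttributeStripXslt and the method strip are made up for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of any library.
final class AttributeStripXslt {
    // Identity transform plus one override: the second template matches
    // every attribute that is not "alt" or "src" and emits nothing for it.
    private static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"@*|node()\">"
      + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match=\"@*[not(name()='alt' or name()='src')]\"/>"
      + "</xsl:stylesheet>";

    // Assumes xhtml is well-formed; anything else is reported as an exception.
    static String strip(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Could not transform input", e);
        }
    }
}
```

Calling AttributeStripXslt.strip on a fragment such as `<p class="x"><img src="a.png" alt="A" width="10"/></p>` should leave only the src and alt attributes in place. The predicate on the second template gives it a higher default priority than the identity template, which is why it wins for the unwanted attributes.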
Answer 2:
OK, solved this somehow.
I used the HTMLCleaner library to parse the input data into a valid format.
Then I used a DOM parser to iterate over everything and strip all disallowed tags and attributes
(plus some minor ugly hacks ;) ).
This was kind of a lot of work.
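For anyone after the general shape of the DOM step, here is a minimal sketch, not the code I actually used. It assumes the markup has already been cleaned into well-formed XML (the HTMLCleaner step is not shown), it only handles attributes rather than disallowed tags, and the names AttributeStripDom, strip, and srcValues are invented for the example. It also collects the "src" values along the way, which covers the second part of the question.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class AttributeStripDom {
    // Values of every "src" attribute, in document order.
    final List<String> srcValues = new ArrayList<>();

    // Assumes the input is well-formed (already cleaned) markup.
    String strip(String wellFormedMarkup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedMarkup)));
            prune(doc.getDocumentElement());

            // Serialize the pruned tree back to a string.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process markup", e);
        }
    }

    private void prune(Element el) {
        NamedNodeMap attrs = el.getAttributes();
        // Walk backwards: the map is live and shrinks as attributes are removed.
        for (int i = attrs.getLength() - 1; i >= 0; i--) {
            String name = attrs.item(i).getNodeName();
            if (name.equals("src")) {
                srcValues.add(attrs.item(i).getNodeValue());
            } else if (!name.equals("alt")) {
                el.removeAttribute(name);
            }
        }
        // Recurse into child elements only; text and comments are left alone.
        for (Node child = el.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                prune((Element) child);
            }
        }
    }
}
```

Usage would be something like creating one AttributeStripDom per input string, calling strip on it, and then reading srcValues from the same instance.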
Source: https://stackoverflow.com/questions/560605/how-to-strip-html-attributes-except-src-and-alt-in-java