I'm trying to code a secure and lightweight white-list based HTML purifier which will use DOMDocument. In order to avoid unnecessary complexity I am willing to make the fol
You mention `href` and `action` as places `javascript:` URLs can appear, but you're missing the `src` attribute among a bunch of other URL-loading attributes.
Line 399 of the OWASP Java HTMLPolicyBuilder is the definition of URL attributes in a white-listing HTML sanitizer.
    private static final Set<String> URL_ATTRIBUTE_NAMES = ImmutableSet.of(
        "action", "archive", "background", "cite", "classid", "codebase",
        "data", "dsync", "formaction", "href", "icon", "longdesc",
        "manifest", "poster", "profile", "src", "usemap");
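As a rough illustration of how such a list might be consulted, here is a minimal sketch in plain Java. The class name `UrlAttributes` and the method `isUrlAttribute` are hypothetical names chosen for this example, not part of the OWASP sanitizer's API, and `Set.of` (Java 9+) stands in for Guava's `ImmutableSet`.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper; the names here are illustrative, not OWASP API. */
final class UrlAttributes {
    // Attribute names copied from the list quoted above.
    private static final Set<String> URL_ATTRIBUTE_NAMES = Set.of(
        "action", "archive", "background", "cite", "classid", "codebase",
        "data", "dsync", "formaction", "href", "icon", "longdesc",
        "manifest", "poster", "profile", "src", "usemap");

    private UrlAttributes() {}

    /** Returns true if the attribute's value should be vetted as a URL. */
    static boolean isUrlAttribute(String attributeName) {
        // HTML attribute names are case-insensitive, so normalize first.
        return URL_ATTRIBUTE_NAMES.contains(attributeName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println("SRC is a URL attribute: " + isUrlAttribute("SRC"));
        System.out.println("title is a URL attribute: " + isUrlAttribute("title"));
    }
}
```

A sanitizer would presumably call a check like this for every attribute it keeps, and pass the values of matching attributes through its protocol whitelist.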
The HTML5 Index contains a summary of attribute types. It doesn't cover some conditional cases, but if you scan that list for "valid URL" and friends, you should get a decent idea of what HTML5 adds. The set of HTML 4 attributes with type `%URI` is also informative.
Your protocol whitelist looks very similar to the OWASP sanitizer one. The addition of `ftp` and `sftp` looks innocuous enough.
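For concreteness, a protocol whitelist check could look something like the sketch below. The exact scheme set is an assumption (`http`, `https` and `mailto` plus the `ftp` and `sftp` mentioned above), and `UrlSchemePolicy` and `isAllowedUrl` are hypothetical names. It assumes the value has already been entity-decoded by the DOM parser. URLs with no scheme are treated as relative and allowed.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical scheme check; the whitelist contents are an assumption. */
final class UrlSchemePolicy {
    private static final Set<String> ALLOWED_SCHEMES =
        Set.of("http", "https", "mailto", "ftp", "sftp");

    private UrlSchemePolicy() {}

    /** Returns true for relative URLs and for URLs with a whitelisted scheme. */
    static boolean isAllowedUrl(String url) {
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == ':') {
                // Everything before the first ':' is the scheme. Odd input such
                // as "java\tscript" simply fails the exact-match lookup.
                String scheme = url.substring(0, i).toLowerCase(Locale.ROOT);
                return ALLOWED_SCHEMES.contains(scheme);
            }
            if (c == '/' || c == '?' || c == '#') {
                break; // path, query, or fragment began first: no scheme present
            }
        }
        return true; // no scheme found, so the URL is relative
    }

    public static void main(String[] args) {
        System.out.println(isAllowedUrl("https://example.com/"));
        System.out.println(isAllowedUrl("javascript:alert(1)"));
    }
}
```

Matching against a whitelist rather than blacklisting `javascript:` means that obfuscated variants (mixed case, embedded whitespace) are rejected by default instead of needing to be anticipated one by one.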
A good source of security-related schema info for HTML elements and attributes is the Caja JSON whitelists, which are used by the Caja JS HTML sanitizer.
How are you planning on rendering the resulting DOM? If you're not careful, then even if you strip out all the `<script>` elements, an attacker might get a buggy renderer to produce content that a browser interprets as containing a `<script>` element. Consider valid HTML that does not contain a script element, only text that looks like markup once its character references are decoded. A buggy renderer that fails to re-escape that text might output it verbatim, and the result does contain a script element.
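The renderer pitfall can be sketched in a few lines of Java. The payload string below is an illustrative one chosen for this example, and `TextRendering`, `renderNaively`, `renderEscaped` and `escapeText` are hypothetical names. The sketch assumes the DOM hands the serializer already-decoded text content.

```java
/** Hypothetical serializer fragments contrasting unescaped and escaped text output. */
final class TextRendering {
    private TextRendering() {}

    /** Buggy: writes decoded DOM text into the output without re-escaping it. */
    static String renderNaively(String tagName, String text) {
        return "<" + tagName + ">" + text + "</" + tagName + ">";
    }

    /** Safer: escapes markup-significant characters before writing the text. */
    static String renderEscaped(String tagName, String text) {
        return "<" + tagName + ">" + escapeText(text) + "</" + tagName + ">";
    }

    /** Escapes '&', '<' and '>' so text cannot be reparsed as tags. */
    static String escapeText(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Text content of a textarea node after the parser decoded its entities.
        String text = "</textarea><script>alert(1)</script>";
        System.out.println(renderNaively("textarea", text));
        System.out.println(renderEscaped("textarea", text));
    }
}
```

The input DOM here has no script element at all, only a text node. Only the naive serializer turns that text back into live markup, which is why the output stage matters as much as the filtering stage.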
(Full disclosure: I wrote chunks of both HTML sanitizers mentioned above.)