I\'m looking for best practices for performing strict (whitelist) validation/filtering of user-submitted HTML.
Main purpose is to filter out XSS and similar nasties
The W3C has a big open-source package for validating HTML available here:
http://validator.w3.org/
You can download the package for yourself and probably implement whatever they're doing. Unfortunately, it seems like a lot of DOM parsers seem to be willing to bend the rules to allot for HTML code "in the wild" as it were, so it's a good idea to let the masters tell you what's wrong and not leave it to a more practical tool--there are a lot of websites out there that aren't perfect, compliant HTML but that we still use every day.