I have a rich text editor that passes HTML to the server. That HTML is then displayed to other users. I want to make sure there is no JavaScript in that HTML. Is there any way to do this?
The only way to ensure that a piece of HTML markup contains no JavaScript is to filter out all unsafe HTML tags and attributes, in order to prevent Cross-Site Scripting (XSS).
However, there is in general no reliable way of explicitly removing every unsafe element and attribute by name, since browsers may interpret ones you weren't even aware of at design time, thereby opening a security hole for malicious users. This is why you're much better off taking a whitelisting approach rather than a blacklisting one: only allow HTML tags and attributes you are sure are safe, and strip everything else by default, as the sketch below illustrates. Even a single accidentally permitted tag can make your website vulnerable to XSS.
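To make the difference concrete, here is a tiny, purely illustrative C# comparison (the tag sets and the unknown tag name below are arbitrary placeholders, not a recommended configuration). A blacklist silently passes anything it has never heard of, whereas a whitelist strips it by default:

```
using System;
using System.Collections.Generic;

class WhitelistVsBlacklistDemo
{
    static void Main()
    {
        // Hypothetical sets, purely for illustration.
        var blacklist = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
            { "script", "object", "embed", "applet" };
        var whitelist = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
            { "p", "b", "i", "em", "strong", "ul", "ol", "li" };

        // Some element the filter's author never anticipated.
        string unknownTag = "somefuturetag";

        // Blacklist: the tag is not on the list, so it slips straight through.
        bool passesBlacklist = !blacklist.Contains(unknownTag);   // true

        // Whitelist: the tag is not on the list, so it is stripped by default.
        bool passesWhitelist = whitelist.Contains(unknownTag);    // false

        Console.WriteLine($"Blacklist lets it through: {passesBlacklist}");
        Console.WriteLine($"Whitelist lets it through: {passesWhitelist}");
    }
}
```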
See this article on HTML sanitisation, which offers some specific examples of why you should whitelist rather than blacklist. Quote from that page:
Here is an incomplete list of potentially dangerous HTML tags and attributes:
- script, which can contain malicious script
- applet, embed, and object, which can automatically download and execute malicious code
- meta, which can contain malicious redirects
- onload, onunload, and all other on* attributes, which can contain malicious script
- style, link, and the style attribute, which can contain malicious script
Here is another helpful page that suggests a set of HTML tags, attributes, and CSS properties that are typically safe to allow, along with recommended practices.
Although many websites have used the blacklisting approach in the past (and some still do), there is almost never any true need for it: the security risks invariably outweigh the modest limitations that whitelisting places on the formatting capabilities granted to the user. You need to be very aware of its flaws.
For example, this page gives a list of what are supposedly "all" the HTML tags you might want to strip out. Even a brief look shows that it covers only a limited set of element names; a browser could easily support a proprietary tag that unwittingly allows scripts to run on your page, which is essentially the core problem with blacklisting.
Finally, I would strongly recommend using an HTML DOM library (such as the well-known HTML Agility Pack for .NET) rather than regular expressions to perform the cleaning/whitelisting, since it is significantly more reliable. (It is quite possible to craft obfuscated HTML that can fool regexes, and a proper HTML reader/writer makes coding the system much easier anyway.)
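Here is a minimal sketch of such a whitelist sanitiser built on the HTML Agility Pack. The allowed tag and attribute sets are arbitrary examples rather than a recommended policy, and note that removing a disallowed element here also discards its children, which you may or may not want:

```
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlWhitelistSanitizer
{
    // Illustrative policy only; substitute the tags and attributes you actually need.
    private static readonly HashSet<string> AllowedTags = new HashSet<string>(
        StringComparer.OrdinalIgnoreCase) { "p", "b", "i", "em", "strong", "ul", "ol", "li", "br" };

    private static readonly HashSet<string> AllowedAttributes = new HashSet<string>(
        StringComparer.OrdinalIgnoreCase) { "title" };

    public static string Sanitize(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot the nodes first so removals don't disturb the iteration.
        foreach (var node in doc.DocumentNode.Descendants().ToList())
        {
            if (node.NodeType != HtmlNodeType.Element)
                continue;

            if (!AllowedTags.Contains(node.Name))
            {
                // Default-deny: strip the disallowed element (and its children).
                node.Remove();
                continue;
            }

            // Remove every attribute that is not explicitly whitelisted
            // (this also catches all on* event-handler attributes).
            foreach (var attribute in node.Attributes.ToList())
            {
                if (!AllowedAttributes.Contains(attribute.Name))
                    attribute.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

If you would rather keep the text inside a disallowed wrapper element (a div, say), you could promote its children before removing it; that is a policy decision this sketch deliberately keeps simple. Note also that it does nothing about HTML comments or CSS, which a real policy would need to consider.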
Hopefully that gives you a decent overview of what you need to design in order to prevent XSS as fully as possible, and shows why HTML sanitisation must be performed with the unknown factor in mind.