What's an example of something dangerous that would not be caught by the code below?
EDIT: After some of the comments, I added another line; it is commented below.
Although I can't provide a specific example of why not, I am going to go ahead and outright say no. This is more on principle. Regexes are an amazing tool, but they should only be used for certain problems. They are fantastic for data matching and searching.
They are not, however, a good tool for security. It is too easy to get a regex only partially correct, and hackers can find lots of wiggle room inside a poorly (or even well) constructed regex. I would try another avenue to prevent cross-site scripting.
For example, the javascript: pseudo-URL can be obfuscated with HTML entities, you've forgotten about <embed>, and there are dangerous CSS properties like behavior and expression in IE.
There are countless ways to evade filters, and such an approach is bound to fail. Even if you find and block every exploit possible today, new unsafe elements and attributes may be added in the future.
There are only two good ways to secure HTML:

1. Convert it to text by replacing every < with &lt;. If you want to allow users to enter formatted text, you can use your own markup (e.g. Markdown, like SO does).

2. Parse the HTML into a DOM, check every element and attribute, and remove everything that is not whitelisted. You will also need to check the contents of allowed attributes like href (make sure that URLs use a safe protocol; block all unknown protocols). Once you've cleaned up the DOM, generate new, valid HTML from it. Never work on HTML as if it were text, because invalid markup, comments, entities, etc. can easily fool your filter.
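The href check from the second approach could be sketched in Java roughly as follows; the scheme whitelist here is an illustrative assumption, not a complete list:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Set;

public class HrefChecker {
    // Only protocols on this whitelist are allowed; every unknown scheme is rejected.
    private static final Set<String> SAFE_SCHEMES = Set.of("http", "https", "mailto");

    public static boolean isSafeHref(String href) {
        try {
            String scheme = new URI(href.trim()).getScheme();
            // Relative URLs have no scheme; treat them as safe here.
            if (scheme == null) return true;
            return SAFE_SCHEMES.contains(scheme.toLowerCase());
        } catch (URISyntaxException e) {
            // Unparseable URLs are rejected outright.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isSafeHref("https://example.com"));     // true
        System.out.println(isSafeHref("javascript:alert('XSS')")); // false
    }
}
```

Note that this check runs after the HTML has been parsed into a DOM, so entity obfuscation such as jav&#x09;ascript: has already been decoded by the parser before the URL reaches it.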
Also make sure your page declares its encoding, because there are exploits that take advantage of browsers auto-detecting the wrong encoding.
Take a look at the XSS cheat sheet at http://ha.ckers.org/xss.html; it's not a complete list, but it's a good start.
One that comes to mind is <img src="http://badsite.com/javascriptfile" />
You also forgot onmouseover and the style tag.
The easiest thing to do really is entity escaping. If the vector can't render properly in the first place, an incomplete blacklist won't matter.
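A minimal sketch of that entity escaping in Java, covering the five characters that commonly matter in HTML contexts, might look like:

```java
public class EntityEscape {
    // Escape the five characters that matter for HTML contexts.
    public static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#x27;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints &lt;script&gt;alert(&#x27;x&#x27;)&lt;/script&gt;
        System.out.println(escape("<script>alert('x')</script>"));
    }
}
```

With the markup characters neutralized like this, the payload never parses as a tag, so gaps in a blacklist stop mattering.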
I still have not figured out why developers want to massage bad input into good input with a regular-expression replace. Unless your site is a blog and needs to allow embedded HTML or JavaScript or any other sort of code, reject the bad input and return an error. The old saying is Garbage In, Garbage Out: why would you want to take in a nice steaming pile of poo and make it edible?
If your site is not internationalized, why accept any Unicode?
If your site only does POST, why accept any URL-encoded values?
Why accept any hex? Why accept HTML entities? What user legitimately types an entity like '&quot;'?
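A rejection-based check along those lines might look like this in Java; the allowed character class and length limit are assumptions for illustration:

```java
import java.util.regex.Pattern;

public class InputValidator {
    // Whitelist of what a username may contain -- anything else is rejected,
    // not "cleaned up". The character class here is an illustrative assumption.
    private static final Pattern USERNAME = Pattern.compile("^[A-Za-z0-9_.-]{1,32}$");

    public static String requireValidUsername(String input) {
        if (input == null || !USERNAME.matcher(input).matches()) {
            // Garbage in -> error out, not garbage massaged into something "safe".
            throw new IllegalArgumentException("invalid username");
        }
        return input;
    }

    public static void main(String[] args) {
        System.out.println(requireValidUsername("alice_01")); // accepted as-is
    }
}
```

The point of the design is that there is no transformation step at all: input either matches the whitelist exactly or the request fails.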
As for regular expressions, using them is fine; however, you do not have to code a separate regular expression for the full attack string. You can reject many different attack signatures with just a few well-constructed regex patterns:
patterns.put("xssAttack1", Pattern.compile("<script",Pattern.CASE_INSENSITIVE) );
patterns.put("xssAttack2", Pattern.compile("SRC=",Pattern.CASE_INSENSITIVE) );
patterns.put("xssAttack3", Pattern.compile("pt:al",Pattern.CASE_INSENSITIVE) );
patterns.put("xssAttack4", Pattern.compile("xss",Pattern.CASE_INSENSITIVE) );
<FRAMESET><FRAME SRC="javascript:alert('XSS');"></FRAMESET>
<DIV STYLE="width: expression(alert('XSS'));">
<LINK REL="stylesheet" HREF="javascript:alert('XSS');">
<IMG SRC="jav ascript:alert('XSS');"> // HTML allows embedded tabs...
<IMG SRC="jav
ascript:alert('XSS');"> // HTML allows an embedded newline...
<IMG SRC="jav
ascript:alert('XSS');"> // HTML allows an embedded carriage return...
Notice that my patterns are not the full attack signature, just enough to detect whether the value is malicious. It is unlikely that a user would enter 'SRC=' or 'pt:al'. This allows my regex patterns to detect unknown attacks that contain any of these tokens.
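One way to put the patterns above to work is a small scanner that checks a request value against the token map and reports the first hit; this is a sketch, and the map contents simply repeat the patterns listed earlier:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class TokenBlacklist {
    // Same token patterns as above: short fragments, not full attack signatures.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("xssAttack1", Pattern.compile("<script", Pattern.CASE_INSENSITIVE));
        PATTERNS.put("xssAttack2", Pattern.compile("SRC=",    Pattern.CASE_INSENSITIVE));
        PATTERNS.put("xssAttack3", Pattern.compile("pt:al",   Pattern.CASE_INSENSITIVE));
        PATTERNS.put("xssAttack4", Pattern.compile("xss",     Pattern.CASE_INSENSITIVE));
    }

    // Returns the name of the first matching pattern, or null if the value looks clean.
    public static String firstMatch(String value) {
        for (Map.Entry<String, Pattern> e : PATTERNS.entrySet()) {
            if (e.getValue().matcher(value).find()) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("javascript:alert(1)")); // caught by the pt:al token
    }
}
```

The same firstMatch call would be run over every param name, param value, header, and cookie in the request, as described below.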
Many developers will tell you that you cannot protect a site with a blacklist. Since the set of attacks is infinite, that is basically true. However, if you parse the entire request (params, param values, headers, cookies) against a blacklist constructed from tokens, you will be able to tell what is an attack and what is valid.

Remember, the attacker will most likely be shotgunning exploits at you from a tool. If you have properly hardened your server, he will not know what environment you are running and will have to blast you with lists of exploits. If he pesters you enough, put the attacker, or his IP, on a quarantine list. If he has a tool with 50k exploits ready to hit your site, how long will it take him if you quarantine his ID or IP for 30 minutes for each violation? Admittedly, there is still exposure if the attacker uses a botnet to multiplex his attack, but your site ends up being a much tougher nugget to crack.
Having checked the entire request for malicious content, you can then use whitelist-type checks (length, referential/logical consistency, naming) to determine the validity of the request.
Don't forget to implement some sort of CSRF protection. Maybe a honey token, and check the user-agent string from previous requests to see if it has changed.
From a different point of view, what happens when someone wants to have 'javascript' or 'functionload' or 'visionblurred' in what they submit? This can happen in most places for any number of reasons... From what I understand, those will become 'javaSAFEscript', 'functionSAFEload' and 'visionSAFEblurred'(!!).
If this might apply to you, and you're stuck with the blacklist approach, be sure to use exact-matching regexes to avoid annoying your users. In other words, sit at the optimum point between security and usability, compromising each as little as possible.
You're much better off turning all < into &lt; and all > into &gt;, then converting acceptable tags back. In other words, whitelist, don't blacklist.
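One possible sketch of that escape-then-convert-back approach in Java; the set of allowed tags is an assumption for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagWhitelist {
    // After escaping everything, convert a small set of harmless tags back.
    // The allowed set here (b, i, em, strong) is an illustrative assumption.
    private static final Pattern ALLOWED =
        Pattern.compile("&lt;(/?)(b|i|em|strong)&gt;", Pattern.CASE_INSENSITIVE);

    public static String sanitize(String input) {
        // Escape & first so later replacements are not double-escaped.
        String escaped = input.replace("&", "&amp;")
                              .replace("<", "&lt;")
                              .replace(">", "&gt;");
        // Un-escape only the whitelisted tags.
        Matcher m = ALLOWED.matcher(escaped);
        return m.replaceAll("<$1$2>");
    }

    public static void main(String[] args) {
        // Prints <b>hi</b>&lt;script&gt;x&lt;/script&gt;
        System.out.println(sanitize("<b>hi</b><script>x</script>"));
    }
}
```

Because everything starts out escaped, an unrecognized tag stays inert text by default, which is what makes this a whitelist rather than a blacklist.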