Questions:
What are the best safe1(), safe2(), safe3(), and safe4() functions to avoid XSS for UTF-8 encoded pages? Is it also safe in all browsers (specifically IE6)?
See http://php.net/htmlentities and note the section on the optional third parameter, which takes a character encoding. You should use this instead of mb_convert_encoding. As long as the PHP file itself is saved with UTF-8 encoding, that should work.
htmlentities($s, ENT_COMPAT, 'UTF-8');
As for injecting the variable directly into JavaScript, you might consider putting the content into a hidden HTML element somewhere else in the page instead, and pulling the content out of the DOM when you need it.
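The hidden-element suggestion could be sketched roughly as follows. The helper name render_hidden_payload(), the element id user-data, and the data-payload attribute are hypothetical names chosen for illustration; ENT_QUOTES is an assumption made here so that both quote styles are escaped inside the attribute.

```php
<?php
// Hypothetical helper: render a hidden element that carries user data as text.
function render_hidden_payload(string $id, string $value): string
{
    // ENT_QUOTES escapes both " and ' so the value cannot break out of the attribute.
    $safeId = htmlspecialchars($id, ENT_QUOTES, 'UTF-8');
    $safeValue = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');

    return '<span id="' . $safeId . '" data-payload="' . $safeValue
        . '" style="display:none"></span>';
}

$xss = '"><script>alert(1)</script>';
echo render_hidden_payload('user-data', $xss), "\n";

// On the client, the script reads the value back as plain text, for example:
//   var a = document.getElementById('user-data').getAttribute('data-payload');
```

The value never appears inside a script block, so only the HTML attribute escaping rules apply to it. The script still has to treat the retrieved string as untrusted text.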
The purifiers that you mention are used when you want to actually display HTML that a user submitted (as in, allow the browser to render it). Using htmlentities will encode everything so that the characters are displayed in the UI, but none of the markup is interpreted by the browser. Which are you aiming to do?
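To make the "displayed but not interpreted" behaviour concrete, here is a small sketch. The sample string is made up, and ENT_QUOTES is used instead of ENT_COMPAT as an assumption, because it also escapes single quotes.

```php
<?php
// Made-up user input containing markup and an event handler.
$s = '<b onmouseover="alert(1)">hello</b>';

// htmlspecialchars escapes &, <, > and quotes.
$escaped = htmlspecialchars($s, ENT_QUOTES, 'UTF-8');

// htmlentities does the same and also converts other characters
// that have named entities.
$entities = htmlentities($s, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n";
echo $entities, "\n";
```

In both cases the browser shows the tag as literal text rather than rendering a bold element or attaching the event handler.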
safe2() is clearly htmlspecialchars().
In place of safe1() you should really be using HTMLPurifier to sanitize complete blobs of HTML. It strips unwanted attributes and tags, and in particular anything JavaScript-like. Yes, it's slow, but it covers all the small edge cases (even for older IE versions), which allows for safe reuse of user-submitted HTML snippets. See http://htmlpurifier.org/comparison for alternatives.
If you really only want to display raw user text there (no filtered HTML), then htmlspecialchars(strip_tags($src)) would work fine.
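For the text-only case, the htmlspecialchars(strip_tags($src)) combination might be wrapped like this. The function name safe1_text_only() is hypothetical, and the ENT_QUOTES and 'UTF-8' arguments are assumptions added for a UTF-8 page.

```php
<?php
// Hypothetical text-only variant of the question's safe1() placeholder.
function safe1_text_only(string $src): string
{
    // strip_tags removes the tags themselves (the text between them stays),
    // then htmlspecialchars escapes whatever special characters remain.
    return htmlspecialchars(strip_tags($src), ENT_QUOTES, 'UTF-8');
}

echo safe1_text_only('Hello <b>world</b> <script>alert("x")</script> & co'), "\n";
```

This only covers plain text output. If users are meant to submit HTML that gets rendered, a purifier is still the right tool.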
safe3() screams regular expression. Here you can really only apply a whitelist to whatever you actually want:
var a = "<?php echo preg_replace('/[^-\w\d .,]/', "", $xss)?>";
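The same whitelist can be tried in isolation, outside the JavaScript string. This is a sketch: the function name safe3_whitelist() is hypothetical, and the redundant \d from the pattern above is dropped because \w already covers digits.

```php
<?php
// Hypothetical wrapper around the whitelist pattern from the answer.
function safe3_whitelist(string $xss): string
{
    // Keep word characters, hyphen, space, dot and comma; drop everything else.
    // Without the /u modifier this treats input as bytes, so non-ASCII
    // characters are typically removed as well.
    return (string) preg_replace('/[^-\w .,]/', '', $xss);
}

echo safe3_whitelist('"; alert(document.cookie); //'), "\n";
```

Quotes, semicolons, parentheses, slashes and backslashes are all removed, so the result cannot terminate the surrounding JavaScript string literal.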
You can of course use json_encode here to get perfectly valid JS syntax and a valid variable. But then you've just deferred the exploitability of that string to your JS code, where you then have to babysit it.
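A sketch of the json_encode route is shown below. The function name safe3_json() is hypothetical, and the JSON_HEX_* flags are an addition not mentioned in the answer: they encode <, >, &, ' and " as \u escapes, which is commonly done so the value cannot close an enclosing script tag.

```php
<?php
// Hypothetical wrapper: encode a PHP string as a JavaScript string literal.
function safe3_json(string $xss): string
{
    // The JSON_HEX_* flags replace <, >, &, ' and " with \uXXXX escapes.
    return json_encode(
        $xss,
        JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT
    );
}

$xss = '</script><script>alert(1)</script>';
echo 'var a = ' . safe3_json($xss) . ';', "\n";
```

The variable then holds the original, unfiltered string on the JavaScript side, so the caveat still applies: passing it on to something like innerHTML or eval reintroduces the problem.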
Is it also safe in all browsers (specifically IE6)?
If you specify the charset explicitly (in the Content-Type header or a meta tag), then IE won't do its awful content detection magic, so UTF-7 exploits can be ignored.
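Declaring the charset explicitly might look like the sketch below. The helper name content_type_header() is hypothetical, and the headers_sent() guard is only there so the snippet does not emit a warning if output has already started.

```php
<?php
// Hypothetical helper that builds the Content-Type header line.
function content_type_header(string $charset = 'UTF-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// Send the header before any output is produced.
if (!headers_sent()) {
    header(content_type_header());
}

// The equivalent declaration inside the page's <head> would be:
//   <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
```

Sending the HTTP header is generally preferable to relying on the meta tag alone, since the browser sees it before parsing any of the page.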