I want to allow a lot of user-submitted HTML for user profiles. I currently try to filter out what I don't want, but I now want to switch to a whitelist approach.
You can just use the strip_tags() function.
Since the function is defined as
string strip_tags ( string $str [, string $allowable_tags ] )
You can do this:
$html = $_POST['content'];
$html = strip_tags($html, '<b><a><i><u><span>');
But take note that with strip_tags() you won't be able to filter out the attributes on the tags you allow, e.g.
<a href="javascript:alert('haha caught cha!');">link</a>
HTML Purifier is the best HTML parser/cleaner out there.
It's a pretty simple aim to achieve, actually: you just need to match anything that is NOT a tag from your whitelist and remove it from the source. It can be done quite easily with one regex.
function sanitize($html) {
    $whitelist = array(
        'b', 'i', 'u', 'strong', 'em', 'a'
    );
    // Strip any opening or closing tag whose name is not in the whitelist;
    // the negative lookahead leaves whitelisted tags intact
    $pattern = "/<\/?(?!(?:" . implode("|", $whitelist) . ")\b)[^>]*>/i";
    return preg_replace($pattern, "", $html);
}
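For example, with a made-up input (note that only the tags themselves are removed; the text inside them survives):

echo sanitize('<b>bold</b> <script>alert("xss")</script> <div>hi</div>');
// Prints: <b>bold</b> alert("xss") hi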
I haven't tested this exhaustively, but you get the gist of how it works. You might also want to look at using a formatting language such as Textile or Markdown instead of raw HTML.
HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.
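For reference, basic usage looks roughly like this (HTML.Allowed is the library's whitelist directive; the allowed-tag list here is just an example):

require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
// Whitelist elements and attributes, e.g. links may only carry an href
$config->set('HTML.Allowed', 'b,i,u,strong,em,a[href]');

$purifier = new HTMLPurifier($config);
$clean = $purifier->purify($_POST['content']);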
Try the function getCleanHTML below: it reduces elements to their text content, except for elements whose tag name is in the whitelist, which are kept as-is. The code is clean and easy to understand and debug.
<?php
$TagWhiteList = array(
    'b', 'i', 'u', 'strong', 'em', 'a', 'img'
);

// Serialize a single node and its subtree back to an HTML string
function getHTMLCode($Node) {
    $Document = new DOMDocument();
    $Document->appendChild($Document->importNode($Node, true));
    return $Document->saveHTML();
}

// Walk the tree: whitelisted elements are kept verbatim, everything
// else is reduced to the text content of its descendants
function getCleanHTML($Node, $Text = "") {
    global $TagWhiteList;

    // Text nodes, comments etc. have no tag name; keep their text
    if (!($Node instanceof DOMElement))
        return $Text . $Node->textContent;

    if (in_array($Node->tagName, $TagWhiteList))
        return $Text . getHTMLCode($Node);

    // Not whitelisted: drop the tag itself and recurse into the children
    for ($Child = $Node->firstChild; $Child !== null; $Child = $Child->nextSibling)
        $Text = getCleanHTML($Child, $Text);

    return $Text;
}

$Doc = new DOMDocument();
$Doc->loadHTMLFile("Test.html");
echo getCleanHTML($Doc->documentElement) . "\n";
?>
Hope this helps.
Maybe it is safer to use DOMDocument to parse the markup properly, remove disallowed tags with removeChild(), and then serialize the result. It is not always safe to filter with regular expressions, especially once things reach this level of complexity. Attackers can find ways around such filters; forums and social networks know that very well.
For instance, some browsers tolerate whitespace after the <. If your regex filters <script but I submit < script instead... big FAIL!
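A minimal sketch of that DOM-based approach (the function name and whitelist here are illustrative, not a vetted implementation, and attributes on the kept tags would still need separate vetting):

<?php
// 'p' is whitelisted here because DOMDocument wraps loose body text
// in <p> elements while parsing
function removeDisallowedTags($html, $whitelist = array('p', 'b', 'i', 'u', 'strong', 'em', 'a')) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);   // @ silences warnings on sloppy user markup

    // Collect offenders first, then remove; deleting while iterating
    // the node list can skip nodes
    $xpath = new DOMXPath($doc);
    $remove = array();
    foreach ($xpath->query('//body//*') as $node) {
        if (!in_array(strtolower($node->nodeName), $whitelist))
            $remove[] = $node;
    }
    foreach ($remove as $node)
        $node->parentNode->removeChild($node);

    // Serialize only the body contents back to a string
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child)
        $out .= $doc->saveHTML($child);
    return $out;
}

echo removeDisallowedTags('<p>hi <script>alert(1)</script> <b>bold</b></p>');
// Prints: <p>hi  <b>bold</b></p>
?>

Unlike strip_tags(), removeChild() drops the element together with everything inside it, so the script body disappears as well.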