As most (all?) PHP libraries that do HTML sanitization, such as HTML Purifier, are heavily dependent on regex, I thought trying to write an HTML sanitizer that uses the DOMDocu
For a start, you can have a look at this custom RecursiveDOMIterator:
Code:
/**
 * Iterates recursively over the child nodes of a DOMNode.
 * The #[\ReturnTypeWillChange] attributes avoid deprecation notices on
 * PHP 8.1+ and are parsed as plain comments by older PHP versions.
 */
class RecursiveDOMIterator implements RecursiveIterator
{
    /**
     * Current position in the DOMNodeList
     * @var int
     */
    protected $_position;

    /**
     * The DOMNodeList with all children to iterate over
     * @var DOMNodeList
     */
    protected $_nodeList;

    /**
     * @param DOMNode $domNode
     */
    public function __construct(DOMNode $domNode)
    {
        $this->_position = 0;
        $this->_nodeList = $domNode->childNodes;
    }

    /**
     * Returns the current DOMNode
     * @return DOMNode
     */
    #[\ReturnTypeWillChange]
    public function current()
    {
        return $this->_nodeList->item($this->_position);
    }

    /**
     * Returns an iterator over the children of the current node
     * @return RecursiveDOMIterator
     */
    #[\ReturnTypeWillChange]
    public function getChildren()
    {
        return new self($this->current());
    }

    /**
     * Returns whether an iterator can be created for the current node
     * @return bool
     */
    #[\ReturnTypeWillChange]
    public function hasChildren()
    {
        return $this->current()->hasChildNodes();
    }

    /**
     * Returns the current position
     * @return int
     */
    #[\ReturnTypeWillChange]
    public function key()
    {
        return $this->_position;
    }

    /**
     * Moves the current position to the next element
     * @return void
     */
    #[\ReturnTypeWillChange]
    public function next()
    {
        $this->_position++;
    }

    /**
     * Rewinds the iterator to the first element
     * @return void
     */
    #[\ReturnTypeWillChange]
    public function rewind()
    {
        $this->_position = 0;
    }

    /**
     * Checks whether the current position is valid
     * @return bool
     */
    #[\ReturnTypeWillChange]
    public function valid()
    {
        return $this->_position < $this->_nodeList->length;
    }
}
You can use that in combination with a RecursiveIteratorIterator. Usage examples are on the page.
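As a rough idea of what such usage could look like, here is a minimal sketch. The helper name `listElements` is made up for illustration, and the snippet assumes the RecursiveDOMIterator class above has already been defined or autoloaded.

```php
<?php
/**
 * Hypothetical helper: returns "depth:nodeName" for every element below $root.
 * Assumes the RecursiveDOMIterator class from above is already loaded.
 */
function listElements(DOMNode $root): array
{
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDOMIterator($root),
        RecursiveIteratorIterator::SELF_FIRST // visit each parent before its children
    );

    $found = [];
    foreach ($iterator as $node) {
        // Skip text nodes, comments and so on; only report elements.
        if ($node->nodeType === XML_ELEMENT_NODE) {
            $found[] = $iterator->getDepth() . ':' . $node->nodeName;
        }
    }

    return $found;
}
```

Calling `listElements($dom)` on a loaded DOMDocument walks the whole tree in document order. Without `SELF_FIRST` the RecursiveIteratorIterator defaults to `LEAVES_ONLY`, which would only yield nodes that have no children.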
In general, though, it would be easier to use XPath to search for blacklisted nodes than to traverse the DOM tree yourself. Also keep in mind that DOM already does a good job of preventing XSS by automatically escaping XML entities in node values.
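The XPath approach could look roughly like the sketch below. The function name `removeBlacklistedElements` and the idea of passing the blacklist as an array of tag names are assumptions for the example, and the tag names are interpolated into the query, so they must come from trusted code and not from user input.

```php
<?php
/**
 * Hypothetical helper: removes every element whose tag name is blacklisted.
 * $blacklist must contain trusted tag names only, e.g. ['script', 'iframe'].
 * Returns the number of removed elements.
 */
function removeBlacklistedElements(DOMDocument $dom, array $blacklist): int
{
    if ($blacklist === []) {
        return 0;
    }

    // Builds a union query such as "//script | //iframe".
    $query = implode(' | ', array_map(function ($tag) {
        return '//' . $tag;
    }, $blacklist));

    $xpath   = new DOMXPath($dom);
    $removed = 0;

    // Copy the matches into an array first so that removing nodes
    // cannot disturb the iteration.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
            $removed++;
        }
    }

    return $removed;
}
```

Note that this only covers elements. A real sanitizer would also have to look at attributes such as event handlers and `javascript:` URLs, which is out of scope for this sketch.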
The other thing you have to be aware of is that any manipulation of a DOMDocument immediately affects any live DOMNodeList you are holding (such as childNodes or the result of getElementsByTagName), and that can lead to skipped nodes when you modify the tree while iterating over the list. See "DOMNode replacement with PHP's DOM classes" for an example.
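One common way around the skipped-node problem is to take a snapshot of the list before changing anything. The sketch below uses getElementsByTagName, whose result is a live list, and a made-up helper name `removeElementsByTagName`.

```php
<?php
/**
 * Hypothetical helper: removes all elements with the given tag name
 * and returns how many were removed.
 */
function removeElementsByTagName(DOMDocument $dom, string $tagName): int
{
    // getElementsByTagName() returns a live list that shrinks as nodes are
    // removed, so copy it into a plain array before touching the tree.
    $nodes = iterator_to_array($dom->getElementsByTagName($tagName));

    foreach ($nodes as $node) {
        $node->parentNode->removeChild($node);
    }

    return count($nodes);
}
```

Iterating over the live list backwards, from the last index down to 0, is another option that avoids the copy.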