I'm building an XML file from scratch and need to know: does htmlentities() convert every character that could potentially break an XML file (and possibly UTF-8 data)?
Gordon's answer is good and explains the XML encoding problems, but it does not show a simple function (or what the black box does). Jon's answer starts well with the 'htmlspecialchars' recommendation, but he and others make some mistakes, so I will be emphatic.
A good programmer MUST be in control of whether or not UTF-8 is used in their strings and XML data: UTF-8 (or another non-ASCII encoding) IS SAFE in a consistent algorithm.
SAFE UTF-8 XML DOES NOT NEED FULL ENTITY ENCODING. Indiscriminate encoding produces "second-class, non-human-readable, encode/decode-demanding XML". Safe ASCII XML does not need entity encoding either, when all your content is ASCII.
Only 3 or 4 characters need to be escaped in a string of XML content: > (as &gt;), < (as &lt;), & (as &amp;), and optionally " (as &quot;).
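To make the difference between full-entity encoding and escaping only these characters concrete, here is a minimal sketch comparing the two built-ins on a UTF-8 string. It assumes the input is UTF-8 and passes the flags and charset explicitly, because the defaults differ between PHP versions; the sample string is made up for illustration.

```php
<?php
// Sample input: UTF-8 text containing markup characters (made up for illustration).
$s = 'Café & <b>"crème"</b>';

// htmlentities() also rewrites é and è as HTML named entities,
// which XML does not predefine.
$all = htmlentities($s, ENT_COMPAT | ENT_HTML401, 'UTF-8');

// htmlspecialchars() only touches &, <, > and (with ENT_COMPAT) the double quote.
$min = htmlspecialchars($s, ENT_COMPAT | ENT_XML1, 'UTF-8');

echo $all, "\n"; // Caf&eacute; &amp; &lt;b&gt;&quot;cr&egrave;me&quot;&lt;/b&gt;
echo $min, "\n"; // Café &amp; &lt;b&gt;&quot;crème&quot;&lt;/b&gt;
```

The first result is only usable in XML if the document declares &eacute; and &egrave; itself; the second is readable and needs no extra declarations.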
Please read http://www.w3.org/TR/REC-xml/, sections "2.4 Character Data and Markup" and "4.6 Predefined Entities". THEN you can use 'htmlspecialchars'.
For illustration, the following PHP function makes a string completely safe for use in XML:
// This is a didactic illustration; in real code USE htmlspecialchars($s, $flags)
function xmlsafe($s, $intoQuotes = 0) {
    if ($intoQuotes)
        // value goes inside a "..." attribute: same escapes as htmlspecialchars($s, ENT_COMPAT)
        return str_replace(array('&','>','<','"'), array('&amp;','&gt;','&lt;','&quot;'), $s);
    else
        // element content: same escapes as htmlspecialchars($s, ENT_NOQUOTES)
        return str_replace(array('&','>','<'), array('&amp;','&gt;','&lt;'), $s);
}
// Example of SAFE XML CONSTRUCTION
function xmlTag($element, $attribs, $contents = NULL) {
    $out = '<' . $element;
    // attribute values are escaped for use inside double quotes
    foreach ($attribs as $name => $val)
        $out .= ' ' . $name . '="' . xmlsafe($val, 1) . '"';
    if ($contents === '' || is_null($contents))
        $out .= '/>'; // empty element
    else
        $out .= '>' . xmlsafe($contents) . "</$element>";
    return $out;
}
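As a usage sketch, the snippet below re-declares both helpers on top of htmlspecialchars() so that it runs on its own, then builds one element. The explicit ENT_XML1 flag and 'UTF-8' charset are assumptions added here, and the element name, attribute and text are invented for the example.

```php
<?php
// Escape text for XML; pass true when the value goes inside a "..." attribute.
function xmlsafe(string $s, bool $intoQuotes = false): string {
    $flags = ($intoQuotes ? ENT_COMPAT : ENT_NOQUOTES) | ENT_XML1;
    return htmlspecialchars($s, $flags, 'UTF-8');
}

// Build one element; names are trusted, attribute values and contents are escaped.
function xmlTag(string $element, array $attribs, ?string $contents = null): string {
    $out = '<' . $element;
    foreach ($attribs as $name => $val) {
        $out .= ' ' . $name . '="' . xmlsafe((string) $val, true) . '"';
    }
    if ($contents === null || $contents === '') {
        return $out . '/>';
    }
    return $out . '>' . xmlsafe($contents) . "</$element>";
}

$xml = xmlTag('note', ['title' => 'R&D "plan"'], 'Price < 10 € & rising');
echo $xml, "\n";
// <note title="R&amp;D &quot;plan&quot;">Price &lt; 10 € &amp; rising</note>
```

Note that the € sign passes through untouched: only the markup characters are rewritten.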
In a CDATA block you do not need to use this function. But please avoid the indiscriminate use of CDATA.
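One reason for caution: a CDATA section ends at the first ]]>, so content that contains that sequence still needs handling. A minimal sketch, assuming a hypothetical helper named xmlCdata (not a PHP built-in), splits the sequence across two adjacent sections:

```php
<?php
// Hypothetical helper: wrap raw text in CDATA, splitting any "]]>" so that
// it cannot terminate the section early.
function xmlCdata(string $s): string {
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $s) . ']]>';
}

echo xmlCdata('x < y && z'), "\n"; // <![CDATA[x < y && z]]>
echo xmlCdata('a]]>b'), "\n";      // <![CDATA[a]]]]><![CDATA[>b]]>
```

A parser concatenates the two adjacent sections, so the reader of the XML gets the original text back.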