I'm generating an XML document from a PHP script and I need to escape the XML special characters. I know the list of characters that should be escaped; but what is the corr
To produce valid XML, you need to escape the XML special characters and write the text in the same encoding that the XML declaration states (the "encoding" attribute in the <?xml ... ?> line). Accented characters don't need to be escaped as long as they are encoded in the document's encoding.
However, in many situations simply escaping the input with htmlspecialchars() may lead to double-encoded entities (for example, &eacute; would become &amp;eacute;), so I suggest decoding HTML entities first:
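function xml_escape($s)
{
    // Decode any existing HTML entities so they are not escaped twice.
    $s = html_entity_decode($s, ENT_QUOTES, 'UTF-8');
    // Escape &, <, >, " and ' for use in XML text and attribute values.
    $s = htmlspecialchars($s, ENT_QUOTES, 'UTF-8', false);
    return $s;
}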
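To see the double-encoding problem concretely, here is a small sketch that compares plain htmlspecialchars() with the decode-first approach used in xml_escape() above. The sample string and the variable names are made up for illustration.

```php
<?php
// Sample input (hypothetical): one existing HTML entity plus raw special characters.
$input = 'Caf&eacute; & <tea>';

// Escaping directly also escapes the ampersand of the existing entity.
$naive = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

// Decoding first turns &eacute; into the literal character, then escapes the rest.
$decoded = html_entity_decode($input, ENT_QUOTES, 'UTF-8');
$safe    = htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8', false);

echo $naive, "\n"; // Caf&amp;eacute; &amp; &lt;tea&gt;
echo $safe, "\n";  // Café &amp; &lt;tea&gt;
```

The second result contains only the entities XML predefines, so it no longer depends on an HTML entity such as &eacute; being known to the XML parser.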
Now you need to make sure all accented characters are valid in the XML document's encoding. I strongly encourage you to always encode XML output in UTF-8, since not all XML parsers respect the encoding given in the XML declaration. If your input is in ISO-8859-1, utf8_encode() converts it to UTF-8 (note that utf8_encode() is deprecated as of PHP 8.2; mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1') performs the same conversion).
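As a rough sketch of that step, the helper below leaves valid UTF-8 alone and converts everything else. The function name to_utf8() is hypothetical, the fallback to ISO-8859-1 is an assumption about your data, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper: returns $s as UTF-8, assuming any input that is not
// already valid UTF-8 is ISO-8859-1 (an assumption; adjust to your data).
function to_utf8($s)
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "caf\xe9";     // "café" in ISO-8859-1
$utf8   = "caf\xc3\xa9"; // "café" in UTF-8

var_dump(to_utf8($latin1) === $utf8); // bool(true)
var_dump(to_utf8($utf8) === $utf8);   // bool(true)
```

The validity check is only a heuristic: some ISO-8859-1 byte sequences also happen to be valid UTF-8, so knowing the real source encoding is always preferable to guessing.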
There's a special case: your input may come from one of these encodings: ISO-8859-1, ISO-8859-15, UTF-8, cp866, cp1251, cp1252, and KOI8-R. PHP treats them all the same, but there are slight differences between them, some of which even iconv() cannot handle. In particular, cp1252 assigns printable characters (such as the euro sign and curly quotes) to bytes 0x80 to 0x9F, which utf8_encode() turns into control characters instead. I could only solve this encoding issue by complementing the behavior of utf8_encode():
function encode_utf8($s)
{
    // Keys: what utf8_encode() produces for cp1252 bytes 0x80 to 0x9F.
    // Values: the UTF-8 encoding of the character cp1252 assigns to that byte.
    $cp1252_map = array(
        "\xc2\x80" => "\xe2\x82\xac",
        "\xc2\x82" => "\xe2\x80\x9a",
        "\xc2\x83" => "\xc6\x92",
        "\xc2\x84" => "\xe2\x80\x9e",
        "\xc2\x85" => "\xe2\x80\xa6",
        "\xc2\x86" => "\xe2\x80\xa0",
        "\xc2\x87" => "\xe2\x80\xa1",
        "\xc2\x88" => "\xcb\x86",
        "\xc2\x89" => "\xe2\x80\xb0",
        "\xc2\x8a" => "\xc5\xa0",
        "\xc2\x8b" => "\xe2\x80\xb9",
        "\xc2\x8c" => "\xc5\x92",
        "\xc2\x8e" => "\xc5\xbd",
        "\xc2\x91" => "\xe2\x80\x98",
        "\xc2\x92" => "\xe2\x80\x99",
        "\xc2\x93" => "\xe2\x80\x9c",
        "\xc2\x94" => "\xe2\x80\x9d",
        "\xc2\x95" => "\xe2\x80\xa2",
        "\xc2\x96" => "\xe2\x80\x93",
        "\xc2\x97" => "\xe2\x80\x94",
        "\xc2\x98" => "\xcb\x9c",
        "\xc2\x99" => "\xe2\x84\xa2",
        "\xc2\x9a" => "\xc5\xa1",
        "\xc2\x9b" => "\xe2\x80\xba",
        "\xc2\x9c" => "\xc5\x93",
        "\xc2\x9e" => "\xc5\xbe",
        "\xc2\x9f" => "\xc5\xb8"
    );
    // Convert as ISO-8859-1 first, then patch the cp1252-specific characters.
    $s = strtr(utf8_encode($s), $cp1252_map);
    return $s;
}
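To illustrate what the map is doing, here is a minimal sketch using only two of its entries. It uses mb_convert_encoding() from ISO-8859-1 as a stand-in for utf8_encode() (the same conversion, assuming the mbstring extension is available); the sample string and variable names are made up.

```php
<?php
// Abbreviated two-entry subset of the cp1252 map above, for illustration only.
$subset = array(
    "\xc2\x80" => "\xe2\x82\xac", // cp1252 byte 0x80: euro sign
    "\xc2\x99" => "\xe2\x84\xa2", // cp1252 byte 0x99: trade mark sign
);

// Hypothetical cp1252 input containing bytes 0x80 and 0x99.
$raw = "100\x80 Widget\x99";

// Treating the input as ISO-8859-1 maps 0x80 and 0x99 to the control
// characters U+0080 and U+0099, encoded as C2 80 and C2 99.
$latin1 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');

// The map replaces those control characters with the intended symbols.
$fixed = strtr($latin1, $subset);

var_dump($latin1 === "100\xc2\x80 Widget\xc2\x99");         // bool(true)
var_dump($fixed === "100\xe2\x82\xac Widget\xe2\x84\xa2");  // bool(true)
```

Without the second step the output is still valid UTF-8, but it contains control characters rather than the euro and trade mark signs, and an XML parser may reject or mangle them.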