I need to load an XML document into PHP that comes from an external source. The XML does not declare its encoding and contains illegal characters like &.
If the Tidy extension is not an option, you may consider HTML Purifier.
Try using the Tidy library, which can be used to clean up bad HTML and XML: http://php.net/manual/en/book.tidy.php
A pure PHP solution to fix some XML like this:
<?xml version="1.0"?>
<feed>
<RECORD>
<ID>117387</ID>
<ADVERTISERNAME>Test < texter</ADVERTISERNAME>
<AID>10544740</AID>
<NAME>This & This</NAME>
<DESCRIPTION>For one day only this is > than this.</DESCRIPTION>
</RECORD>
</feed>
Would be something like this:
function cleanupXML($xml) {
    $xmlOut = '';
    $inTag = false;
    $xmlLen = strlen($xml);
    for ($i = 0; $i < $xmlLen; ++$i) {
        $char = $xml[$i];
        switch ($char) {
            case '<':
                if (!$inTag) {
                    // Seek forward for the next tag boundary
                    for ($j = $i + 1; $j < $xmlLen; ++$j) {
                        $nextChar = $xml[$j];
                        switch ($nextChar) {
                            case '<': // Another < comes first, so this one is a < in text
                                $char = htmlentities($char);
                                break 2;
                            case '>': // Means we are in a tag
                                $inTag = true;
                                break 2;
                        }
                    }
                } else {
                    $char = htmlentities($char);
                }
                break;
            case '>':
                if (!$inTag) { // No need to seek ahead here
                    $char = htmlentities($char);
                } else {
                    $inTag = false;
                }
                break;
            case '&':
                if (!$inTag) {
                    $char = htmlentities($char);
                }
                break;
            // Every other byte is copied through unchanged: passing a single byte
            // of a multibyte UTF-8 character to htmlentities() would corrupt it,
            // and named HTML entities such as &eacute; are not predefined in XML.
        }
        $xmlOut .= $char;
    }
    return $xmlOut;
}
This is a simple state machine that tracks whether we are inside a tag and, if we are not, encodes the text using htmlentities.
It's worth noting that this will be memory hungry on large files, so you may want to rewrite it as a stream filter or a pre-processor.
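One caveat with the character-by-character approach is that it escapes every & it meets in text, so an entity that is already valid, such as &amp;, would come out as &amp;amp;. If stray ampersands are the only problem in the feed, a regular expression with a negative lookahead may be a lighter option. This is a different technique from the state machine above, offered only as a sketch; the function name escapeBareAmpersands is hypothetical, and it does nothing about a stray < or > in text.

```php
<?php
// Hypothetical helper (not from the answer above): escape only those
// ampersands that do not already start an entity or character reference.
function escapeBareAmpersands(string $xml): string
{
    // The negative lookahead skips "&name;", "&#123;" and "&#x1F;" forms.
    $pattern = '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/';

    // preg_replace() returns null on a regex error; fall back to the input.
    return preg_replace($pattern, '&amp;', $xml) ?? $xml;
}

$fixed = escapeBareAmpersands('<feed><NAME>This & This</NAME></feed>');
$feed = simplexml_load_string($fixed);
```

Note that the pattern treats anything shaped like an entity as valid, so an HTML-only name such as &nbsp; would pass through untouched and still fail in an XML parser.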
To solve this issue, set the DOMDocument recover property to TRUE before loading the XML document:
$dom->recover = TRUE;
Try this code:
$feedURL = '3704017_14022010_050004.xml';
$dom = new DOMDocument();
$dom->recover = TRUE;
$dom->load($feedURL);
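Recovery mode can drop or alter the text it could not parse, so it may be worth collecting the parser's complaints rather than letting them surface as PHP warnings. The following is a minimal sketch of that idea using libxml_use_internal_errors() and libxml_get_errors(); the helper name loadLenient and the inline sample string are assumptions made for illustration.

```php
<?php
// Hypothetical helper: loadLenient() is an illustrative name,
// not part of the DOM extension.
function loadLenient(string $xml): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->recover = true; // ask libxml to keep going past errors
    $ok = $dom->loadXML($xml);

    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    return [$ok ? $dom : null, $errors];
}

[$dom, $errors] = loadLenient('<feed><NAME>This & This</NAME></feed>');
foreach ($errors as $error) {
    // Each entry is a LibXMLError with message, line and column fields.
    error_log(trim($error->message) . ' at line ' . $error->line);
}
```

Comparing the recovered document against the logged errors shows which records were affected before the data is trusted.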