I'm trying to load an XML source from a remote location, so I have no control over the formatting. Unfortunately, the XML file I'm trying to load does not declare an encoding:
<
You can try the XMLReader class instead. XMLReader is designed specifically for XML, and its open() and XML() methods accept an encoding argument (or null to specify none).
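A minimal sketch of that idea, assuming the remote document is really ISO-8859-1. The sample string and the element name `name` are made up for illustration, and the encoding has to be whatever the source actually uses:

```php
<?php
// Hypothetical sample: ISO-8859-1 bytes (\xE9 is "é") with no XML declaration.
$xml = "<root><name>Caf\xE9</name></root>";

$reader = new XMLReader();
// The second argument tells the parser which encoding the input is in.
$reader->XML($xml, 'ISO-8859-1');

$names = array();
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'name') {
        // Text is handed back as UTF-8, whatever the input encoding was.
        $names[] = $reader->readString();
    }
}
$reader->close();
```

For a remote file, `$reader->open($url, 'ISO-8859-1')` takes the same encoding argument.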
You have to convert your document to UTF-8. The easiest way is utf8_encode(), which assumes the input is ISO-8859-1 (note that it is deprecated as of PHP 8.2).
DOMDocument example:
$doc = new DOMDocument();
$content = utf8_encode(file_get_contents($url));
$doc->loadXML($content);
SimpleXML example:
$xmlInput = simplexml_load_string(utf8_encode(file_get_contents($url_or_file)));
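Since utf8_encode() is deprecated in newer PHP versions, here is a sketch of the same conversion with mb_convert_encoding(). It assumes the mbstring extension is available and that the input really is ISO-8859-1; the sample string stands in for the result of file_get_contents():

```php
<?php
// Hypothetical sample: ISO-8859-1 bytes (\xE9 is "é") with no XML declaration.
$raw = "<root><name>Caf\xE9</name></root>";

// Same effect as utf8_encode($raw), with the source encoding named explicitly.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');

$doc = new DOMDocument();
$doc->loadXML($utf8);
$name = $doc->getElementsByTagName('name')->item(0)->textContent;
```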
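If you don't know the current encoding, run mb_detect_encoding() on the raw bytes (before any conversion), for example:
$content = file_get_contents($url_or_file);
# The candidate list is an assumption; adjust it to the encodings you expect
$encoding = mb_detect_encoding($content, array('UTF-8', 'ISO-8859-1'), true);
$doc = new DOMDocument();
$res = $doc->loadXML("<?xml version=\"1.0\" encoding=\"$encoding\"?>" . $content);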
Notes:
If you use $doc->loadHTML instead, you can still use the XML header.
If you know the encoding, use iconv() to convert it:
$xml = iconv('ISO-8859-1', 'UTF-8', $xmlInput);
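Putting detection and conversion together could look roughly like this. The helper name toUtf8() and the candidate list are assumptions for illustration, not part of any library, and detection is only a guess when several encodings fit the bytes:

```php
<?php
// Hypothetical helper: detect the encoding from a candidate list, then convert.
function toUtf8($xml, array $candidates = array('UTF-8', 'ISO-8859-1'))
{
    // Strict mode rejects candidates for which the bytes are invalid.
    $encoding = mb_detect_encoding($xml, $candidates, true);
    if ($encoding === false) {
        throw new RuntimeException('Could not detect the encoding');
    }
    if ($encoding === 'UTF-8') {
        return $xml; // already UTF-8, nothing to do
    }
    $converted = iconv($encoding, 'UTF-8', $xml);
    if ($converted === false) {
        throw new RuntimeException("Conversion from $encoding failed");
    }
    return $converted;
}

// Hypothetical sample: \xE9 is "é" in ISO-8859-1 and invalid as UTF-8 here.
$fixed = toUtf8("<root>Caf\xE9</root>");
```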
You could pre-process the document by adding an XML declaration that specifies the encoding it is delivered in. Which encoding that is, you'll have to ascertain yourself, of course. The DOM object should then parse it.
Example XML declaration:
<?xml version="1.0" encoding="UTF-8" ?>
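A rough sketch of that pre-processing step. The helper name withDeclaration() is made up, and ISO-8859-1 is only an assumed encoding for the sample:

```php
<?php
// Hypothetical helper: prepend an XML declaration unless one is already there.
function withDeclaration($xml, $encoding)
{
    // "<?xml" followed by whitespace marks an existing declaration.
    if (preg_match('/^<\?xml\s/', $xml)) {
        return $xml;
    }
    return '<?xml version="1.0" encoding="' . $encoding . '"?>' . "\n" . $xml;
}

// Hypothetical sample: ISO-8859-1 bytes (\xE9 is "é") with no declaration.
$raw = "<root><name>Caf\xE9</name></root>";

$doc = new DOMDocument();
$doc->loadXML(withDeclaration($raw, 'ISO-8859-1'));
// DOM hands strings back as UTF-8, whatever the declared encoding was.
$name = $doc->getElementsByTagName('name')->item(0)->textContent;
```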
I ran into a similar situation. I was getting an XML file that was supposed to be UTF-8 encoded, but it included some bad ISO characters.
I wrote the following code to encode the bad characters to UTF-8:
<?php
# The XML file with bad characters
$filename = "sample_xml_file.xml";
# Read file contents to a variable
$contents = file_get_contents($filename);
# Find the runs of non-ASCII bytes
preg_match_all('/[\x80-\xFF]+/', $contents, $badchars);
# Process bad characters if some were found
if (!empty($badchars[0]))
{
    # Map each unique run to its UTF-8 equivalent;
    # runs that are already valid UTF-8 map to themselves
    $map = array();
    foreach (array_unique($badchars[0]) as $badchar)
    {
        $map[$badchar] = mb_check_encoding($badchar, 'UTF-8') ? $badchar : utf8_encode($badchar);
    }
    # Replace in a single pass so converted bytes are not matched again
    $contents = strtr($contents, $map);
}
# Write the fixed contents back to the file
file_put_contents($filename, $contents);
# Cleanup
unset($contents);
# Now the bad characters have been encoded to UTF8
# It will now load file with DOMDocument
$dom = new DOMDocument();
$dom->load($filename);
?>
I posted about the solution in more detail at: http://dev.strategystar.net/2012/01/convert-bad-characters-to-utf-8-in-an-xml-file-with-php/