Question
I am importing the contents of an Excel-generated CSV file into an XML document like this:
$csv = fopen($csvfile, 'r'); // the mode must be a quoted string
$words = array();
while (($pair = fgetcsv($csv)) !== FALSE) {
    array_push($words, array('en' => $pair[0], 'de' => $pair[1]));
}
The inserted data are English/German expressions.
I insert these values into an XML structure and output the XML as follows:
$dictionary = new SimpleXMLElement('<dictionary></dictionary>');
//do things
$dom = dom_import_simplexml($dictionary) -> ownerDocument;
$dom -> formatOutput = true;
header('Content-Type: text/xml; charset=utf-8'); // the charset belongs in Content-Type; Content-Encoding is meant for compression such as gzip
echo $dom -> saveXML();
This is working fine, yet I am encountering one really strange problem. When the first letter of a string is an umlaut (as in Österreich or Ägypten), the character is omitted, resulting in gypten or sterreich. If the umlaut is in the middle of the string (Russische Föderation), it gets transferred correctly. The same goes for characters like ß or é.
All files are UTF-8 encoded and served in UTF-8.
This seems rather strange and bug-like to me, yet maybe I am missing something, there's a lot of smart people around here.
Answer 1:
OK, so this seems to be a bug in fgetcsv.
I am now processing the CSV data on my own (a little cumbersome), but it is working and I do not have any encoding issues at all.
This is (a not-yet-optimized version of) what I am doing:
$rawCSV = file_get_contents($csvfile);
// split on line breaks in all operating systems: http://stackoverflow.com/a/7498886/797194
$lines = preg_split('/$\R?^/m', $rawCSV);
foreach ($lines as $line) {
    array_push($words, getCSVValues($line));
}
The getCSVValues function is borrowed from another source and is needed to deal with CSV lines like this one, where a quoted field contains commas:
"I'm a string, what should I do when I need commas?",Howdy there
It looks like:
function getCSVValues($string, $separator=","){
    $elements = explode($separator, $string);
    for ($i = 0; $i < count($elements); $i++) {
        $nquotes = substr_count($elements[$i], '"');
        if ($nquotes %2 == 1) {
            for ($j = $i+1; $j < count($elements); $j++) {
                if (substr_count($elements[$j], '"') %2 == 1) { // Look for an odd number of quotes
                    // Put the quoted string's pieces back together again
                    array_splice($elements, $i, $j-$i+1,
                        implode($separator, array_slice($elements, $i, $j-$i+1)));
                    break;
                }
            }
        }
        if ($nquotes > 0) {
            // Remove first and last quotes, then merge pairs of quotes
            $qstr =& $elements[$i];
            $qstr = substr_replace($qstr, '', strpos($qstr, '"'), 1);
            $qstr = substr_replace($qstr, '', strrpos($qstr, '"'), 1);
            $qstr = str_replace('""', '"', $qstr);
        }
    }
    return $elements;
}
Quite a bit of a workaround, but it seems to work fine.
EDIT:
There is also a filed bug for this; apparently the behavior depends on the locale settings.
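Since the behavior appears to depend on the locale, one thing to try before hand-rolling a parser is to set a UTF-8 character-type locale around the fgetcsv calls. The following is only a sketch under assumptions: the helper name readCsvWithUtf8Locale is hypothetical, the locale names in the candidate list are examples that may not be installed on a given system, and whether this cures the missing first letter depends on the PHP version and platform.

```php
<?php
// Hypothetical helper: read a CSV file while a UTF-8 LC_CTYPE locale is active.
// The candidate locale names are assumptions; check `locale -a` on your system.
function readCsvWithUtf8Locale(
    string $path,
    array $candidates = ['en_US.UTF-8', 'en_US.utf8', 'C.UTF-8']
): array {
    // Passing '0' only queries the current setting without changing it.
    $previous = setlocale(LC_CTYPE, '0');

    // setlocale() tries each name in turn and returns false if none is available.
    $applied = setlocale(LC_CTYPE, $candidates);
    if ($applied === false) {
        error_log('No UTF-8 locale available; reading with the current locale.');
    }

    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $rows = [];
    try {
        // Separator, enclosure and escape are passed explicitly for clarity.
        while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
            $rows[] = $row;
        }
    } finally {
        fclose($handle);
        // Restore whatever locale was active before the call.
        if ($previous !== false) {
            setlocale(LC_CTYPE, $previous);
        }
    }
    return $rows;
}
```

Restoring the previous locale keeps the change local to the import, which matters because setlocale affects the whole process.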
Answer 2:
If the string comes from Excel, setting the locale fixed it for me (I had problems with the letter ø disappearing when it was at the beginning of the string):
setlocale(LC_ALL, 'en_US.ISO-8859-1');
Answer 3:
If other umlauts in the middle appear ok, then this is not a base encoding issue. The fact that it happens at the beginning of the line probably indicates some incompatibility with the newline mark. Perhaps the CSV was generated with a different newline encoding.
This happens when moving files between different operating systems:
- Windows: \r\n (characters 13 and 10)
- Linux: \n (character 10)
- Mac OS: \r (character 13)
If I were you, I would verify the newline mark to be sure.
On Linux, run hexdump -C filename | more and inspect the document.
If that's the case, you can change the newline marks with a sed expression.
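If you would rather normalize the line endings inside the PHP script instead of with sed, something along these lines might do. It is a minimal sketch; the function name normalizeNewlines is made up for illustration and it assumes the whole file fits in memory as a string.

```php
<?php
// Hypothetical helper: convert Windows (\r\n) and old Mac (\r) line endings to \n.
function normalizeNewlines(string $text): string
{
    // Order matters: replace the two-character sequence first,
    // then any carriage return that is still left on its own.
    return str_replace(["\r\n", "\r"], "\n", $text);
}
```

The result can then be split on "\n" and each line handed to a parser such as str_getcsv.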
Hope that helped!
Answer 4:
A bit simpler workaround (but pretty dirty):
//1. replace delimiter in input string with delimiter + some constant
$dataLine = str_replace($this->fieldDelimiter, $this->fieldDelimiter . $this->bugFixer, $dataLine);
//2. parse
$parsedLine = str_getcsv($dataLine, $this->fieldDelimiter);
//3. remove the constant from resulting strings.
foreach ($parsedLine as $i => $parsedField)
{
    $parsedLine[$i] = str_replace($this->bugFixer, '', $parsedField);
}
Answer 5:
This could be some sort of utf8_encode() problem. A comment on the documentation page seems to indicate that encoding an umlaut that is already encoded can cause issues.
Maybe test to see if the data is already utf-8 encoded with mb_detect_encoding().
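A rough sketch of such a check follows. It uses mb_check_encoding rather than mb_detect_encoding, because validating against one known encoding is more predictable than guessing among several. The function name ensureUtf8 and the ISO-8859-1 fallback are assumptions for illustration; Excel exports may use a different legacy encoding, and the mbstring extension must be enabled.

```php
<?php
// Hypothetical helper: return the string as UTF-8, converting only when needed.
// The fallback source encoding is an assumption, not something the data tells us.
function ensureUtf8(string $value, string $fallbackEncoding = 'ISO-8859-1'): string
{
    // Valid UTF-8 is returned untouched, so nothing gets encoded twice.
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    // Otherwise treat the bytes as the fallback encoding and convert them.
    return mb_convert_encoding($value, 'UTF-8', $fallbackEncoding);
}
```

Applied to each field after parsing, this avoids the double-encoding case described above while still repairing fields that arrive in a single-byte encoding.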
Source: https://stackoverflow.com/questions/12390851/fgetcsv-is-eating-the-first-letter-of-a-string-if-its-an-umlaut