I currently have no clue how to sort an array that contains UTF-8 encoded strings in PHP. The array comes from an LDAP server, so sorting via a database (would be no probl
$a = array( 'Кръстев', 'Делян1', 'делян1', 'Делян2', 'делян3', 'кръстев' );
$col = new \Collator('bg_BG');
$col->asort( $a );
var_dump( $a );
Prints:
array
2 => string 'делян1' (length=11)
1 => string 'Делян1' (length=11)
3 => string 'Делян2' (length=11)
4 => string 'делян3' (length=11)
5 => string 'кръстев' (length=14)
0 => string 'Кръстев' (length=14)
The Collator class is defined in the PECL intl extension. It is distributed with the PHP 5.3 sources but might be disabled for some builds. E.g. in Debian it is in the package php5-intl.
Collator::compare is useful for usort.
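A minimal sketch of the usort variant, assuming the intl extension is loaded. The helper name sortWithCollator and the de_DE locale are only illustrative choices, not part of the answer above:

```php
<?php
// Hypothetical helper: sort a list of UTF-8 strings with Collator::compare.
// Assumes the intl extension (Collator class) is available.
function sortWithCollator(array $items, string $locale): array
{
    $collator = new Collator($locale);
    // usort() reindexes the array; Collator::asort() would keep the keys instead.
    usort($items, [$collator, 'compare']);
    return $items;
}

$sorted = sortWithCollator(
    ['Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich'],
    'de_DE'
);
print_r($sorted);
```

Passing the collator's compare method as the callback is handy when you need usort-style sorting (for example on a field of nested arrays) rather than sorting the array of strings directly.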
Your collation needs to match the character set. Since your data is UTF-8 encoded, you should use a UTF-8 collation. It could be named differently on different platforms, but a good guess would be de_DE.utf8.
On UNIX systems, you can get a list of currently installed locales with the command
locale -a
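Since the name differs between platforms, one option is to let setlocale() try several spellings; it accepts an array and returns the first locale it could set, or false. A rough sketch (the helper name pickCollationLocale and the candidate list are my assumptions):

```php
<?php
// Hypothetical helper: try several spellings of a locale name and return
// the first one the system accepts, or false if none is installed.
function pickCollationLocale(array $candidates)
{
    // Passing "0" only queries the current setting without changing it.
    $previous = setlocale(LC_COLLATE, '0');
    // With an array, setlocale() tries each entry until one succeeds.
    $chosen = setlocale(LC_COLLATE, $candidates);
    // Restore the previous collation locale so the probe has no side effects.
    setlocale(LC_COLLATE, $previous);
    return $chosen;
}

$locale = pickCollationLocale(['de_DE.utf8', 'de_DE.UTF-8', 'German_Germany.1252']);
var_dump($locale);
```

If this prints bool(false), none of the candidates is installed and strcoll() will keep using whatever locale was active before.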
This is a very complex issue, since UTF-8 encoded data can contain any Unicode character (i.e. characters from many 8-bit encodings which collate differently in different locales).
Perhaps you could decode your UTF-8 data into Unicode code points (I'm not familiar with PHP's Unicode functions, sorry), normalize them into NFD or NFKD and then sort on code points; that might give a collation that makes sense to you (i.e. "A" before "Ä").
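A rough sketch of that idea in PHP, assuming the intl extension's Normalizer class is available (the helper name sortByNfdCodePoints is made up for illustration). It is not real locale collation: uppercase letters still sort before lowercase ones, and the accent only acts as a tie-breaker at the position where it occurs.

```php
<?php
// Hypothetical helper: sort UTF-8 strings by the code points of their NFD form.
// Assumes the intl extension (Normalizer class) is available.
function sortByNfdCodePoints(array $items): array
{
    usort($items, function (string $a, string $b): int {
        // NFD splits "Ä" into "A" followed by a combining diaeresis.
        $na = Normalizer::normalize($a, Normalizer::FORM_D);
        $nb = Normalizer::normalize($b, Normalizer::FORM_D);
        // Byte order of valid UTF-8 equals code point order, so strcmp() suffices.
        return strcmp($na, $nb);
    });
    return $items;
}

$sorted = sortByNfdCodePoints(['Ä', 'B', 'A']);
print_r($sorted);
```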
Check the links I provided.
EDIT: since you mention that your input data are clear (I assume they all fall within the "windows-1252" codepage), you should do the following conversion: UTF-8 → Unicode → Windows-1252, and then sort the Windows-1252 encoded data with the "CP1252" locale selected.
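Something along these lines might do it, assuming the mbstring extension is available. The helper name sortViaCp1252 is hypothetical, and the locale name you pass must be a CP1252 locale that actually exists on your system. Characters that have no Windows-1252 equivalent are replaced during conversion, so this only makes sense for data known to fit that codepage.

```php
<?php
// Hypothetical helper: sort UTF-8 strings by comparing Windows-1252 copies
// with strcoll(). $locale must name a CP1252 locale installed on the system
// (e.g. 'German_Germany.1252' on Windows); that name is an assumption here.
function sortViaCp1252(array $utf8Items, string $locale): array
{
    $previous = setlocale(LC_COLLATE, '0');
    if (setlocale(LC_COLLATE, $locale) === false) {
        throw new RuntimeException("Locale $locale is not available");
    }
    usort($utf8Items, function (string $a, string $b): int {
        // Compare single-byte copies; the original UTF-8 strings are what gets sorted.
        return strcoll(
            mb_convert_encoding($a, 'Windows-1252', 'UTF-8'),
            mb_convert_encoding($b, 'Windows-1252', 'UTF-8')
        );
    });
    setlocale(LC_COLLATE, $previous);
    return $utf8Items;
}

try {
    print_r(sortViaCp1252(['Birnen', 'Äpfel', 'Apfel'], 'German_Germany.1252'));
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```

The returned array still holds the UTF-8 strings, so nothing downstream has to know about the temporary conversion.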
Using your example with codepage 1252 worked perfectly fine here on my Windows development machine.
$array=array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
$oldLocal=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, 'German_Germany.1252'));
usort($array, 'strcoll');
var_dump(setlocale(LC_COLLATE, $oldLocal));
var_dump($array);
...snip...
This was with PHP 5.2.6, btw.
function traceStrColl($a, $b) {
    $outValue = strcoll($a, $b);
    echo "$a $b $outValue\r\n";
    return $outValue;
}
$array=array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
setlocale(LC_COLLATE, 'German_Germany.65001');
usort($array, 'traceStrColl');
print_r($array);
gives:
Ungetüme Äpfel 2147483647
Ungetüme Birnen 2147483647
Ungetüme Apfel 2147483647
Ungetüme Ungetiere 2147483647
Österreich Ungetüme 2147483647
Äpfel Ungetiere 2147483647
Äpfel Birnen 2147483647
Apfel Äpfel 2147483647
Ungetiere Birnen 2147483647
I did find some bug reports which have been flagged as bogus... Your best bet is filing a bug report, I suppose.
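I found the following helper function, which converts all letters of a string to ASCII letters, very helpful here.
function _all_letters_to_ASCII($string) {
    // utf8_decode() converts to ISO-8859-1 (deprecated as of PHP 8.2), so both
    // strtr() arguments are single-byte strings of equal length.
    // Characters outside ISO-8859-1 (e.g. Š, Œ) become '?' in the process.
    return strtr(utf8_decode($string),
        utf8_decode('ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
        'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
}
After that a simple array_multisort() gives you what you want.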
$array = array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
$reference_array = $array;
foreach ($reference_array as $key => &$value) {
    $value = _all_letters_to_ASCII($value);
}
unset($value); // break the reference left over from the foreach loop
var_dump($reference_array);
// Sort by the ASCII copies and reorder the original array the same way.
array_multisort($reference_array, $array);
var_dump($array);
Of course you can make the helper function fit more advanced needs. But for now, it looks pretty good.
array(6) {
[0]=> string(6) "Birnen"
[1]=> string(5) "Apfel"
[2]=> string(8) "Ungetume"
[3]=> string(5) "Apfel"
[4]=> string(9) "Ungetiere"
[5]=> string(10) "Osterreich"
}
array(6) {
[0]=> string(5) "Apfel"
[1]=> string(6) "Äpfel"
[2]=> string(6) "Birnen"
[3]=> string(11) "Österreich"
[4]=> string(9) "Ungetiere"
[5]=> string(9) "Ungetüme"
}
In the end, this problem cannot be solved in a simple way without recoding the strings (UTF-8 → Windows-1252 or ISO-8859-1) as suggested by ΤΖΩΤΖΙΟΥ, due to an obvious PHP bug discovered by Huppie. To summarize the problem, I created the following code snippet, which clearly demonstrates that the culprit is the strcoll() function when the Windows UTF-8 codepage 65001 is used.
function traceStrColl($a, $b) {
    $outValue = strcoll($a, $b);
    echo "$a $b $outValue\r\n"; // trace every comparison
    return $outValue;
}
// Match only Windows; a substring test for 'win' would also match 'Darwin'.
$locale = (strncasecmp(PHP_OS, 'WIN', 3) === 0) ? 'German_Germany.65001' : 'de_DE.utf8';
$string = "ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜabcdefghijklmnopqrstuvwxyzäöüß";
$array = array();
// Split the string into single UTF-8 characters.
for ($i = 0; $i < mb_strlen($string, 'UTF-8'); $i++) {
    $array[] = mb_substr($string, $i, 1, 'UTF-8');
}
$oldLocale = setlocale(LC_COLLATE, "0"); // "0" returns the current locale
var_dump(setlocale(LC_COLLATE, $locale));
usort($array, 'traceStrColl');
setlocale(LC_COLLATE, $oldLocale);
var_dump($array);
The result is:
string(20) "German_Germany.65001"
a B 2147483647
[...]
array(59) {
[0]=>
string(1) "c"
[1]=>
string(1) "B"
[2]=>
string(1) "s"
[3]=>
string(1) "C"
[4]=>
string(1) "k"
[5]=>
string(1) "D"
[6]=>
string(2) "ä"
[7]=>
string(1) "E"
[8]=>
string(1) "g"
[...]
The same snippet works on a Linux machine without any problems producing the following output:
string(10) "de_DE.utf8"
a B -1
[...]
array(59) {
[0]=>
string(1) "a"
[1]=>
string(1) "A"
[2]=>
string(2) "ä"
[3]=>
string(2) "Ä"
[4]=>
string(1) "b"
[5]=>
string(1) "B"
[6]=>
string(1) "c"
[7]=>
string(1) "C"
[...]
The snippet also works with Windows-1252 (or ISO-8859-1) encoded strings; of course the encoding passed to the mb_* functions and the locale must be changed accordingly.
I filed a bug report on bugs.php.net: Bug #46165 strcoll() does not work with UTF-8 strings on Windows. If you experience the same problem, you can give your feedback to the PHP team on the bug-report page (two other, probably related, bugs have been classified as bogus - I don't think that this bug is bogus ;-).
Thanks to all of you.