How to sort an array of UTF-8 strings?

刺人心 2020-11-27 04:32

I currently have no clue how to sort an array that contains UTF-8 encoded strings in PHP. The array comes from an LDAP server, so sorting via a database (would be no probl

8 Answers
  • 2020-11-27 04:52
    $a = array( 'Кръстев', 'Делян1', 'делян1', 'Делян2', 'делян3', 'кръстев' );
    $col = new \Collator('bg_BG');
    $col->asort( $a );
    var_dump( $a );
    

    Prints:

    array
      2 => string 'делян1' (length=11)
      1 => string 'Делян1' (length=11)
      3 => string 'Делян2' (length=11)
      4 => string 'делян3' (length=11)
      5 => string 'кръстев' (length=14)
      0 => string 'Кръстев' (length=14)
    

    The Collator class is defined in the PECL intl extension. It is distributed with the PHP 5.3 sources but might be disabled in some builds; in Debian, for example, it is in the package php5-intl.

    Collator::compare is useful for usort.
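
    A minimal sketch of that usort() variant, assuming the intl extension is loaded. The variable names are only illustrative, and 'bg_BG' simply mirrors the locale used above. Unlike Collator::asort(), usort() throws away the original keys and reindexes the array.

```php
<?php
// Sketch: sort and reindex with usort(), using Collator::compare as the callback.
// Requires the intl extension; 'bg_BG' mirrors the locale used in the example above.
$names = array('Кръстев', 'Делян1', 'делян1', 'Делян2', 'делян3', 'кръстев');

$collator = new Collator('bg_BG');

// Collator::compare() returns a negative, zero, or positive integer,
// which is the contract usort() expects from its callback.
usort($names, array($collator, 'compare'));

print_r($names);
```

    The resulting order should match the asort() output above, only with fresh keys 0 to 5.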

  • 2020-11-27 04:52

    Your collation needs to match the character set. Since your data is UTF-8 encoded, you should use a UTF-8 collation. It could be named differently on different platforms, but a good guess would be de_DE.utf8.

    On UNIX systems, you can get a list of currently installed locales with the command

    locale -a
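
    Because the exact name differs per platform, it may help to let PHP try several candidates. The following is only a sketch: the two candidate names are guesses for a German UTF-8 locale, and either may be missing on your system. setlocale() accepts an array, tries each entry in turn, and returns false if none is installed.

```php
<?php
// Candidate names for the same German UTF-8 collation. These are guesses;
// compare them with the output of `locale -a` on your machine.
$candidates = array('de_DE.utf8', 'de_DE.UTF-8');

// setlocale() returns the name it actually set, or false if no candidate exists.
$active = setlocale(LC_COLLATE, $candidates);

$words = array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');

if ($active === false) {
    echo "No candidate locale is installed; leaving the array unsorted.\n";
} else {
    // strcoll() compares according to the LC_COLLATE locale set above.
    usort($words, 'strcoll');
    echo "Sorted with $active\n";
}

print_r($words);
```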
    
  • 2020-11-27 04:53

    This is a very complex issue, since UTF-8 encoded data can contain any Unicode character (i.e. characters from many 8-bit encodings which collate differently in different locales).

    Perhaps you could convert your UTF-8 data into Unicode (I am not familiar with PHP's Unicode functions, sorry), normalize it to NFD or NFKD, and then sort on code points. That might give a collation that makes sense to you (i.e. "A" before "Ä").
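
    A rough sketch of this idea in PHP, assuming the intl extension (for the Normalizer class) is available. No separate "convert to Unicode" step is needed here, because comparing valid UTF-8 strings byte by byte gives the same order as comparing their code points. Note that this is not a real locale collation: it is case-sensitive, so every uppercase letter sorts before every lowercase one.

```php
<?php
// Sketch: decompose each string to NFD, then compare the decomposed forms
// byte by byte. For valid UTF-8, byte order equals code point order.
// Requires the intl extension for the Normalizer class.
$words = array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');

usort($words, function ($a, $b) {
    // In NFD, "Ä" becomes "A" followed by a combining diaeresis (U+0308),
    // so it sorts right after the plain "A" strings instead of after "Z".
    $na = Normalizer::normalize($a, Normalizer::FORM_D);
    $nb = Normalizer::normalize($b, Normalizer::FORM_D);
    return strcmp($na, $nb);
});

print_r($words);
```

    The array still holds the original strings; only the comparison uses the decomposed copies.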

    Check the links I provided.

    EDIT: since you mention that your input data are clean (I assume they all fall within the "windows-1252" code page), you should do the following conversion: UTF-8 → Unicode → Windows-1252. Then sort the Windows-1252 encoded data with the "CP1252" locale selected.
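
    In PHP this could look roughly like the sketch below, assuming the mbstring extension. The locale names are assumptions: 'German_Germany.1252' is a Windows name, the other two are guesses for a Latin-1 locale on UNIX (Latin-1 and Windows-1252 agree on the German letters used here). The strings stay UTF-8 in the array; only temporary Windows-1252 copies are passed to strcoll().

```php
<?php
// Sketch: compare Windows-1252 copies of the strings with strcoll() while
// keeping the original UTF-8 values in the array. Requires mbstring.
$words = array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');

// Remember the current collation locale so it can be restored afterwards.
$previous = setlocale(LC_COLLATE, '0');

// Hypothetical candidate names; setlocale() returns false if none is installed.
$active = setlocale(LC_COLLATE, array('German_Germany.1252', 'de_DE.ISO-8859-1', 'de_DE.iso88591'));

if ($active !== false) {
    usort($words, function ($a, $b) {
        return strcoll(
            mb_convert_encoding($a, 'Windows-1252', 'UTF-8'),
            mb_convert_encoding($b, 'Windows-1252', 'UTF-8')
        );
    });
    setlocale(LC_COLLATE, $previous);
} else {
    echo "No single-byte German locale found; array left unsorted.\n";
}

print_r($words);
```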

  • 2020-11-27 04:53

    Using your example with code page 1252 worked perfectly fine here on my Windows development machine.

    $array=array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
    $oldLocal=setlocale(LC_COLLATE, "0");
    var_dump(setlocale(LC_COLLATE, 'German_Germany.1252'));
    usort($array, 'strcoll');
    var_dump(setlocale(LC_COLLATE, $oldLocal));
    var_dump($array);
    

    ...snip...

    This was with PHP 5.2.6, by the way.


    The above example is wrong: it uses ASCII encoding instead of UTF-8. I traced the strcoll() calls, and look what I found:

    function traceStrColl($a, $b) {
        $outValue = strcoll($a, $b);
        echo "$a $b $outValue\r\n";
        return $outValue;
    }
    
    $array=array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
    setlocale(LC_COLLATE, 'German_Germany.65001');
    usort($array, 'traceStrColl');
    print_r($array);
    

    gives:

    Ungetüme Äpfel 2147483647
    Ungetüme Birnen 2147483647
    Ungetüme Apfel 2147483647
    Ungetüme Ungetiere 2147483647
    Österreich Ungetüme 2147483647
    Äpfel Ungetiere 2147483647
    Äpfel Birnen 2147483647
    Apfel Äpfel 2147483647
    Ungetiere Birnen 2147483647

    I did find some bug reports that have been flagged as bogus. Your best bet is probably to file a bug report anyway.

  • 2020-11-27 04:58

    I found the following helper function, which converts all letters of a string to ASCII letters, very helpful here.

    function _all_letters_to_ASCII($string) {
      return strtr(utf8_decode($string), 
        utf8_decode('ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
        'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
    }
    

    After that a simple array_multisort() gives you what you want.

    $array = array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
    $reference_array = $array;
    
    foreach ($reference_array as $key => &$value) {
      $value = _all_letters_to_ASCII($value);
    }
    unset($value); // break the reference left over from the foreach loop
    var_dump($reference_array);
    
    array_multisort($reference_array, $array);
    var_dump($array);
    

    Of course you can make the helper function fit more advanced needs. But for now, it looks pretty good.

    array(6) {
      [0]=> string(6) "Birnen"
      [1]=> string(5) "Apfel"
      [2]=> string(8) "Ungetume"
      [3]=> string(5) "Apfel"
      [4]=> string(9) "Ungetiere"
      [5]=> string(10) "Osterreich"
    }
    
    array(6) {
      [0]=> string(5) "Apfel"
      [1]=> string(6) "Äpfel"
      [2]=> string(6) "Birnen"
      [3]=> string(11) "Österreich"
      [4]=> string(9) "Ungetiere"
      [5]=> string(9) "Ungetüme"
    }
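
    On current PHP versions the helper above needs some care: utf8_decode() is deprecated as of PHP 8.2, and it targets ISO-8859-1, which has no Š, Œ, Ž, š, œ, ž or Ÿ, so those characters turn into "?". A possible variant, sketched below, passes an array to strtr() so the UTF-8 strings never have to be decoded. The function name fold_to_ascii and the deliberately short map are my own illustration and cover only the sample data.

```php
<?php
// Sketch: the same folding idea without utf8_decode().
// strtr() with an array replaces whole substrings, so multibyte UTF-8 keys are safe.
// Hypothetical helper; extend the map for your own data.
function fold_to_ascii($string) {
    $map = array(
        'Ä' => 'A', 'Ö' => 'O', 'Ü' => 'U',
        'ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ß' => 's',
    );
    return strtr($string, $map);
}

$array = array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');

// Build the ASCII sort keys, then sort both arrays in step.
// Equal keys ("Apfel" twice) are ordered by the original strings.
$keys = array_map('fold_to_ascii', $array);
array_multisort($keys, $array);

print_r($array);
```

    For the sample data this should give the same order as the second dump above.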
    
  • 2020-11-27 05:00

    In the end, this problem cannot be solved in a simple way without recoding the strings (UTF-8 → Windows-1252 or ISO-8859-1), as suggested by ΤΖΩΤΖΙΟΥ, because of an obvious PHP bug discovered by Huppie. To summarize the problem, I created the following code snippet, which clearly demonstrates that the fault lies in the strcoll() function when the Windows UTF-8 code page 65001 is used.

    function traceStrColl($a, $b) {
        $outValue=strcoll($a, $b);
        echo "$a $b $outValue\r\n";
        return $outValue;
    }
    
    $locale=(defined('PHP_OS') && stristr(PHP_OS, 'win')) ? 'German_Germany.65001' : 'de_DE.utf8';
    
    $string="ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜabcdefghijklmnopqrstuvwxyzäöüß";
    $array=array();
    for ($i=0; $i<mb_strlen($string, 'UTF-8'); $i++) {
        $array[]=mb_substr($string, $i, 1, 'UTF-8');
    }
    $oldLocale=setlocale(LC_COLLATE, "0");
    var_dump(setlocale(LC_COLLATE, $locale));
    usort($array, 'traceStrColl');
    setlocale(LC_COLLATE, $oldLocale);
    var_dump($array);
    

    The result is:

    string(20) "German_Germany.65001"
    a B 2147483647
    [...]
    array(59) {
      [0]=>
      string(1) "c"
      [1]=>
      string(1) "B"
      [2]=>
      string(1) "s"
      [3]=>
      string(1) "C"
      [4]=>
      string(1) "k"
      [5]=>
      string(1) "D"
      [6]=>
      string(2) "ä"
      [7]=>
      string(1) "E"
      [8]=>
      string(1) "g"
      [...]
    

    The same snippet works on a Linux machine without any problems producing the following output:

    string(10) "de_DE.utf8"
    a B -1
    [...]
    array(59) {
      [0]=>
      string(1) "a"
      [1]=>
      string(1) "A"
      [2]=>
      string(2) "ä"
      [3]=>
      string(2) "Ä"
      [4]=>
      string(1) "b"
      [5]=>
      string(1) "B"
      [6]=>
      string(1) "c"
      [7]=>
      string(1) "C"
      [...]
    

    The snippet also works when using Windows-1252 (ISO-8859-1) encoded strings (of course the mb_* encodings and the locale must be changed then).

    I filed a bug report on bugs.php.net: Bug #46165 strcoll() does not work with UTF-8 strings on Windows. If you experience the same problem, you can give your feedback to the PHP team on the bug-report page (two other, probably related, bugs have been classified as bogus - I don't think that this bug is bogus ;-).

    Thanks to all of you.
