Sorting UTF-8 strings in RoR

暗喜 2020-12-09 18:12

I am trying to figure out a 'proper' way of sorting UTF-8 strings in Ruby on Rails.

In my application, I have a select box that is populated with countries. As my

7 Answers
  • 2020-12-09 18:21

    The only solution I have found thus far is to use ActiveSupport::Inflector.transliterate(string) to replace the Unicode characters with ASCII ones and then sort on the result:

    Country.all.sort_by do |country|
      ActiveSupport::Inflector.transliterate country.name
    end
    

    Now the only problem is that this equalizes "ä" with "a" (DIN 5007-1), so I end up with "Ägypten" before "Albanien", while I would expect it to be the other way around. Thankfully, the transliteration lets you configure how characters are replaced.

    See documentation: http://api.rubyonrails.org/classes/ActiveSupport/Inflector.html#method-i-transliterate
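
    As a rough sketch of what a different replacement table does to the sort key, here is a plain-Ruby version that does not use ActiveSupport. The constant UMLAUT_EXPANSIONS and the helper expanded_sort_key are hypothetical names invented for this illustration. The "ä" to "ae" expansion follows DIN 5007-2 and is only one possible choice, which may not be the ordering you want:

    ```ruby
    # Hypothetical replacement table (an assumption for illustration):
    # expand German umlauts the way DIN 5007-2 does.
    UMLAUT_EXPANSIONS = {
      "ä" => "ae", "ö" => "oe", "ü" => "ue",
      "Ä" => "Ae", "Ö" => "Oe", "Ü" => "Ue",
      "ß" => "ss"
    }.freeze

    # Build a sort key: apply the replacements, then downcase so that
    # the comparison ignores case.
    def expanded_sort_key(name)
      name.gsub(/[äöüÄÖÜß]/, UMLAUT_EXPANSIONS).downcase
    end

    countries = ["Österreich", "Albanien", "Ägypten", "Zypern", "Oman"]

    # The original name is a tie-breaker for names with identical keys.
    sorted = countries.sort_by { |name| [expanded_sort_key(name), name] }
    puts sorted.inspect
    # => ["Ägypten", "Albanien", "Österreich", "Oman", "Zypern"]
    ```

    Swapping in a different table changes the order without touching the sorting code.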

  • 2020-12-09 18:24

    The only working solution I have found so far (at least for Ruby 1.8, because Ruby 1.9 should handle Unicode better) is the Unicode library by Yoshida Masato, which provides a Unicode.strcmp method.

    EDIT: Sorry, this solution uses NFD decomposition as well with all its limitations.
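
    To illustrate the NFD approach and its limits, here is a minimal sketch. It uses Ruby's built-in String#unicode_normalize (available since Ruby 2.2) rather than the Unicode library mentioned above, so treat it as an approximation of the idea and not as that library's implementation. The helper name nfd_sort_key is made up for this example:

    ```ruby
    # Sort key: decompose to NFD, then drop the combining marks
    # (Unicode general category Mn), leaving only the base letters.
    def nfd_sort_key(str)
      str.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
    end

    words = %w[e à a]
    puts words.sort_by { |w| [nfd_sort_key(w), w] }.inspect
    # => ["a", "à", "e"]

    # Limitation: letters without a canonical decomposition are left
    # untouched, so they still sort by their raw byte values.
    puts nfd_sort_key("ø").inspect  # => "ø"
    puts nfd_sort_key("ß").inspect  # => "ß"
    ```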

  • 2020-12-09 18:31

    http://github.com/grosser/sort_alphabetical

    This gem should help. It adds sort_alphabetical and sort_alphabetical_by methods to Enumerable.

  • 2020-12-09 18:32

    Ruby performs string comparisons based on the byte values of characters:

    %w[à a e].sort
    # => ["a", "e", "à"]
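
    A quick way to see why "à" ends up last is to print the bytes of each string. This is only an illustrative sketch and assumes the strings are UTF-8 encoded with "à" in its precomposed form:

    ```ruby
    # Print the UTF-8 bytes that String#<=> actually compares.
    %w[à a e].each do |char|
      puts "#{char}: #{char.bytes.inspect}"
    end
    # à: [195, 160]
    # a: [97]
    # e: [101]

    # 101 < 195, so "e" sorts before "à".
    puts ("e" <=> "à")  # => -1
    ```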
    

    To properly collate strings according to locale, the ffi-icu gem could be used:

    require "ffi-icu"
    
    ICU::Collation.collate("it_IT", %w[à a e])
    # => ["a", "à", "e"]
    
    ICU::Collation.collate("de", %w[a s x ß])
    # => ["a", "s", "ß", "x"]
    

    As an alternative:

    collator = ICU::Collation::Collator.new("it_IT")
    %w[à a e].sort { |a, b| collator.compare(a, b) }
    # => ["a", "à", "e"]
    

    Update: To test how strings should collate according to locale rules, the ICU project provides a nice tool.

  • 2020-12-09 18:40

    What you are trying to do is a very messy proposition. There is no way to do transparent transliteration on all Unicode characters, because the meaning of digraphs changes from locale to locale and strings can grow huge (if, say, you replace 10 Chinese symbols with their phonetic equivalents). Don't go there.

    Why do you want transliterated names in the first place? For URLs? Browsers handle Unicode URLs decently now, so you are inventing a huge problem out of thin air. If you need IDs, preprocess your lists to include a stable numeric ID per country and use that as an identifier. Or save the English name of the country as the identifier (you can download locale-aware ISO country lists for free).

    If you truly want good transliteration for Unicode (and this is not what you want in this case), see the IBM ICU libraries; there is a dormant gem for them.

  • 2020-12-09 18:41

    There are a couple of ways to go. You may want to convert the UTF-8 strings to fixed-width hex strings of their code points and then sort them (note that this gives code point order, not locale-aware order):

    s.unpack('U*').map { |cp| format('%06x', cp) }.join
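
    A minimal sketch of sorting with such a key follows. The helper name codepoint_key is hypothetical. Note that for valid UTF-8 strings the result is the same as a plain sort, because UTF-8 byte order and code point order agree:

    ```ruby
    # Build a fixed-width hex key from the code points of a string.
    # Six hex digits are enough for any Unicode code point (max 0x10FFFF).
    def codepoint_key(str)
      str.unpack("U*").map { |cp| format("%06x", cp) }.join
    end

    words = %w[à a e]
    puts codepoint_key("à")  # => 0000e0
    puts words.sort_by { |w| codepoint_key(w) }.inspect
    # => ["a", "e", "à"]
    ```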
    

    or you may use the iconv library (removed from the standard library in Ruby 2.0, but still available as a gem). Read up on it and use it as appropriate (from dzone):

    # Add this to environment.rb.
    # Call to_iso on any UTF-8 string to get an ISO-8859-1 string back.
    # Example: "Cédez le passage aux français".to_iso

    require 'iconv' # this line is not needed in Rails

    class String
      # Convert this UTF-8 string to ISO-8859-1.
      def to_iso
        Iconv.conv('ISO-8859-1', 'utf-8', self)
      end
    end
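
    On Ruby versions without iconv, the same conversion can be sketched with the built-in String#encode. This is an alternative I am swapping in, not the code from dzone. Here to_iso is written as a plain method instead of a String extension, and the choice to replace unsupported characters with "?" is an assumption for this example:

    ```ruby
    # Convert a UTF-8 string to ISO-8859-1 with String#encode.
    # Characters that ISO-8859-1 cannot represent are replaced with "?";
    # without the undef: option, encode raises
    # Encoding::UndefinedConversionError for them.
    def to_iso(str)
      str.encode("ISO-8859-1", undef: :replace, replace: "?")
    end

    iso = to_iso("Cédez le passage aux français")
    puts iso.encoding  # => ISO-8859-1
    puts iso.bytesize  # => 29

    # The converted strings are still compared byte by byte,
    # so "à" (0xE0 in ISO-8859-1) keeps sorting after "e".
    sorted = %w[à a e].map { |w| to_iso(w) }.sort.map { |w| w.encode("UTF-8") }
    puts sorted.inspect
    # => ["a", "e", "à"]
    ```

    So the conversion changes the bytes, but by itself it does not give a locale-aware order.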
    