How can I convert a string from windows-1252 to utf-8 in Ruby?

日久生厌 2020-12-03 12:04

I'm migrating some data from MS Access 2003 to MySQL 5.0 using Ruby 1.8.6 on Windows XP (writing a Rake task to do this).

Turns out the Windows string data is encod

5 Answers
  • 2020-12-03 12:36

    If you're NOT on Ruby 1.9, and assuming yhager's command works, you could try

    require 'fileutils'

    # Write the raw bytes unchanged; 'wb' avoids newline translation on Windows.
    File.open('/tmp/w1252', 'wb') do |file|
      file.write(my_windows_1252_string)
    end

    `iconv -f windows-1252 -t utf-8 /tmp/w1252 > /tmp/utf8`

    my_utf_8_string = File.read('/tmp/utf8')

    # Remove both temporary files.
    FileUtils.rm(['/tmp/w1252', '/tmp/utf8'])
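
    If the temporary files are a nuisance, the same iconv call can in principle be fed through a pipe. The sketch below is only an illustration: windows_1252_to_utf8_via_iconv is a made-up helper name, it assumes an iconv executable is on the PATH, and because it writes all of its input before reading any output it is only suitable for short strings.

    ```ruby
    # Hypothetical helper (not from the answers here): pipes a Windows-1252
    # byte string through the iconv command-line tool and returns UTF-8 bytes.
    # Assumes an `iconv` executable is on the PATH.
    def windows_1252_to_utf8_via_iconv(bytes)
      IO.popen('iconv -f windows-1252 -t utf-8', 'r+') do |pipe|
        pipe.binmode          # no newline or encoding translation on the pipe
        pipe.write(bytes)
        pipe.close_write      # end of input, so iconv can finish and flush
        pipe.read
      end
    end
    ```

    It would be called as windows_1252_to_utf8_via_iconv(my_windows_1252_string).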
    
  • 2020-12-03 12:45

    For Ruby 1.8.6, it appears you can use Ruby Iconv, part of the standard library:

    Iconv documentation

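    According to this helpful article, you can at least strip byte sequences that are not valid UTF-8 (which is how stray win-1252 characters show up) from your string like so:

    require 'iconv'
    ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')  # arguments are (to, from)
    # Padding with a space, then trimming it with [0..-2], guards against an invalid byte at the very end of the input.
    valid_string = ic.iconv(untrusted_string + ' ')[0..-2]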
    

    One might then attempt to do a full conversion like so:

    ic = Iconv.new('UTF-8', 'WINDOWS-1252')
    valid_string = ic.iconv(untrusted_string + ' ')[0..-2]
    
  • 2020-12-03 12:45

    If you're on Ruby 1.9...

    string_in_windows_1252 = database.get(...)
    # => "Fåbulous"

    string_in_windows_1252.encoding
    # => #<Encoding:Windows-1252>

    string_in_utf_8 = string_in_windows_1252.encode('UTF-8')
    # => "Fåbulous"

    string_in_utf_8.encoding
    # => #<Encoding:UTF-8>
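
    The example above relies on the string already being tagged as Windows-1252. If the database driver hands back raw bytes with a different or binary tag (an assumption about your setup, not something stated in this thread), the source encoding can be named explicitly. A minimal sketch for Ruby 1.9 or later, with made-up variable names:

    ```ruby
    # Raw Windows-1252 bytes for "Fåbulous" (0xE5 is "å"), tagged as binary
    # to mimic data whose encoding label cannot be trusted (an assumption).
    raw_bytes = [0x46, 0xE5, 0x62, 0x75, 0x6C, 0x6F, 0x75, 0x73].pack('C*')

    # Option 1: name the source encoding in the call: encode(to, from).
    utf8_a = raw_bytes.encode('UTF-8', 'Windows-1252')

    # Option 2: relabel a copy of the bytes first, then convert.
    utf8_b = raw_bytes.dup.force_encoding('Windows-1252').encode('UTF-8')

    puts utf8_a            # the text, with "å" intact
    puts utf8_a.encoding   # UTF-8
    puts utf8_a == utf8_b  # true
    ```

    Note that force_encoding only changes the label on the bytes, while encode actually rewrites them.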
    
  • 2020-12-03 12:52

    If you want to convert a file named win1252_file on a Unix-like OS, run:

    $ iconv -f windows-1252 -t utf-8 win1252_file > utf8_file
    

    The same should probably work on Windows under Cygwin.

  • 2020-12-03 12:53

    Hi,

    I had the exact same problem.

    These tips helped me get going:

    Always check for the proper encoding name in order to feed your conversion tools correctly. If in doubt, you can get a list of supported encodings for iconv or recode using:

    $ recode -l
    

    or

    $ iconv -l
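
    The same check can be made on the Ruby side. As a sketch, assuming Ruby 1.9 or later (where the Encoding class exists), this lists the names Ruby accepts for this code page:

    ```ruby
    # Every encoding name or alias Ruby knows that mentions "1252".
    candidates = Encoding.name_list.grep(/1252/i)
    puts candidates

    # Encoding.find raises ArgumentError for an unknown name, so a typo
    # surfaces here rather than in the middle of a conversion.
    encoding = Encoding.find('windows-1252')
    puts encoding.name
    ```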
    

    Always start from your original file and encode a sample to work with:

    $ recode windows-1252..u8 < original.txt > sample_utf8.txt
    

    or

    $ iconv -f windows-1252 -t utf8 original.txt -o sample_utf8.txt
    

    Install Ruby 1.9, because it helps you A LOT when it comes to encodings. Even if you don't use it in your program, you can always start an irb1.9 session and poke at the strings to see what the output is.

    In Ruby 1.9, the 'mode' parameter of File.open accepts encoding names. Use it! This article helped a lot: http://blog.nuclearsquid.com/writings/ruby-1-9-encodings

    File.open('original.txt', 'r:windows-1252:utf-8')
    # The mode string spells out all encoding options: r:windows-1252 reads the file as windows-1252, and :utf-8 transcodes it to utf-8 internally.
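
    Putting the mode string to work, a whole file can be converted from Ruby itself. This is a sketch rather than a tested migration script: convert_windows_1252_file is a made-up helper name, and the file names in the usage comment are simply the ones from the shell examples.

    ```ruby
    # Hypothetical helper: copies a Windows-1252 text file to a UTF-8 one.
    def convert_windows_1252_file(source_path, target_path)
      # 'r:windows-1252:utf-8' = external (on-disk) encoding : internal encoding,
      # so every line arrives in Ruby already transcoded to UTF-8.
      File.open(source_path, 'r:windows-1252:utf-8') do |input|
        File.open(target_path, 'w:utf-8') do |output|
          input.each_line { |line| output.write(line) }
        end
      end
    end

    # Usage (assumed file names):
    # convert_windows_1252_file('original.txt', 'sample_utf8.txt')
    ```

    The block form of File.open closes both files even if the conversion raises an error part-way through.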
    

    Have fun and swear a lot!
