Convert non-ASCII chars from ASCII-8BIT to UTF-8

鱼传尺愫 2021-02-01 12:46

I'm pulling text from remote sites and trying to load it into a Ruby 1.9/Rails 3 app that uses UTF-8 by default.

Here is an example of some offending text:



        
4 Answers
  •  感动是毒
    2021-02-01 13:25

    I've been having issues with character encoding, and the other answers were helpful but didn't work for every case. Here's the solution I came up with: it forces the encoding when possible and, when that isn't possible, transcodes and replaces the offending bytes with '?':

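      # Returns a UTF-8 copy of str; the caller's string is left untouched.
      def encode(str)
        # force_encoding mutates its receiver, so relabel a copy instead.
        encoded = str.dup.force_encoding('UTF-8')
        unless encoded.valid_encoding?
          # The bytes are not valid UTF-8: replace each bad sequence with '?'.
          encoded = encoded.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
        end
        encoded
      end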
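
To illustrate why the helper tries force_encoding before transcoding, here is a minimal sketch. The sample string and variable names are made up for illustration, and it assumes the remote text arrives as well-formed UTF-8 bytes that are merely tagged ASCII-8BIT:

```ruby
# Bytes of "café" in UTF-8, but tagged as binary (ASCII-8BIT).
# This is an assumed stand-in for a raw read from a remote site.
raw = "caf\xC3\xA9".dup.force_encoding(Encoding::ASCII_8BIT)

# Relabeling keeps the bytes and only changes the encoding tag,
# so the two-byte "é" survives intact.
relabeled = raw.dup.force_encoding(Encoding::UTF_8)
puts relabeled.valid_encoding?  # prints true
puts relabeled                  # prints café

# Transcoding from binary treats every byte >= 0x80 as undefined,
# so each of the two bytes of "é" is replaced separately.
transcoded = raw.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
puts transcoded                 # prints caf??
```

Relabeling is lossless when the bytes really are UTF-8, which is why the replacement path is best kept as a fallback.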
    

    force_encoding works most of the time, but I've encountered some strings where it fails because the bytes are not valid UTF-8. A string like this one will have its invalid bytes replaced:

     str = "don't panic: \xD3"
     str.valid_encoding?   # => false
     str = str.encode("utf-8", invalid: :replace, undef: :replace, replace: '?')
     # => "don't panic: ?"
     str.valid_encoding?   # => true
    

    Update: I have had some issues in production with the above code. I recommend setting up unit tests with known problem text to make sure this code behaves the way you need it to. Once I come up with version 2, I'll update this answer.
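
As a starting point for such tests, here is a rough sketch using Minitest. It assumes Minitest 5 (on Ruby 1.9 the base class would be MiniTest::Unit::TestCase instead). The helper is copied under the hypothetical name to_utf8 so the file runs on its own, and the sample strings are illustrative rather than taken from real production data:

```ruby
require 'minitest/autorun'

# Hypothetical stand-in for the helper from the answer, renamed for this sketch.
def to_utf8(str)
  encoded = str.dup.force_encoding('UTF-8')
  return encoded if encoded.valid_encoding?
  encoded.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
end

class ToUtf8Test < Minitest::Test
  def test_valid_utf8_bytes_tagged_binary_are_relabeled
    raw = "caf\xC3\xA9".dup.force_encoding(Encoding::ASCII_8BIT)
    assert_equal "café", to_utf8(raw)
  end

  def test_invalid_byte_is_replaced
    result = to_utf8("don't panic: \xD3")
    assert_equal "don't panic: ?", result
    assert result.valid_encoding?
  end

  def test_caller_string_is_not_mutated
    raw = "caf\xC3\xA9".dup.force_encoding(Encoding::ASCII_8BIT)
    to_utf8(raw)
    assert_equal Encoding::ASCII_8BIT, raw.encoding
  end
end
```

Adding one test per problem string seen in production keeps regressions visible when the helper changes.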
