How to convert character encoding with Ruby 1.9

女生的网名这么多〃 submitted on 2019-11-29 13:06:02

Question


I am currently having trouble with results from the Amazon API.

The service returns a string with Unicode characters: Learn Objective\xE2\x80\x93C on the Mac (Learn Series)

With Ruby 1.9.1, the string cannot even be processed:

REXML::ParseException: #<Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)>

...

Exception parsing

Line: 1

Position: 1636

Last 80 unconsumed characters:

Learn Objective–C on the Mac (Learn Series)

Answer 1:


As the exception points out, your string is ASCII-8BIT encoded, so you need to change its encoding. There is a long story behind that, but if you just want a quick solution, call force_encoding on the string before you do any processing:

s = "Learn Objective\xE2\x80\x93C on the Mac"
# => "Learn Objective\xE2\x80\x93C on the Mac"
s.encoding
# => #<Encoding:ASCII-8BIT>
s.force_encoding 'utf-8'
# => "Learn Objective–C on the Mac"
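Note that force_encoding only changes the encoding label on the string; it does not transcode any bytes. A minimal sketch of that behaviour follows, assuming the bytes coming back from the service really are UTF-8. The variable names (raw, text) are illustrative and not part of the original question.

```ruby
# Bytes as they might arrive from the service, tagged as binary (ASCII-8BIT).
raw = "Learn Objective\xE2\x80\x93C on the Mac".dup.force_encoding(Encoding::BINARY)

# force_encoding relabels the string; the underlying bytes are untouched.
text = raw.dup.force_encoding(Encoding::UTF_8)

# valid_encoding? reports whether the bytes are well-formed under the new label.
puts text.encoding
puts text.valid_encoding?
puts text.bytes == raw.bytes
```

If valid_encoding? returns false, the bytes were not UTF-8 after all, and relabelling alone will not help.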



Answer 2:


Mladen's solution works if everything that is encoded in ASCII-8BIT can actually be converted directly to UTF-8. It breaks when there are characters that are 1) invalid, or 2) undefined in UTF-8. However, this will work (in 1.9.2 and up):

new_str = s.encode('utf-8', 'binary', :invalid => :replace, 
  :undef => :replace, :replace => '')

ASCII-8BIT is effectively binary. This code converts the encoding to UTF-8, while properly dealing with invalid and undefined characters. The :invalid option specifies that invalid characters be replaced. The :undef option specifies that undefined characters be replaced. And the :replace option defines what the invalid or undefined characters should be replaced with. In this case, I opted to simply remove them.
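One consequence worth seeing in practice: when the source encoding is declared as binary, every byte above 0x7F has no defined mapping to UTF-8, so the call above removes multi-byte characters such as the en dash along with genuinely bad bytes. The sketch below compares that with relabelling followed by String#scrub, which is an assumption on my part since scrub only exists in Ruby 2.1 and later, not in 1.9. The sample string with a stray \xFF byte is made up for illustration.

```ruby
# Raw bytes tagged as binary: a UTF-8 en dash (E2 80 93) plus one stray byte (FF).
raw = "Learn Objective\xE2\x80\x93C\xFF".dup.force_encoding(Encoding::BINARY)

# The call from the answer: bytes with no binary-to-UTF-8 mapping become ''.
stripped = raw.encode('utf-8', 'binary', :invalid => :replace,
  :undef => :replace, :replace => '')

# Alternative (assumes Ruby 2.1+): relabel as UTF-8, then drop only invalid sequences.
scrubbed = raw.dup.force_encoding(Encoding::UTF_8).scrub('')

puts stripped
puts scrubbed
```

Here stripped loses the en dash as well as the stray byte, while scrubbed keeps the en dash and drops only the invalid \xFF. Which behaviour is preferable depends on whether the input is expected to be mostly valid UTF-8.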



Source: https://stackoverflow.com/questions/3159742/how-to-convert-character-encoding-with-ruby-1-9
