Character Encoding issue in Rails v3/Ruby 1.9.2

Submitted by 蓝咒 on 2019-12-28 04:06:25

Question


I sometimes get the error "invalid byte sequence in UTF-8" when I read the contents of a file. Note: this only happens when there are special characters in the string. I have tried opening the file without "r:UTF-8", but I still get the same error.

open(file, "r:UTF-8").each_line { |line| puts line.strip(",") } # line.strip generates the error

Contents of the file:

# encoding: UTF-8
290919,"SE","26","Sk‰l","",59.4500,17.9500,, # this errors out
290956,"CZ","45","HornÌ Bradlo","",49.8000,15.7500,, # this errors out
290958,"NO","02","Svaland","",58.4000,8.0500,, # this works

This is a CSV file I received from an outside source, and I am trying to import it into my DB. It did not come with "# encoding: UTF-8" at the top; I added that line because I read somewhere that it would fix this problem, but it did not. :(

Environment:

  • Rails v3.0.3
  • ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.5.0]

Answer 1:


Ruby has a notion of an external encoding and an internal encoding for each file. This allows you to work with a file in UTF-8 in your source, even when the file is stored in a more esoteric format. If your default external encoding is UTF-8 (which it typically is on Mac OS X), all of your file I/O is going to be in UTF-8 as well. You can check this using File.open('file').external_encoding. When you open your file and pass "r:UTF-8", you are forcing the same external encoding that Ruby already uses by default.
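To make the external/internal distinction concrete, here is a minimal sketch that opens an empty temp file in two modes and reports the encodings Ruby attaches to it. The helper name encodings_for and the choice of ISO-8859-1 are assumptions made for illustration only.

```ruby
require 'tempfile'

# Returns [external, internal] encodings for a file opened with `mode`.
# `encodings_for` is a hypothetical helper name used only in this sketch.
def encodings_for(path, mode)
  File.open(path, mode) { |f| [f.external_encoding, f.internal_encoding] }
end

tmp = Tempfile.new('enc-demo')
tmp.close

# "r:UTF-8" only names the external encoding (how the bytes on disk are read).
p encodings_for(tmp.path, 'r:UTF-8').first

# "r:EXT:INT" names both: bytes are read as EXT and transcoded to INT.
p encodings_for(tmp.path, 'r:ISO-8859-1:UTF-8')

tmp.unlink
```

With the second mode, strings returned by the file are already transcoded to UTF-8, so the rest of the program never sees the original byte layout.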

Chances are, your source document isn't in UTF-8 and those non-ASCII characters aren't mapping cleanly to UTF-8 (if they mapped cleanly, you would get the correct characters and no error; if the bytes were valid UTF-8 but mapped incorrectly, you would get incorrect characters and still no error). What you should do is try to determine the encoding of the source document, then have Ruby transcode the document on read, like so:

File.open(file, "r:windows-1251:utf-8").each_line { |line| puts line.strip(",") }
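As a self-contained sketch of the same idea, the snippet below writes a small file whose bytes are deliberately not UTF-8, shows that reading it as UTF-8 yields an invalid string, and then reads it again with transcoding. The helper name read_lines_as_utf8, the sample row, and the assumption that the source is ISO-8859-1 are all illustrative; the real file's encoding has to be determined first.

```ruby
require 'tempfile'

# Reads `path`, interpreting its bytes as `source_encoding`, and returns the
# lines transcoded to UTF-8. `read_lines_as_utf8` is a hypothetical name.
def read_lines_as_utf8(path, source_encoding)
  File.open(path, "r:#{source_encoding}:UTF-8") do |f|
    f.each_line.map { |line| line.chomp }
  end
end

# Made-up sample row stored as ISO-8859-1: 0xE4 is a single-byte character there.
sample = Tempfile.new('latin1-sample')
sample.binmode
sample.write("290919,\"SE\",\"26\",\"Sk\xE4l\"\n".dup.force_encoding(Encoding::BINARY))
sample.close

# Read as UTF-8: the lone 0xE4 byte is not valid UTF-8.
raw = File.open(sample.path, 'r:UTF-8') { |f| f.read }
puts raw.valid_encoding?    # false

# Read with transcoding: the string is valid UTF-8 and safe to split.
# (split(',') is used here; String#strip takes no arguments.)
line = read_lines_as_utf8(sample.path, 'ISO-8859-1').first
puts line.valid_encoding?   # true
puts line.split(',').inspect

sample.unlink
```

Note that a single-byte encoding such as ISO-8859-1 accepts any byte, so transcoding from the wrong one raises no error and simply produces the wrong characters. It is worth eyeballing a few rows after the import.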

If you need help determining the encoding of the source, give this Python library a whirl. It's based on the automatic charset detection fallback that was in Seamonkey/Mozilla (and is possibly still in Firefox).
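Before reaching for a detector, it can help to confirm which lines are actually not valid UTF-8. The following is a minimal stdlib-only sketch; it does not detect the real encoding, it only locates the offending lines. The helper name invalid_utf8_lines and the sample data are made up for illustration.

```ruby
require 'tempfile'

# Returns the 1-based numbers of the lines in `path` that are not valid UTF-8.
# `invalid_utf8_lines` is a hypothetical helper name for this sketch.
def invalid_utf8_lines(path)
  bad = []
  File.open(path, 'rb') do |f|   # binary mode: no decoding is attempted on read
    f.each_line.with_index(1) do |line, number|
      bad << number unless line.force_encoding(Encoding::UTF_8).valid_encoding?
    end
  end
  bad
end

# Made-up sample: line 2 contains the single byte 0xE4, which is not valid UTF-8.
sample = Tempfile.new('utf8-check')
sample.binmode
sample.write("plain ascii\nSk\xE4l\nmore ascii\n".dup.force_encoding(Encoding::BINARY))
sample.close

p invalid_utf8_lines(sample.path)   # prints [2]
sample.unlink
```

If this returns an empty list for the whole file, the file is valid UTF-8 and the problem lies elsewhere.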




Answer 2:


If you want to change your file's encoding, you can use the charlock_holmes gem:

https://github.com/brianmario/charlock_holmes

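require 'charlock_holmes/string'

content = File.read('test2.txt')
# is_utf8? is provided by ActiveSupport (loaded in a Rails app), not by charlock_holmes.
utf8_encoded_content = content   # already UTF-8: use it unchanged
unless content.is_utf8?
  # detect returns a hash; :encoding holds the guessed encoding name
  detection = CharlockHolmes::EncodingDetector.detect(content)
  utf8_encoded_content = CharlockHolmes::Converter.convert content, detection[:encoding], 'UTF-8'
end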

Then you can save the new content to a temp file and overwrite your original file.
Hope this helps.
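The "save to a temp file, then overwrite" step could be sketched as follows, using only the standard library. The helper name overwrite_with_utf8 and the file name data.csv are assumptions for illustration; the content passed in is expected to be the already converted UTF-8 string.

```ruby
require 'tempfile'
require 'fileutils'
require 'tmpdir'

# Writes `utf8_content` to a temp file next to `path`, then moves it over
# `path`. `overwrite_with_utf8` is a hypothetical helper name.
# Note: the replaced file takes on the temp file's permissions.
def overwrite_with_utf8(path, utf8_content)
  tmp = Tempfile.new('utf8-rewrite', File.dirname(path))
  begin
    tmp.binmode                 # write the string's bytes exactly as they are
    tmp.write(utf8_content)
    tmp.close
    FileUtils.mv(tmp.path, path)
  ensure
    tmp.close!                  # cleans up the temp file if the move did not happen
  end
end

# Usage with made-up data in a scratch directory.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'data.csv')
  File.open(path, 'wb') { |f| f.write("old bytes\n") }
  overwrite_with_utf8(path, "290958,\"NO\",\"02\",\"Svaland\"\n")
  puts File.read(path)
end
```

Creating the temp file in the same directory keeps the final move on one filesystem, so the original is only replaced once the new content has been written in full.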



Source: https://stackoverflow.com/questions/4697413/character-encoding-issue-in-rails-v3-ruby-1-9-2
