Question
I have a Ruby file with these contents:
# encoding: iso-8859-1
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'}
puts File.read('foo.txt').encoding
- When I run it from the Windows command prompt with Ruby 1.9.3, I get: IBM437
- When I run it from Cygwin with Ruby 1.9.3, I get: UTF-8
- What I expect to get is: iso-8859-1
Can someone explain what's happening here?
UPDATE
Here's a better description of what I'm looking for:
- I understand now, thanks to Darshan, that by default Ruby will read files in Encoding.default_external, but shouldn't the # encoding: iso-8859-1 line override that?
- Should Ruby be able to auto-detect a file's encoding? Is there any filesystem where the encoding is stored as an attribute?
- What is my best option to 'remember' the encoding I saved the file in?
Answer 1:
You're not specifying the encoding when you read the file. You're being very careful to specify it everywhere except there, but then you're reading it with the default encoding.
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'.force_encoding('iso-8859-1')}
File.open('foo.txt', "r:iso-8859-1") {|f| puts f.read().encoding }
# => ISO-8859-1
Also note that you probably mean 'fòo'.encode('iso-8859-1') rather than 'fòo'.force_encoding('iso-8859-1'). The latter leaves the bytes unchanged, while the former transcodes the string.
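To make that difference concrete, here is a minimal sketch (assuming this snippet is itself saved as UTF-8):
# encoding: utf-8
s = 'fòo'
puts s.encode('iso-8859-1').bytes.to_a.inspect          # => [102, 242, 111]   (ò transcoded to the single byte 0xF2)
puts s.force_encoding('iso-8859-1').bytes.to_a.inspect  # => [102, 195, 178, 111]   (the original UTF-8 bytes, merely relabelled)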
Update: I'll elaborate a bit since I wasn't as clear or thorough as I could have been.
If you don't specify an encoding with File.read(), the file will be read with Encoding.default_external. Since you're not setting that yourself, Ruby uses a value that depends on the environment it's run in. In your Windows environment it's IBM437; in your Cygwin environment it's UTF-8. So my point above was that of course that's what the encoding is; it has to be, and it has nothing to do with what bytes are contained in the file. Ruby doesn't auto-detect encodings for you.
force_encoding() doesn't change the bytes in a string; it only changes the Encoding attached to those bytes. If you tell Ruby "pretend this string is ISO-8859-1", then it won't transcode it when you tell it "please write this string as ISO-8859-1". encode() transcodes for you, as does writing to the file, unless you trick it into not doing so.
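You can see which default your environment gives you with a quick sketch like this (the first line prints IBM437 under cmd.exe and UTF-8 under Cygwin):
puts Encoding.default_external                               # whatever the environment dictates
puts File.read('foo.txt').encoding                           # follows default_external, regardless of the file's bytes
puts File.read('foo.txt', encoding: 'iso-8859-1').encoding   # => ISO-8859-1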
Putting those together, if you have a source file in ISO-8859-1:
# encoding: iso-8859-1
# Write in ISO-8859-1 regardless of default_external
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'}
# Read in ISO-8859-1 regardless of default_external,
# transcoding if necessary to default_internal, if set
File.open('foo.txt', "r:iso-8859-1") {|f| puts f.read().encoding } # => ISO-8859-1
puts File.read('foo.txt').encoding # -> Whatever is specified by default_external
If you have a source file in UTF-8:
# encoding: utf-8
# Write in ISO-8859-1 regardless of default_external, transcoding from UTF-8
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'}
# Read in ISO-8859-1 regardless of default_external,
# transcoding if necessary to default_internal, if set
File.open('foo.txt', "r:iso-8859-1") {|f| puts f.read().encoding } # => ISO-8859-1
puts File.read('foo.txt').encoding # -> Whatever is specified by default_external
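In both cases the bytes that end up on disk are the same; a quick check, assuming foo.txt was written as above:
# Inspect the raw bytes without attaching any encoding
puts File.binread('foo.txt').bytes.to_a.inspect   # => [102, 242, 111]   (0xF2 is ò in ISO-8859-1)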
Update 2, to answer your new questions:
No, the # encoding: iso-8859-1 line does not change Encoding.default_external; it only tells Ruby that the source file itself is encoded in ISO-8859-1. Simply add Encoding.default_external = 'iso-8859-1' if you expect all files that you read to be stored in that encoding.
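For example (a minimal sketch; note that this changes the default for every subsequent read in the process):
Encoding.default_external = 'iso-8859-1'
puts File.read('foo.txt').encoding   # => ISO-8859-1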
No, I don't personally think Ruby should auto-detect encodings, but reasonable people can disagree on that one, and a discussion of "should it be so" seems off-topic here.
Personally, I use UTF-8 for everything, and in the rare circumstances that I can't control encoding, I manually set the encoding when I read the file, as demonstrated above. My source files are always in UTF-8. If you're dealing with files that you can't control and don't know the encoding of, the charguess gem or similar would be useful.
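If you can't control the file but do learn its encoding some other way, one approach (a sketch, not tied to any particular gem) is to read the raw bytes, tag them yourself, and transcode to whatever you use internally:
raw = File.binread('foo.txt')            # binary (ASCII-8BIT) string, bytes untouched
str = raw.force_encoding('iso-8859-1')   # same bytes, now labelled ISO-8859-1
puts str.encode('utf-8')                 # transcode for a UTF-8 terminal or internal use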
Source: https://stackoverflow.com/questions/11806512/ruby-1-9-wrong-file-encoding-on-windows