Question
I have a Ruby file with these contents:
# encoding: iso-8859-1
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'}
puts File.read('foo.txt').encoding
- When I run it from the Windows command prompt with Ruby 1.9.3, I get: IBM437
- When I run it from Cygwin with Ruby 1.9.3, I get: UTF-8
- What I expect to get is: iso-8859-1
Can someone explain what's happening here?
UPDATE
Here's a better description of what I'm looking for:
- I understand now, thanks to Darshan, that by default Ruby will read files in Encoding.default_external, but shouldn't the # encoding: iso-8859-1 line override that?
- Should Ruby be able to auto-detect a file's encoding? Is there any filesystem where the encoding is stored as an attribute?
- What is my best option to 'remember' the encoding I saved the file in?
Answer 1:
You're not specifying the encoding when you read the file. You're being very careful to specify it everywhere except there, but then you're reading it with the default encoding.
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'.force_encoding('iso-8859-1')}
File.open('foo.txt', "r:iso-8859-1") {|f| puts f.read().encoding }
# => ISO-8859-1
Also note that you probably mean 'fòo'.encode('iso-8859-1') rather than 'fòo'.force_encoding('iso-8859-1'). The latter leaves the bytes unchanged, while the former transcodes the string.
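To make that difference concrete, here is a minimal sketch (assuming this snippet is itself saved as UTF-8):
# encoding: utf-8
s = 'fòo'
puts s.encode('iso-8859-1').bytes.to_a.inspect          # => [102, 242, 111]   (ò transcoded to the single byte 0xF2)
puts s.force_encoding('iso-8859-1').bytes.to_a.inspect  # => [102, 195, 178, 111]   (the original UTF-8 bytes, merely relabelled)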
Update: I'll elaborate a bit since I wasn't as clear or thorough as I could have been.
If you don't specify an encoding with File.read(), the file will be read with Encoding.default_external. Since you're not setting that yourself, Ruby uses a value that depends on the environment it's run in. In your Windows environment it's IBM437; in your Cygwin environment it's UTF-8. So my point above was that of course that's what the encoding is; it has to be, and it has nothing to do with what bytes are contained in the file. Ruby doesn't auto-detect encodings for you.
force_encoding() doesn't change the bytes in a string; it only changes the Encoding attached to those bytes. If you tell Ruby "pretend this string is ISO-8859-1", then it won't transcode it when you tell it "please write this string as ISO-8859-1". encode() transcodes for you, as does writing to the file, unless you trick it into not doing so.
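You can see which default your environment gives you with a quick sketch like this (the first line prints IBM437 under cmd.exe and UTF-8 under Cygwin):
puts Encoding.default_external                               # whatever the environment dictates
puts File.read('foo.txt').encoding                           # follows default_external, regardless of the file's bytes
puts File.read('foo.txt', encoding: 'iso-8859-1').encoding   # => ISO-8859-1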
Putting those together, if you have a source file in ISO-8859-1:
# encoding: iso-8859-1
# Write in ISO-8859-1 regardless of default_external
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'}
# Read in ISO-8859-1 regardless of default_external,
# transcoding if necessary to default_internal, if set
File.open('foo.txt', "r:iso-8859-1") {|f| puts f.read().encoding } # => ISO-8859-1
puts File.read('foo.txt').encoding # -> Whatever is specified by default_external
If you have a source file in UTF-8:
# encoding: utf-8
# Write in ISO-8859-1 regardless of default_external, transcoding from UTF-8
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'}
# Read in ISO-8859-1 regardless of default_external,
# transcoding if necessary to default_internal, if set
File.open('foo.txt', "r:iso-8859-1") {|f| puts f.read().encoding } # => ISO-8859-1
puts File.read('foo.txt').encoding # -> Whatever is specified by default_external
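In both cases the bytes that end up on disk are the same; a quick check, assuming foo.txt was written as above:
# Inspect the raw bytes without attaching any encoding
puts File.binread('foo.txt').bytes.to_a.inspect   # => [102, 242, 111]   (0xF2 is ò in ISO-8859-1)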
Update 2, to answer your new questions:
No, the # encoding: iso-8859-1 line does not change Encoding.default_external; it only tells Ruby that the source file itself is encoded in ISO-8859-1. Simply add Encoding.default_external = 'iso-8859-1' if you expect all files that you read to be stored in that encoding.
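For example (a minimal sketch; note that this changes the default for every subsequent read in the process):
Encoding.default_external = 'iso-8859-1'
puts File.read('foo.txt').encoding   # => ISO-8859-1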
No, I don't personally think Ruby should auto-detect encodings, but reasonable people can disagree on that one, and a discussion of "should it be so" seems off-topic here.
Personally, I use UTF-8 for everything, and in the rare circumstances that I can't control encoding, I manually set the encoding when I read the file, as demonstrated above. My source files are always in UTF-8. If you're dealing with files that you can't control and don't know the encoding of, the charguess gem or similar would be useful.
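If you can't control the file but do learn its encoding some other way, one approach (a sketch, not tied to any particular gem) is to read the raw bytes, tag them yourself, and transcode to whatever you use internally:
raw = File.binread('foo.txt')            # binary (ASCII-8BIT) string, bytes untouched
str = raw.force_encoding('iso-8859-1')   # same bytes, now labelled ISO-8859-1
puts str.encode('utf-8')                 # transcode for a UTF-8 terminal or internal use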
Source: https://stackoverflow.com/questions/11806512/ruby-1-9-wrong-file-encoding-on-windows