Charset of Java source file and failing test

Submitted by 非 Y 不嫁゛ on 2019-12-08 05:30:19

Question


First, I'd like to say that I've spent a lot of time searching for an explanation/solution. I've found hints of the problem, but no way to resolve my particular issue. Hence the post on a topic that seems to have been beaten to death in at least some cases.

I have a Java test class that tests for proper encoding/decoding by a Mime utility. The strings used for testing are declared in the source file and we use assertEquals() to test equality after processing the input string. Here's an example:

String test = "S2, =?iso-8859-1?Q?F=E4ltstr=F6m?= =?iso-8859-1?Q?,_Patrik?= S3";
String expected = "S2, Fältström, PatrikS3";
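For context, the full test would look roughly like the sketch below, assuming JUnit and javax.mail.internet.MimeUtility as the MIME utility (the actual utility under test is not named in the question):

import javax.mail.internet.MimeUtility;
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class MimeDecodeTest {
    @Test
    public void decodesRfc2047EncodedWords() throws Exception {
        String test = "S2, =?iso-8859-1?Q?F=E4ltstr=F6m?= =?iso-8859-1?Q?,_Patrik?= S3";
        String expected = "S2, Fältström, PatrikS3";
        // Fails if the string literals were compiled with the wrong source charset
        assertEquals(expected, MimeUtility.decodeText(test));
    }
}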

In my editor (and other external editors such as Notepad++ and UltraEdit), the input strings display properly if I choose to read them as windows-1252 or ISO-8859-1; read as UTF-8, the expected string displays as "F�ltstr�m".

When compiled and run on a Windows 7 machine, I get the following output:

Expected :S2, F�ltstr�m, PatrikS3

Actual :S2, Fältström, PatrikS3

I get this behaviour in a command shell as well as in my code editor. Bizarrely, it works on a Windows XP machine. Yet I checked the codepage using chcp in a command shell and I get the same output in both cases. The only way I got this to work was to compile the class using "-encoding windows-1252", which I don't want to do for a variety of reasons.
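For reference, that workaround amounts to telling javac the source charset explicitly (the file name here is hypothetical):

javac -encoding windows-1252 MimeDecodeTest.java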

So the questions are:
1) What's different between XP and Windows 7 that makes this fail? Has the default platform encoding changed?
2) How can I fix this so that it works both on a Windows 7 machine and a Linux machine?

Thanks a lot for any insight!


Answer 1:


It looks like the default encoding used on your Windows 7 machine is UTF-8, while on Windows XP it is Windows-1252. So: always be explicit about the encoding your files use when compiling; don't depend on the platform default.

BTW: as far as I know, Java on my Windows 7 machine still uses Windows-1252 as the default.
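A quick way to check which default a given JVM actually picks up (both calls are standard Java APIs):

import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    public static void main(String[] args) {
        // The charset used when no encoding is specified explicitly
        System.out.println(Charset.defaultCharset());
        // The system property the default is derived from
        System.out.println(System.getProperty("file.encoding"));
    }
}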




Answer 2:


Regarding how to fix it: I would suggest storing your test data in a file or files. Ensure the files are saved with the required encoding, and load the test data at runtime using that same encoding. This decouples your tests from the compiler's encoding.
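A minimal sketch of that approach (Java 7+), assuming the strings live one per line in a UTF-8 file named mime-test-data.txt; the file name and layout are illustrative:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class TestDataLoader {
    public static void main(String[] args) throws IOException {
        // The charset is stated explicitly, so javac's -encoding no longer matters
        List<String> lines = Files.readAllLines(
                Paths.get("mime-test-data.txt"), StandardCharsets.UTF_8);
        String test = lines.get(0);
        String expected = lines.get(1);
        System.out.println(test + " -> " + expected);
    }
}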




Answer 3:


I'm not an expert in this matter, but to see if the defaults are indeed different, go to:
Control Panel -> Regional and Language Options -> Advanced tab

In general you cannot expect all your users to use the default Windows Latin charset, and why should you? Also, think about other operating systems, which use other default encodings (*nix, Macs, etc.).
This leaves you with the option of guessing: given the Latin character A, for example, you cannot tell whether the bytes are ASCII, UTF-8, or ISO-8859-1, because all three charsets map that character to the same byte value (0x41).
If you really want to solve this, there is no perfect solution, but with CharsetEncoder (Java SE 7 - CharsetEncoder) and CharsetDecoder (Java SE 7 - CharsetDecoder) you may be able to treat the characters in a specific format and encode/decode them as bytes; a sketch follows the list below. However, this approach still has disadvantages:
1) You cannot expect all character mappings to be detected successfully.
2) It's a performance killer when doing multiple/heavy I/O.
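A minimal sketch of such detection with CharsetDecoder, configured to report invalid input instead of silently replacing it (the byte array is illustrative):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class StrictDecode {
    public static void main(String[] args) {
        byte[] bytes = { 'F', (byte) 0xE4, 'l' }; // "Fäl" encoded as ISO-8859-1
        CharsetDecoder utf8 = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            System.out.println(utf8.decode(ByteBuffer.wrap(bytes)));
        } catch (CharacterCodingException e) {
            // A lone 0xE4 byte is not valid UTF-8, so strict decoding fails here
            System.out.println("Not valid UTF-8 - try another charset");
        }
    }
}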

Your best bet, in my opinion, comes down to one thing: CONVENTION.

Enforce your own encoding/decoding (e.g. UTF-8) with Unix-style line endings (\n) and treat all files as such. If you expect to read files produced by others, and you expect characters that cannot be mapped in your encoding, then try to use a "bigger" charset (UTF-16), or read the "illegal" characters as bytes and write them out as bytes in your own encoding (they will be written in an unreadable/non-representable format, however!)
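A sketch of writing under that convention, with the output file name chosen for illustration:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ConventionWriter {
    public static void main(String[] args) throws IOException {
        // UTF-8 and an explicit "\n", regardless of the platform defaults
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("out.txt"), StandardCharsets.UTF_8)) {
            out.write("S2, Fältström, Patrik S3\n");
        }
    }
}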

My $0.02. Have fun :)

EDIT: Also check this post: Charset conversion Java




Answer 4:


The prior answers suffice.

As others have mentioned, be explicit. For your information: in our projects we set the (Java) source encoding to UTF-8 to stay international and avoid falling back to \uXXXX escapes. Readers and Writers mention the encoding explicitly. We stick to UTF-8 even in our national projects; UTF-8 seems to be an emerging convention.

// Charset named explicitly instead of relying on the platform default
BufferedReader in = new BufferedReader(
      new InputStreamReader(new FileInputStream(fileName), "UTF-8"));

MIME string escapes are not needed with the JavaMail API, which can handle UTF-8 in subjects and content.
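For example, MimeMessage lets you state the charset when setting the subject (construction of the message is elided here):

// mimeMessage is an existing javax.mail.internet.MimeMessage instance
mimeMessage.setSubject("S2, Fältström, Patrik S3", "UTF-8");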



Source: https://stackoverflow.com/questions/8328956/charset-of-java-source-file-and-failing-test
