Question
I was processing several txt files with a simple Java program, and the first step of my process is counting the lines of each file:
int count = 0;
BufferedReader br = new BufferedReader(new FileReader(myFile)); // myFile is the txt file in question
while (br.readLine() != null) {
    count++;
}
For one of my files, Java was counting exactly twice as many lines as there really were! This was confusing me greatly at first. I opened each file in Notepad++ and could see that the mis-counting file ended every line in exactly the same way as the other files, with a CR and LF. I did a little more poking around and noticed that all my "ok" files were ANSI encoded, and the one problem file was encoded as UCS-2 Little Endian (which I know nothing about). I got these files elsewhere, so I have no idea why the one was encoded that way, but of course switching it to ANSI fixed the issue.
But now curiosity remains. Why was the encoding causing a double line count report?
Thanks!
Answer 1:
Simple: if you read UCS-2 (or UTF-16) text with the wrong encoding (e.g. ANSI, or any other 8-bit encoding), every second decoded character comes out as a NUL (0x00). That turns each CR-LF into CR-NUL-LF-NUL, so the reader sees a lone CR followed (after the NUL) by a lone LF and counts two line breaks instead of one.
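Here is a minimal, self-contained sketch (not part of the original answer) that reproduces the effect in memory; the class name and the sample text are made up for illustration:

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DoubleCountDemo {

    // Count lines in the given bytes, decoding them with the given charset.
    static int countLines(byte[] bytes, Charset charset) throws IOException {
        int count = 0;
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            while (br.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Two lines, each ending in CR LF, encoded as UTF-16LE
        // (the same byte layout as UCS-2 Little Endian for these characters).
        byte[] bytes = "first line\r\nsecond line\r\n".getBytes(StandardCharsets.UTF_16LE);

        // Decoded correctly, each CR LF pair is one line break: prints 2.
        System.out.println("UTF-16LE:   " + countLines(bytes, StandardCharsets.UTF_16LE));

        // Decoded as an 8-bit charset, each CR LF becomes CR NUL LF NUL, so the CR
        // and the LF are counted as separate line breaks: roughly twice as many lines.
        System.out.println("ISO-8859-1: " + countLines(bytes, StandardCharsets.ISO_8859_1));
    }
}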
Answer 2:
This is the problem:
new FileReader(myFile)
That will use the platform default encoding. Don't do that. Use
new InputStreamReader(new FileInputStream(myFile), encoding)
where encoding is the appropriate encoding for the file. You've got to use the right encoding, or you won't read the file properly. Unfortunately of course that relies on you knowing the encoding...
EDIT: To answer the question of why the lines were double counted rather than just "how do I fix it", see Lucero's answer :)
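For the concrete case in the question, something like the following should work. This is a sketch rather than a drop-in fix: the file path is a placeholder, and the choice of StandardCharsets.UTF_16LE (which matches a "UCS-2 Little Endian" file for the characters involved here) is an assumption you would confirm against the actual file.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LineCounter {
    public static void main(String[] args) throws IOException {
        Path myFile = Paths.get("data.txt"); // hypothetical path, stand-in for the real file

        int count = 0;
        // Files.newBufferedReader takes the charset explicitly, so nothing falls back
        // to the platform default. Note: the UTF-16LE decoder does not strip a byte-order
        // mark, so a leading U+FEFF may remain on the first line's content (it does not
        // affect the line count).
        try (BufferedReader br = Files.newBufferedReader(myFile, StandardCharsets.UTF_16LE)) {
            while (br.readLine() != null) {
                count++;
            }
        }
        System.out.println(count + " lines");
    }
}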
Source: https://stackoverflow.com/questions/10070431/file-encoded-as-ucs-2-little-endian-reports-2x-too-many-lines-to-java