File encoded as UCS-2 Little Endian reports 2x too many lines to Java

强颜欢笑 submitted on 2020-01-03 17:23:00

Question


I was processing several txt files with a simple Java program, and the first step of my process is counting the lines of each file:

int count = 0;
BufferedReader br = new BufferedReader(new FileReader(myFile)); // myFile is the txt file in question
while (br.readLine() != null) {
    count++;
}

For one of my files, Java counted exactly twice as many lines as there really were! This confused me greatly at first. I opened each file in Notepad++ and could see that the miscounted file ended every line in exactly the same way as the other files, with a CR and LF. I did a little more poking around and noticed that all my "ok" files were ANSI encoded, while the one problem file was encoded as UCS-2 Little Endian (which I know nothing about). I got these files elsewhere, so I have no idea why that one was encoded that way, but of course switching it to ANSI fixed the issue.

But now curiosity remains. Why was the encoding causing a double line count report?

Thanks!


Answer 1:


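Simple: UCS-2 (like UTF-16) stores each character in two bytes, so for ASCII-range text every second byte is 0x00. If you apply the wrong encoding when reading such text (e.g. ANSI, or any other 8-bit encoding), each of those zero bytes is decoded as a separate NUL character. This turns every CR-LF into CR-NUL-LF, and because the LF no longer directly follows the CR, the pair is seen as two line breaks (one for CR and one for LF).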
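The effect can be reproduced without any file at all. The following is only a minimal sketch: the class name DoubleCountDemo and the helper countLines are made up for illustration, and ISO-8859-1 stands in for "ANSI" on the assumption that any single-byte encoding behaves the same way here. It encodes two CR-LF terminated lines as UTF-16LE and counts them once with the wrong charset and once with the right one.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DoubleCountDemo {
    // Hypothetical helper: counts lines with readLine(), as in the question,
    // but over in-memory bytes decoded with an explicitly chosen charset.
    static int countLines(byte[] data, Charset charset) {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            int count = 0;
            while (br.readLine() != null) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two lines, each ending in CR LF, stored as 2 bytes per character.
        byte[] data = "one\r\ntwo\r\n".getBytes(StandardCharsets.UTF_16LE);

        // Wrong: an 8-bit charset sees CR, NUL, LF, NUL instead of CR LF.
        System.out.println(countLines(data, StandardCharsets.ISO_8859_1)); // prints 5

        // Right: the bytes are decoded as 16-bit units, so CR LF stays together.
        System.out.println(countLines(data, StandardCharsets.UTF_16LE));   // prints 2
    }
}
```

With the wrong charset each CR LF yields two line breaks (four lines here), and the NUL left over after the final LF is returned as one more short line, hence 5 rather than 2 in this sample.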




Answer 2:


This is the problem:

new FileReader(myFile)

That will use the platform default encoding. Don't do that. Use

new InputStreamReader(new FileInputStream(myFile), encoding)

where encoding is the appropriate encoding for the file. You've got to use the right encoding, or you won't read the file properly. Unfortunately of course that relies on you knowing the encoding...
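Applied to the line-counting loop from the question, this might look like the sketch below. It assumes the file really is UTF-16 little endian; the class name ExplicitCharsetLineCount and the helper writeSample (which only creates a temporary test file) are hypothetical and not part of the original code.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class ExplicitCharsetLineCount {
    // Counts lines using a caller-supplied charset instead of the platform default.
    static int countLines(File file, Charset charset) {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset))) {
            int count = 0;
            while (br.readLine() != null) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical demo helper: writes text to a temporary file in the given charset.
    static File writeSample(String text, Charset charset) {
        try {
            File f = File.createTempFile("sample", ".txt");
            f.deleteOnExit();
            Files.write(f.toPath(), text.getBytes(charset));
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File f = writeSample("one\r\ntwo\r\nthree\r\n", StandardCharsets.UTF_16LE);
        System.out.println(countLines(f, StandardCharsets.UTF_16LE)); // prints 3
    }
}
```

If the file starts with a byte order mark, decoding with StandardCharsets.UTF_16LE leaves it as a U+FEFF character at the start of the first line; that does not change the line count, but it matters once the line contents are processed.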

EDIT: To answer the question of why the lines were double counted rather than just "how do I fix it", see Lucero's answer :)



Source: https://stackoverflow.com/questions/10070431/file-encoded-as-ucs-2-little-endian-reports-2x-too-many-lines-to-java
