How to read UTF8 encoded file using RandomAccessFile?

深忆病人 2020-12-06 05:35

I have a text file encoded in UTF-8 (for language-specific characters). I need to use RandomAccessFile to seek to a specific position and read from there.

I want rea

8 answers

执笔经年 (original poster)

2020-12-06 05:55
Once you are positioned at the start of a given line (which means you have solved the first part of your problem; see @martinjs's answer), you can read the whole line and turn it into a String with the statement given in @Matthieu's answer. Whether that statement is correct is not self-evident, though: checking it means answering four questions.

Note that getting to the start of a line may require analyzing the text to build an index, if you need fast random access to many lines.
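
As an illustration of such an index, here is a minimal sketch that scans the file once and records the byte offset at which each line starts. The class and method names (`LineIndex`, `buildIndex`) are made up for this example, and it assumes lines end with `\n`. Scanning for the single byte 0x0A is safe in UTF-8 because every byte of a multi-byte sequence is 0x80 or higher.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
final class LineIndex {

    /** Returns the byte offset of the first byte of every line in the file. */
    static List<Long> buildIndex(Path file) throws IOException {
        List<Long> starts = new ArrayList<>();
        starts.add(0L);                 // the first line starts at offset 0
        long pos = 0;                   // number of bytes consumed so far
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') {
                    starts.add(pos);    // the next line starts right after the '\n'
                }
            }
        }
        // A trailing '\n' leaves an entry that points at end of file; drop it.
        if (starts.size() > 1 && starts.get(starts.size() - 1) == pos) {
            starts.remove(starts.size() - 1);
        }
        return starts;
    }
}
```

Each offset in the returned list can then be passed to `RandomAccessFile.seek()` to jump straight to the corresponding line.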

The statement to read a line and turn it into a String is:
```
String utf8 = new String(raf.readLine().getBytes("ISO-8859-1"), "UTF-8");
```
1. What is a byte in UTF-8? That is, which byte values are allowed? This question turns out to be irrelevant once question 2 is answered.
2. readLine(). UTF-8 bytes → UTF-16 chars, OK? Yes. readLine() turns each byte into one char whose low eight bits hold the byte's value and whose most significant byte (MSB) is 0, and UTF-16 gives a meaning to every integer from 0 to 255 coded on 2 bytes. So every byte value is carried into the String unchanged, whatever it meant in UTF-8.
3. getBytes("ISO-8859-1"). Characters encoded in UTF-16 (a Java String uses 1 or 2 chars, i.e. code units, per character) → ISO-8859-1 bytes, OK? Yes. The code points of the characters in this particular String are all ≤ 255, and ISO-8859-1 is a "raw" encoding: it encodes each of these characters as a single byte with the same value.
4. new String(..., "UTF-8"). ISO-8859-1 bytes → text decoded as UTF-8, OK? Yes. The bytes are exactly the bytes read from the UTF-8 encoded file, extracted as is, so they still represent UTF-8 encoded text.
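
The four steps above can be put together in a small sketch. The names `Utf8LineReader` and `readUtf8Line` are invented for this example, and it assumes `byteOffset` points at the first byte of a line (never into the middle of a multi-byte character).

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
final class Utf8LineReader {

    /**
     * Seeks to byteOffset and returns the line found there, decoded as UTF-8.
     * Returns null if byteOffset is at end of file.
     */
    static String readUtf8Line(RandomAccessFile raf, long byteOffset) throws IOException {
        raf.seek(byteOffset);
        // Step 2: each byte of the line becomes one char in the range 0..255.
        String raw = raf.readLine();
        if (raw == null) {
            return null;
        }
        // Step 3: map each char back to the byte with the same value.
        byte[] original = raw.getBytes(StandardCharsets.ISO_8859_1);
        // Step 4: decode the recovered file bytes as UTF-8.
        return new String(original, StandardCharsets.UTF_8);
    }
}
```

Using the `StandardCharsets` constants instead of the charset names avoids the checked `UnsupportedEncodingException`. The null check matters because `readLine()` returns null at end of file, and the one-line statement would then throw a `NullPointerException`.
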

Concerning the raw nature of ISO-8859-1, in which every byte (value 0 to 255) is mapped onto a character, below is the comment I made on @Matthieu's answer.

See this question concerning the notion of "raw" encoding with ISO-8859-1. Note the difference between ISO/IEC 8859-1 (191 bytes defined) and ISO-8859-1 (256 bytes defined). You can find the definition of ISO-8859-1 in RFC 1345, where the C0 and C1 control codes are mapped onto the 65 bytes that ISO/IEC 8859-1 leaves unused.
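
The "raw" property can be checked directly. The sketch below (the class name `Latin1RoundTrip` is invented for this example) decodes all 256 byte values with Java's ISO-8859-1 charset, verifies that each resulting char has the same numeric value as its byte, and then encodes the String back to confirm that nothing was lost.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical check, not part of the original answer.
final class Latin1RoundTrip {

    /** Returns true if every byte value 0..255 survives bytes -> String -> bytes. */
    static boolean roundTripsAllBytes() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String s = new String(all, StandardCharsets.ISO_8859_1);
        for (int i = 0; i < 256; i++) {
            if (s.charAt(i) != i) {
                return false;   // a byte was not mapped to the char of the same value
            }
        }
        return Arrays.equals(all, s.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        System.out.println(roundTripsAllBytes());
    }
}
```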