How to read a UTF-8 encoded file using RandomAccessFile?

深忆病人 2020-12-06 05:35

I have a text file encoded in UTF-8 (for language-specific characters). I need to use RandomAccessFile to seek to a specific position and read from there.

I want rea

8 answers
  •  执笔经年
    2020-12-06 05:55

    Once you are positioned at the start of a given line (which means you have solved the first part of your problem; see @martinjs's answer), you can read the whole line and build a String from it using the statement given in @Matthieu's answer. Whether that statement is correct is not self-evident, though: to check it, we have to ask ourselves four questions.

    Note that finding the start of a line may require scanning the text to build an index, if you need fast random access to many lines.
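
    A minimal sketch of such an index, not taken from any of the answers: it scans the file once and records the byte offset at which each line starts. The names `LineIndex`, `buildIndex` and `readLineAt` are hypothetical, and the sketch assumes lines end with `\n` or `\r\n`. Searching for the raw byte 0x0A is safe in UTF-8 because every byte of a multi-byte sequence is 0x80 or higher. `readLineAt` decodes the line with the statement discussed in this answer.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class LineIndex {

    /** Scans the file once and returns the byte offset of the first byte of each line. */
    static List<Long> buildIndex(Path file) {
        List<Long> offsets = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            long pos = 0;
            boolean atLineStart = true;
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    offsets.add(pos);
                    atLineStart = false;
                }
                // 0x0A never appears inside a multi-byte UTF-8 sequence.
                if (b == '\n') {
                    atLineStart = true;
                }
                pos++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return offsets;
    }

    /** Seeks to a byte offset taken from the index and reads one line as UTF-8 text. */
    static String readLineAt(Path file, long offset) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            String raw = raf.readLine(); // one char per byte, high byte zero
            if (raw == null) {
                return null; // offset is at or past end of file
            }
            return new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("line-index-demo", ".txt");
        try {
            Files.write(tmp, "a\n\u00e9t\u00e9\nz".getBytes(StandardCharsets.UTF_8));
            List<Long> index = buildIndex(tmp);
            System.out.println(index);                         // [0, 2, 8]
            System.out.println(readLineAt(tmp, index.get(1))); // the second line
        } finally {
            Files.delete(tmp);
        }
    }
}
```

    The index costs one full pass over the file; after that, each lookup is a single seek followed by one line read.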

    The statement that reads a line and turns it into a String is:

    String utf8 = new String(raf.readLine().getBytes("ISO-8859-1"), "UTF-8");
    
    1. What is a byte in UTF-8? In other words, which byte values are allowed? We'll see that this question becomes moot once question 2 is answered.
    2. readLine(). UTF-8 bytes → UTF-16 code units: OK? Yes, because UTF-16 gives a meaning to every integer from 0 to 255 coded on 2 bytes when the most significant byte (MSB) is 0, and readLine() guarantees exactly that: it turns each byte it reads into one char whose high byte is zero.
    3. getBytes("ISO-8859-1"). Characters encoded in UTF-16 (a Java String, with 1 or 2 char code units per character) → ISO-8859-1 bytes: OK? Yes. The code points of the characters in the Java string are all ≤ 255, and ISO-8859-1 is a "raw" encoding, which means it encodes each of those characters as the single byte with the same value.
    4. new String(..., "UTF-8"). ISO-8859-1 bytes → decoded as UTF-8: OK? Yes. Since the original bytes come from UTF-8 encoded text and have been extracted as is, they still represent text encoded in UTF-8.
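
    The steps behind questions 2 to 4 can be sketched as follows. This is only an illustration: the class and method names (`ReadLineRoundTrip`, `readUtf8Line`) are made up, and the demo writes its own temporary file. `StandardCharsets` constants are used instead of the charset names, which is equivalent and avoids the checked UnsupportedEncodingException.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, for illustration only.
final class ReadLineRoundTrip {

    /** Reads one line from the current position and decodes it as UTF-8. */
    static String readUtf8Line(RandomAccessFile raf) throws IOException {
        // Question 2: each byte becomes one char with its high byte set to zero.
        String raw = raf.readLine();
        if (raw == null) {
            return null; // end of file
        }
        // Question 3: every char is <= 255, so ISO-8859-1 gives back the original bytes.
        byte[] originalBytes = raw.getBytes(StandardCharsets.ISO_8859_1);
        // Question 4: the bytes are still UTF-8 encoded text, so decode them as such.
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("utf8-demo", ".txt");
        try {
            // "héllo €" is 7 characters but 10 bytes in UTF-8 (é = 2 bytes, € = 3 bytes).
            Files.write(tmp, "h\u00e9llo \u20ac\nsecond\n".getBytes(StandardCharsets.UTF_8));
            try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
                String raw = raf.readLine();
                System.out.println(raw.length());     // 10: one char per byte
                raf.seek(0);
                String decoded = readUtf8Line(raf);
                System.out.println(decoded.length()); // 7: real characters
                System.out.println(readUtf8Line(raf)); // second
            }
        } finally {
            Files.delete(tmp);
        }
    }
}
```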

    Concerning the raw nature of ISO-8859-1, in which every byte (value 0 to 255) is mapped onto a character, I reproduce below the comment I made on @Matthieu's answer.

    See this question concerning the notion of "raw" encoding with ISO-8859-1. Note the difference between ISO/IEC 8859-1 (191 byte values defined) and ISO-8859-1 (256 byte values defined). You can find the definition of ISO-8859-1 in RFC 1345 and see that the C0 and C1 control codes are mapped onto the 65 byte values left unused by ISO/IEC 8859-1.
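
    A quick way to check the "raw" property on a given JVM is to push all 256 byte values through the charset and back. The sketch below does that; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical check, for illustration only.
final class Latin1RoundTrip {

    /** Returns true if ISO-8859-1 maps every byte value to the char of the same value and back. */
    static boolean roundTripsAllBytes() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, StandardCharsets.ISO_8859_1);
        for (int i = 0; i < 256; i++) {
            // Byte value i must decode to code point i, control codes included.
            if (decoded.charAt(i) != i) {
                return false;
            }
        }
        // Encoding again must return exactly the original bytes.
        return Arrays.equals(all, decoded.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        System.out.println(roundTripsAllBytes());
    }
}
```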
