I have a text file that was encoded with UTF-8 (for language-specific characters). I need to use RandomAccessFile to seek to a specific position and read from there.
I want to read a line starting at that position.
Once you are positioned on a given line (meaning you have answered the first part of your problem; see @martinjs's answer), you can read the whole line and make a String
out of it using the statement given in the answer by @Matthieu. But to check that the statement in question is correct, we have to ask ourselves 3 questions. It is not self-evident.
Note that getting at the start of a line may require analyzing the text to build an index, if you need to access many lines randomly and quickly.
The statement to read a line and turn it into a String is:

String utf8 = new String(raf.readLine().getBytes("ISO-8859-1"), "UTF-8");

It works in three steps, each of which we have to check:

1. readLine(): UTF-8 bytes → UTF-16 chars, ok? Yes. UTF-16 gives a meaning to every integer from 0 to 255 coded on 2 bytes when the most significant byte (MSB) is 0, and readLine() guarantees the MSB is 0.

2. .getBytes("ISO-8859-1"): characters encoded in UTF-16 (a Java String, with 1 or 2 char code units per character) → ISO-8859-1 bytes, ok? Yes. The code points of the characters in the Java string are all ≤ 255, and ISO-8859-1 is a "raw" encoding, which means it can encode every one of those characters as a single byte.

3. new String(..., "UTF-8"): ISO-8859-1 bytes → UTF-8 decoding, ok? Yes. Since the original bytes come from UTF-8 encoded text and have been extracted as-is, they still represent text encoded in UTF-8.

Concerning the raw nature of ISO-8859-1, in which every byte value from 0 to 255 is mapped onto a character, I copy/paste below the comment I made on the answer by @Matthieu.
See this question concerning the notion of "raw" encoding with ISO-8859-1. Note the difference between ISO/IEC 8859-1 (191 bytes defined) and ISO-8859-1 (256 bytes defined). You can find the definition of ISO-8859-1 in RFC1345 and see that control codes C0 and C1 are mapped onto the 65 unused bytes of ISO/IEC 8859-1.
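The "raw" property claimed above, that ISO-8859-1 maps every byte value 0-255 onto a character and back losslessly, can be checked directly. A minimal sketch (the class name is just for the demo):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    public static void main(String[] args) {
        // Build an array containing every possible byte value 0..255.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        // ISO-8859-1 maps each byte onto exactly one char (U+0000..U+00FF),
        // so decoding and re-encoding must reproduce the original bytes.
        String asLatin1 = new String(all, StandardCharsets.ISO_8859_1);
        byte[] back = asLatin1.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(all, back)); // prints "true"
    }
}
```

This is exactly why the trick works: no byte sequence is "invalid" in ISO-8859-1, unlike UTF-8.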
You can convert a string read by readLine() to UTF-8 using the following code:
public static void main(String[] args) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(new File("MyFile.txt"), "r")) {
        String line = raf.readLine();
        String utf8 = new String(line.getBytes("ISO-8859-1"), "UTF-8");
        System.out.println("Line: " + line);
        System.out.println("UTF8: " + utf8);
    }
}
Given MyFile.txt containing:

Привет из Украины

the output is:

Line: ÐÑÐ¸Ð²ÐµÑ Ð¸Ð· Ð£ÐºÑÐ°Ð¸Ð½Ñ
UTF8: Привет из Украины
I realise that this is an old question, but it still seems to have some interest, and no accepted answer.
What you are describing is essentially a data structures problem. The discussion of UTF-8 here is a red herring - you would face the same problem with a fixed-length encoding such as ASCII, because you have variable-length lines. What you need is some kind of index.
If you absolutely can't change the file itself (the "string file") - as seems to be the case - you could always construct an external index. The first time (and only the first time) the string file is accessed, you read it all the way through (sequentially), recording the byte position of the start of every line, and finishing by recording the end-of-file position (to make life simpler). This can be achieved by the following code:
List<Long> myList = new ArrayList<>();
myList.add(0L); // assuming the first string starts at the beginning of the file
String line;
while ((line = myRandomAccessFile.readLine()) != null) {
    myList.add(myRandomAccessFile.getFilePointer());
}
You then write these integers into a separate file ("index file"), which you read back in on every subsequent run of your program that accesses the string file. To access the nth string, pick the nth and (n+1)th index from the index file (call these A and B). You then seek to position A in the string file and read B-A bytes (note this count includes the line terminator), which you then decode from UTF-8. For instance, to get line i:
myRandomAccessFile.seek(myList.get(i));
byte[] bytes = new byte[(int) (myList.get(i + 1) - myList.get(i))];
myRandomAccessFile.readFully(bytes);
String result = new String(bytes, "UTF-8");
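The "index file" step described above (write the offsets out once, read them back on later runs) might be sketched like this; the class name and format (a count followed by raw longs) are illustrative, not from the original answer:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LineIndex {
    // Persist the recorded line-start offsets to the index file.
    static void saveIndex(List<Long> offsets, File indexFile) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(indexFile)))) {
            out.writeInt(offsets.size());   // entry count first
            for (long offset : offsets) {
                out.writeLong(offset);      // then each byte offset
            }
        }
    }

    // Load the offsets back on a subsequent run.
    static List<Long> loadIndex(File indexFile) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(indexFile)))) {
            int count = in.readInt();
            List<Long> offsets = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                offsets.add(in.readLong());
            }
            return offsets;
        }
    }
}
```

Rebuilding the index is only needed when the string file changes, which is exactly the maintenance burden the SQLite suggestion below removes.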
In many cases, however, it would be better to use a database such as SQLite, which creates and maintains the index for you. That way, you can add and modify extra "lines" without having to recreate the entire index. See https://www.sqlite.org/cvstrac/wiki?p=SqliteWrappers for Java implementations.
The readUTF() method of RandomAccessFile treats the first two bytes at the current pointer as the length, in bytes, of the content that follows; it then reads that many bytes and returns them as a String.
In order for this method to work, the content should have been written with the writeUTF() method, which stores the content length in those first two bytes before writing the content itself. Otherwise, you will usually get an EOFException.
See http://www.zoftino.com/java-random-access-files for details.
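A minimal round trip showing the writeUTF()/readUTF() pairing described above (the temporary file is just for the demo):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class ReadUtfDemo {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("utfdemo", ".bin"); // demo file
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            // writeUTF stores a 2-byte length prefix, then the encoded content.
            raf.writeUTF("Привет из Украины");
            raf.seek(0);                 // back to the length prefix
            String s = raf.readUTF();    // reads the prefix, then that many bytes
            System.out.println(s);       // prints "Привет из Украины"
        } finally {
            f.delete();
        }
    }
}
```

Note that the pair uses "modified UTF-8" (the DataInput variant), so the bytes on disk are not interchangeable with a plain UTF-8 text file in all cases.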
I find the API for RandomAccessFile challenging.
If your text is actually limited to UTF-8 values 0-127 (the ASCII range, which UTF-8 encodes as single bytes), then it is safe to use readLine(), but read those Javadocs carefully: that is one strange method. To quote:
This method successively reads bytes from the file, starting at the current file pointer, until it reaches a line terminator or the end of the file. Each byte is converted into a character by taking the byte's value for the lower eight bits of the character and setting the high eight bits of the character to zero. This method does not, therefore, support the full Unicode character set.
To read UTF-8 safely, I suggest you read some or all of the raw bytes with a combination of length() and read(byte[]). Then convert your UTF-8 bytes to a Java String with this constructor: new String(byte[], "UTF-8").
To write UTF-8 safely, first convert your Java String to the correct bytes with someText.getBytes("UTF-8"). Finally, write the bytes using write(byte[]).
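The read and write halves of that advice might look like this as a sketch (the class and method names are illustrative; readFully is used instead of a bare read(byte[]) because read may return fewer bytes than requested):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class RafUtf8 {
    // Read the whole file as raw bytes, then decode them as UTF-8.
    static String readAllUtf8(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            byte[] bytes = new byte[(int) raf.length()]; // assumes file fits in one array
            raf.readFully(bytes);   // loops until the buffer is full, unlike read(byte[])
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    // Encode the String as UTF-8 bytes, then write them out.
    static void writeUtf8(File file, String text) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.write(text.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```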
You aren’t going to be able to go at it this way. The seek function will position you by some number of bytes; there is no guarantee that you are aligned to a UTF-8 character boundary.
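If you still need to seek to an arbitrary byte offset, one common workaround (a sketch, not from this answer) is to skip forward past UTF-8 continuation bytes, which all have the bit pattern 10xxxxxx, until the file pointer lands on the leading byte of a character:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class Utf8Align {
    // After an arbitrary seek, advance the file pointer past any UTF-8
    // continuation bytes (10xxxxxx) so it rests on a character boundary.
    static void alignToCharBoundary(RandomAccessFile raf) throws IOException {
        long pos = raf.getFilePointer();
        while (pos < raf.length()) {
            int b = raf.read();           // reads one byte, advances the pointer
            if ((b & 0xC0) != 0x80) {     // not 10xxxxxx: start of a character
                raf.seek(pos);            // step back onto that leading byte
                return;
            }
            pos++;
        }
    }
}
```

This only guarantees character alignment, not line alignment; to land on line starts you still need something like the offset index described in another answer.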