I have a text file that was encoded with UTF-8 (for language-specific characters). I need to use RandomAccessFile to seek to a specific position and read from there.
I want to read a line starting at that position.
Once you are positioned on a given line (meaning you have answered the first part of your problem; see @martinjs's answer), you can read the whole line and make a String
out of it using the statement given in the answer by @Matthieu. But to check that the statement in question is correct, we have to ask ourselves 3 questions. It is not self-evident.
Note that getting at the start of a line may require analyzing the text to build an index, if you need to access many lines randomly and quickly.
The statement to read a line and turn it into a String is:

String utf8 = new String(raf.readLine().getBytes("ISO-8859-1"), "UTF-8");

It works in three steps, each of which we have to check:

1. readLine(): UTF-8 bytes → UTF-16 chars, ok? Yes. UTF-16 gives a meaning to every integer from 0 to 255 coded on 2 bytes when the most significant byte (MSB) is 0, and readLine() guarantees the MSB is 0.

2. .getBytes("ISO-8859-1"): characters encoded in UTF-16 (a Java String, with 1 or 2 char code units per character) → ISO-8859-1 bytes, ok? Yes. The code points of the characters in the Java string are all ≤ 255, and ISO-8859-1 is a "raw" encoding, which means it can encode every one of those characters as a single byte.

3. new String(..., "UTF-8"): ISO-8859-1 bytes → UTF-8 decoding, ok? Yes. Since the original bytes come from UTF-8 encoded text and have been extracted as-is, they still represent text encoded in UTF-8.

Concerning the raw nature of ISO-8859-1, in which every byte value from 0 to 255 is mapped onto a character, I copy/paste below the comment I made on the answer by @Matthieu.
See this question concerning the notion of "raw" encoding with ISO-8859-1. Note the difference between ISO/IEC 8859-1 (191 bytes defined) and ISO-8859-1 (256 bytes defined). You can find the definition of ISO-8859-1 in RFC1345 and see that control codes C0 and C1 are mapped onto the 65 unused bytes of ISO/IEC 8859-1.
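The "raw" property claimed above, that ISO-8859-1 maps every byte value 0-255 onto a character and back losslessly, can be checked directly. A minimal sketch (the class name is just for the demo):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    public static void main(String[] args) {
        // Build an array containing every possible byte value 0..255.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        // ISO-8859-1 maps each byte onto exactly one char (U+0000..U+00FF),
        // so decoding and re-encoding must reproduce the original bytes.
        String asLatin1 = new String(all, StandardCharsets.ISO_8859_1);
        byte[] back = asLatin1.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(all, back)); // prints "true"
    }
}
```

This is exactly why the trick works: no byte sequence is "invalid" in ISO-8859-1, unlike UTF-8.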
You can convert a string read by readLine() to UTF-8 using the following code:
public static void main(String[] args) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(new File("MyFile.txt"), "r")) {
        String line = raf.readLine();
        String utf8 = new String(line.getBytes("ISO-8859-1"), "UTF-8");
        System.out.println("Line: " + line);
        System.out.println("UTF8: " + utf8);
    }
}
Given MyFile.txt containing:

Привет из Украины

the output is:

Line: ÐÑÐ¸Ð²ÐµÑ Ð¸Ð· Ð£ÐºÑÐ°Ð¸Ð½Ñ
UTF8: Привет из Украины
I realise that this is an old question, but it still seems to have some interest, and no accepted answer.
What you are describing is essentially a data structures problem. The discussion of UTF-8 here is a red herring - you would face the same problem with a fixed-length encoding such as ASCII, because you have variable-length lines. What you need is some kind of index.
If you absolutely can't change the file itself (the "string file") - as seems to be the case - you could always construct an external index. The first time (and only the first time) the string file is accessed, you read it all the way through (sequentially), recording the byte position of the start of every line, and finishing by recording the end-of-file position (to make life simpler). This can be achieved by the following code:
List<Long> myList = new ArrayList<>();
myList.add(0L); // assuming the first string starts at the beginning of the file
String line;
while ((line = myRandomAccessFile.readLine()) != null) {
    myList.add(myRandomAccessFile.getFilePointer());
}
You then write these integers into a separate file ("index file"), which you read back in on every subsequent run of your program that accesses the string file. To access the nth string, pick the nth and (n+1)th index from the index file (call these A and B). You then seek to position A in the string file and read B-A bytes (note this count includes the line terminator), which you then decode from UTF-8. For instance, to get line i:
myRandomAccessFile.seek(myList.get(i));
byte[] bytes = new byte[(int) (myList.get(i + 1) - myList.get(i))];
myRandomAccessFile.readFully(bytes);
String result = new String(bytes, "UTF-8");
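The "index file" step described above (write the offsets out once, read them back on later runs) might be sketched like this; the class name and format (a count followed by raw longs) are illustrative, not from the original answer:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LineIndex {
    // Persist the recorded line-start offsets to the index file.
    static void saveIndex(List<Long> offsets, File indexFile) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(indexFile)))) {
            out.writeInt(offsets.size());   // entry count first
            for (long offset : offsets) {
                out.writeLong(offset);      // then each byte offset
            }
        }
    }

    // Load the offsets back on a subsequent run.
    static List<Long> loadIndex(File indexFile) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(indexFile)))) {
            int count = in.readInt();
            List<Long> offsets = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                offsets.add(in.readLong());
            }
            return offsets;
        }
    }
}
```

Rebuilding the index is only needed when the string file changes, which is exactly the maintenance burden the SQLite suggestion below removes.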
In many cases, however, it would be better to use a database such as SQLite, which creates and maintains the index for you. That way, you can add and modify extra "lines" without having to recreate the entire index. See https://www.sqlite.org/cvstrac/wiki?p=SqliteWrappers for Java implementations.
The readUTF() method of RandomAccessFile treats the first two bytes at the current pointer as the length, in bytes, of the content that follows; it then reads that many bytes and returns them as a String.
In order for this method to work, the content should have been written with the writeUTF() method, which stores the content length in those first two bytes before writing the content itself. Otherwise, you will usually get an EOFException.
See http://www.zoftino.com/java-random-access-files for details.
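A minimal round trip showing the writeUTF()/readUTF() pairing described above (the temporary file is just for the demo):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class ReadUtfDemo {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("utfdemo", ".bin"); // demo file
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            // writeUTF stores a 2-byte length prefix, then the encoded content.
            raf.writeUTF("Привет из Украины");
            raf.seek(0);                 // back to the length prefix
            String s = raf.readUTF();    // reads the prefix, then that many bytes
            System.out.println(s);       // prints "Привет из Украины"
        } finally {
            f.delete();
        }
    }
}
```

Note that the pair uses "modified UTF-8" (the DataInput variant), so the bytes on disk are not interchangeable with a plain UTF-8 text file in all cases.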
I find the API for RandomAccessFile challenging.
If your text is actually limited to UTF-8 values 0-127 (the ASCII range, which UTF-8 encodes as single bytes), then it is safe to use readLine(), but read those Javadocs carefully: that is one strange method. To quote:
This method successively reads bytes from the file, starting at the current file pointer, until it reaches a line terminator or the end of the file. Each byte is converted into a character by taking the byte's value for the lower eight bits of the character and setting the high eight bits of the character to zero. This method does not, therefore, support the full Unicode character set.
To read UTF-8 safely, I suggest you read some or all of the raw bytes with a combination of length() and read(byte[]). Then convert your UTF-8 bytes to a Java String with this constructor: new String(byte[], "UTF-8").
To write UTF-8 safely, first convert your Java String to the correct bytes with someText.getBytes("UTF-8"). Finally, write the bytes using write(byte[]).
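The read and write halves of that advice might look like this as a sketch (the class and method names are illustrative; readFully is used instead of a bare read(byte[]) because read may return fewer bytes than requested):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class RafUtf8 {
    // Read the whole file as raw bytes, then decode them as UTF-8.
    static String readAllUtf8(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            byte[] bytes = new byte[(int) raf.length()]; // assumes file fits in one array
            raf.readFully(bytes);   // loops until the buffer is full, unlike read(byte[])
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    // Encode the String as UTF-8 bytes, then write them out.
    static void writeUtf8(File file, String text) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.write(text.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```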
You aren’t going to be able to go at it this way. The seek function will position you by some number of bytes; there is no guarantee that you are aligned to a UTF-8 character boundary.
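If you still need to seek to an arbitrary byte offset, one common workaround (a sketch, not from this answer) is to skip forward past UTF-8 continuation bytes, which all have the bit pattern 10xxxxxx, until the file pointer lands on the leading byte of a character:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class Utf8Align {
    // After an arbitrary seek, advance the file pointer past any UTF-8
    // continuation bytes (10xxxxxx) so it rests on a character boundary.
    static void alignToCharBoundary(RandomAccessFile raf) throws IOException {
        long pos = raf.getFilePointer();
        while (pos < raf.length()) {
            int b = raf.read();           // reads one byte, advances the pointer
            if ((b & 0xC0) != 0x80) {     // not 10xxxxxx: start of a character
                raf.seek(pos);            // step back onto that leading byte
                return;
            }
            pos++;
        }
    }
}
```

This only guarantees character alignment, not line alignment; to land on line starts you still need something like the offset index described in another answer.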