Malformed binary serialization of HashMap<String,Double>

Submitted by 老子叫甜甜 on 2019-12-11 17:52:53

Question


I wrote some code to serialize a HashMap<String,Double> by iterating over its entries and serializing each of them, instead of using ObjectOutputStream.writeObject(). The reason is just efficiency: the resulting file is much smaller, and it is much faster to write and read (e.g. 23 MB in 0.6 seconds vs. 29 MB in 9.9 seconds).

This is what I did to serialize:

ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream("test.bin"));
oos.writeInt(map.size()); // write size of the map
for (Map.Entry<String, Double> entry : map.entrySet()) { // iterate entries
    System.out.println("writing ("+ entry.getKey() +","+ entry.getValue() +")");
    byte[] bytes = entry.getKey().getBytes();
    oos.writeInt(bytes.length); // length of key string
    oos.write(bytes); // key string bytes
    oos.writeDouble(entry.getValue()); // value
}
oos.close();

As you can see, I get the byte array for each key String, serialize its length and then the array itself. This is what I did to deserialize:

ObjectInputStream ois = new ObjectInputStream(new FileInputStream("test.bin"));
int size = ois.readInt(); // read size of the map
HashMap<String, Double> newMap = new HashMap<>(size);
for (int i = 0; i < size; i++) { // iterate entries
    int length = ois.readInt(); // length of key string
    byte[] bytes = new byte[length];
    ois.read(bytes); // key string bytes
    String key = new String(bytes);
    double value = ois.readDouble(); // value
    newMap.put(key, value);
    System.out.println("read ("+ key +","+ value +")");
}

The problem is that at some point the key is not serialized correctly. I've been debugging to the point where I could see that ois.read(bytes) read 8 bytes instead of 16 as it was supposed to, so the key String was not properly formed and the double value was read using the last 8 bytes from the key that were not read yet. In the end, Exceptions everywhere.

Using the sample data below, the output will be like this at some point:

read (2010-00-056.html,12154.250518054876)
read (2010-00-        ,1.4007397428546247E-76)
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at ti.Test.main(Test.java:82)

The problem can be seen in the serialized file, where the key should read 2010-00-008.html: two extra bytes appear in the middle of the key String. See MxyL's answer for further info about this. So it all boils down to: why are those two bytes added, and why does readFully work?

Why isn't the String properly (de)serialized? Might it be some kind of padding to a fixed block size or something like that? Is there a better way to manually serialize a String when looking for efficiency? I was expecting some kind of writeString and readString, but there seems to be no such thing in Java's ObjectStream.

I've tried buffered streams in case something was wrong there, explicitly stating how many bytes to write and to read, and different encodings, but no luck.

This is some sample data to reproduce the problem:

HashMap<String, Double> map = new HashMap<String, Double>();
map.put("2010-00-027.html",21732.994621513037); map.put("2010-00-020.html",3466.5169348296736); map.put("2010-00-051.html",12528.648992702407); map.put("2010-00-062.html",3354.8950010256385);
map.put("2010-00-024.html",10295.095511718278); map.put("2010-00-052.html",5381.513344679818);  map.put("2010-00-007.html",16466.33813960735);  map.put("2010-00-017.html",9484.969198176652);
map.put("2010-00-054.html",15423.873112634772); map.put("2010-00-022.html",8123.842752870753);  map.put("2010-00-033.html",21238.496665104063); map.put("2010-00-028.html",7578.792651786424);
map.put("2010-00-048.html",3566.4118233046393); map.put("2010-00-040.html",2681.0799941861724); map.put("2010-00-049.html",14308.090890746222); map.put("2010-00-058.html",5911.342406606804);
map.put("2010-00-045.html",2284.118716145881);  map.put("2010-00-031.html",2859.565771680721);  map.put("2010-00-046.html",4555.187022907964);  map.put("2010-00-036.html",8479.709295569426);
map.put("2010-00-061.html",846.8292195815125);  map.put("2010-00-023.html",14108.644025417952); map.put("2010-00-041.html",22686.232732684934); map.put("2010-00-025.html",9513.539663409734);
map.put("2010-00-012.html",459.6427911376829);  map.put("2010-00-005.html",0.0);    map.put("2010-00-013.html",2646.403220496738);  map.put("2010-00-065.html",5808.86423609936);
map.put("2010-00-056.html",12154.250518054876); map.put("2010-00-008.html",10811.15198506469);  map.put("2010-00-042.html",9271.006516004005);  map.put("2010-00-000.html",4387.4162586468965);
map.put("2010-00-059.html",4456.211623469774);  map.put("2010-00-055.html",3534.7511584735325); map.put("2010-00-057.html",8745.640098512009);  map.put("2010-00-032.html",4993.295735075575);
map.put("2010-00-021.html",3852.5805998017922); map.put("2010-00-043.html",4108.020033536286);  map.put("2010-00-053.html",2.2446400279239946); map.put("2010-00-030.html",17853.541210836203);

Answer 1:


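ObjectOutputStream first writes STREAM_MAGIC (0xACED) and then STREAM_VERSION (5). The primitive data you write is then emitted in blocks: each full block starts with TC_BLOCKDATALONG (0x7A) followed by the block size (1024) as a 4-byte int, and the last block, if its length is at most 255, starts with TC_BLOCKDATA (0x77) followed by its length in a single byte. The two "extra" bytes in the middle of the key are such a block header (0x77 plus a length byte), not part of the key itself.

So when you call readFully on ObjectInputStream, it first skips STREAM_MAGIC and STREAM_VERSION, and then for every block it reads the block header to get the size and reads that many bytes into its buffer. readFully keeps going across block boundaries until your array is full, whereas a single read only hands back what is left in the current block.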
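To see the framing described above, here is a minimal sketch that serializes a single int into memory and dumps the raw bytes. The class and method names (StreamHeaderDemo, writeOneInt, hex) are made up for illustration; only the ObjectOutputStream calls come from the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

public class StreamHeaderDemo {

    /** Writes one int through an ObjectOutputStream and returns the raw stream bytes. */
    public static byte[] writeOneInt(int value) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(sink)) {
            oos.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed (and therefore flushed) before the bytes are copied out.
        return sink.toByteArray();
    }

    /** Formats bytes as space-separated uppercase hex pairs. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Layout: AC ED (STREAM_MAGIC), 00 05 (STREAM_VERSION),
        // 77 (TC_BLOCKDATA), 04 (block length), then the 4 payload bytes of the int.
        System.out.println(hex(writeOneInt(42)));
        // prints: AC ED 00 05 77 04 00 00 00 2A
    }
}
```

Only 4 of the 10 bytes are payload; the rest is the stream header and one block header. In a larger file, further block headers appear wherever one block ends and the next begins.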




Answer 2:


ois.read(bytes); // key string bytes

Change this to use readFully(). You're assuming the read filled the buffer. It isn't obliged to transfer more than one byte.

Is there a better way to manually serialize a String when looking for efficiency?

There is the writeUTF() and readUTF() pair.

You should note that by calling getBytes() you're introducing a platform dependency. You should specify the charset both here and when reconstructing the String.
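As a rough sketch of the writeUTF()/readUTF() suggestion, the map can be round-tripped like this. The class name UtfMapCodec, its methods, and the generated sample keys are hypothetical, and the example works in memory rather than on test.bin. One known limit: writeUTF() rejects strings whose encoded form exceeds 65535 bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

public class UtfMapCodec {

    /** Writes the entry count, then each key via writeUTF and each value via writeDouble. */
    public static byte[] write(Map<String, Double> map) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(sink)) {
            oos.writeInt(map.size());
            for (Map.Entry<String, Double> entry : map.entrySet()) {
                oos.writeUTF(entry.getKey());      // length prefix + modified UTF-8, no platform charset involved
                oos.writeDouble(entry.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Reads back what write() produced. */
    public static Map<String, Double> read(byte[] data) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
            int size = ois.readInt();
            Map<String, Double> map = new HashMap<>(size * 2);
            for (int i = 0; i < size; i++) {
                String key = ois.readUTF();        // reads the complete string or throws
                double value = ois.readDouble();
                map.put(key, value);
            }
            return map;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds sample data shaped like the question's keys (generated, not the original values). */
    public static Map<String, Double> sample(int entries) {
        Map<String, Double> map = new HashMap<>();
        for (int i = 0; i < entries; i++) {
            map.put(String.format("2010-00-%03d.html", i), i * 1.5);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, Double> original = sample(100);
        Map<String, Double> copy = read(write(original));
        System.out.println("round trip ok: " + original.equals(copy));
    }
}
```

Because readUTF() never returns a partial string, the reading loop needs no manual length handling, and the encoding is fixed by the format instead of by the platform default.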




Answer 3:


There are two things of interest to note here.

First, if you took out the last 4 entries in your sample data, the error does not occur. That is, the two bytes are not erroneously added. Weird.

Second, if you open your file in a hex editor, and scroll down to the entry where the two extra bytes occur, you will see that it begins with a 4-byte integer that is correctly a value of 16 (keep in mind this is big-endian). Then you see your string with the two extra bytes, followed by the double associated with it.

Now, what's weird is how Java is reading those bytes. First, it reads the length of the string as you have instructed. It then tries to read 16 bytes...but here it appears to have failed to read 16 bytes, since your print statements show

read (2010-00-,1.3980409401811577E-76))

Now place the cursor right after those two weird bytes. From where the string starts to where the cursor currently is, it seems to have only read 10 bytes.

Furthermore, when I tried to copy that line from my IDE's console, it only pasted

read (2010-00-

When a string suddenly ends in my copy-paste, I usually suspect null bytes. Looking at my clipboard, it does indeed look like the bytes weren't being read completely into the buffer.

Ok, so it looks like Java only managed to read 10 bytes and moved on, which explains the string and the number afterwards.

So it would appear that when you read and pass in a buffer, it doesn't get completely filled. There's even a recommendation from the tooltip itself that tells me to use readFully!

So doing a little testing, I went ahead and changed

ois.read(bytes); // key string bytes

to

ois.readFully(bytes, 0, length); // key string bytes

And for whatever reason, this works.

read (2010-00-013.html,2646.403220496738)
read (2010-00-005.html,0.0)
read (2010-00-056.html,12154.250518054876)
read (2010-00-008.html,10811.15198506469)
read (2010-00-042.html,9271.006516004005)
read (2010-00-000.html,4387.4162586468965)  // where it was failing before
read (2010-00-059.html,4456.211623469774)

Problem

Now, the fact that it actually worked is a problem. WHY does it work? It is pretty clear that there are two extra bytes in between your string (causing it to have a length of 18, not 16). It's not like the file has changed or anything.

Indeed, when I manually edited the file so that it only has three entries, and I indicate that there are only two, this is the output I get:

read (2010-00-056.html,12154.250518054876)
read (2010-00-wd008.ht,1.2466701288348126E219)

This is what I expect from a string with 18 bytes (well, maybe not that wd, I expected w,), but you specified that there are only 16. You should agree that the fact that using readFully actually worked, is weird.

So there are several mysteries

  1. Why are those two extra bytes added?
  2. Why are they NOT added when you remove the last 4 entries (or more if you want)?
  3. Why does using readFully work, all else constant?

Unfortunately, this answer doesn't answer your questions, and I'm also pretty stumped right now, not only by the problems you raised, but by the behaviors that I'm seeing.




Answer 4:


ObjectInputStream#read doesn't guarantee that it will read bytes.length bytes. When the read lands on the edge of the current read-ahead buffer block, it only returns the number of bytes remaining in that buffer. It should be written this way:

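        int offset = 0;
        while (offset < length) {
            int cnt = ois.read(bytes, offset, length - offset); // may deliver fewer bytes than requested
            if (cnt < 0) { // end of stream before the key was complete
                throw new java.io.EOFException("expected " + length + " key bytes, got " + offset);
            }
            offset += cnt;
        }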
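The short read can be reproduced in memory with a small sketch. ShortReadDemo and its methods are hypothetical names, and the 1016 filler bytes are an assumption chosen to mirror the question's layout, where 4 bytes for the map size, 36 entries of 28 bytes, and a 4-byte key length precede the key that broke. With the 1024-byte blocks described in the first answer, one would expect the single read to return 8 bytes. That count depends on the JDK's internal buffering, so the code only reports it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ShortReadDemo {
    static final int FILLER = 1016;                 // assumption: bytes written before the key
    static final String KEY = "2010-00-000.html";   // 16 ASCII bytes, as in the question

    /** Writes filler followed by the key, so the key straddles the 1024-byte mark. */
    public static byte[] writeStraddlingKey() {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(sink)) {
            oos.write(new byte[FILLER]);
            oos.write(KEY.getBytes(StandardCharsets.US_ASCII));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Returns how many key bytes one plain read(byte[]) call delivers. */
    public static int singleRead(byte[] data) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
            ois.readFully(new byte[FILLER]);        // skip the filler
            return ois.read(new byte[KEY.length()]);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the key with readFully, which blocks until the array is full or throws. */
    public static String readFullyKey(byte[] data) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
            ois.readFully(new byte[FILLER]);
            byte[] key = new byte[KEY.length()];
            ois.readFully(key);
            return new String(key, StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = writeStraddlingKey();
        System.out.println("single read returned " + singleRead(data) + " of " + KEY.length() + " bytes");
        System.out.println("readFully returned   " + readFullyKey(data));
    }
}
```

This also suggests why the error disappears with fewer entries: as long as all the data fits into one block, no key is split across a block boundary.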


Source: https://stackoverflow.com/questions/23944422/malformed-binary-serialization-of-hashmapstring-double
