Hashmap slower after deserialization - Why?

佛祖请我去吃肉 2021-01-01 03:44

I have a pretty large HashMap (~250MB). Creating it takes about 50-55 seconds, so I decided to serialize it and save it to a file. Reading the map back from the file takes about 16-17 seconds, but lookups against the deserialized map are noticeably slower: 300 million lookups take about 3.1 seconds with a map I build myself and about 8.5 seconds with one read from the file. Does anybody have an idea why? Am I overlooking something obvious?

2 Answers
  • 2021-01-01 03:55

    You report that 300 million lookups take about 3.1 seconds when you build the HashMap yourself, and about 8.5 seconds when you read the same HashMap from the file.

    One possible cause is that the reconstructed HashMap may not have the same capacity (number of buckets) as the original one, which could increase the frequency of hash collisions or (if the capacity is larger) decrease the locality of main-memory accesses (resulting in more cache misses). To verify, use a debugger to inspect the length of map.table before and after reconstruction. If this is indeed the case, try copying the data into a new HashMap constructed with an appropriate initial capacity and load factor.
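
    If attaching a debugger is inconvenient, a rough alternative is to read the bucket count via reflection and, if it turns out to be too small, rebuild the map with an explicit initial capacity. The sketch below is my own illustration rather than part of this answer: it relies on the private table field of java.util.HashMap, so it is JDK-version specific and may need --add-opens java.base/java.util=ALL-UNNAMED on recent JVMs.

    import java.lang.reflect.Field;
    import java.util.HashMap;
    import java.util.Map;

    public class CapacityCheck
    {
        // Reads the length of HashMap's internal bucket array (the private
        // "table" field). Returns 0 if the table has not been allocated yet.
        static int bucketCount(HashMap<?, ?> map) throws Exception
        {
            Field tableField = HashMap.class.getDeclaredField("table");
            tableField.setAccessible(true);
            Object[] table = (Object[]) tableField.get(map);
            return table == null ? 0 : table.length;
        }

        // Copies the data into a new HashMap whose initial capacity is large
        // enough that the default load factor (0.75) never forces a resize.
        static <K, V> HashMap<K, V> resized(Map<K, V> source)
        {
            HashMap<K, V> copy = new HashMap<K, V>((int) (source.size() / 0.75f) + 1);
            copy.putAll(source);
            return copy;
        }
    }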

    As for why serialization does not maintain capacity: HashMap customizes its serialization format (it makes no sense to serialize null for every empty table element) by providing writeObject and readObject methods, and ignores the capacity it finds in the input stream:

    /**
     * Reconstitute the {@code HashMap} instance from a stream (i.e.,
     * deserialize it).
     */
    private void readObject(java.io.ObjectInputStream s)
        throws IOException, ClassNotFoundException {
        // Read in the threshold (ignored), loadfactor, and any hidden stuff
        s.defaultReadObject();
        reinitialize();
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new InvalidObjectException("Illegal load factor: " +
                                             loadFactor);
        s.readInt();                // Read and ignore number of buckets
        int mappings = s.readInt(); // Read number of mappings (size)
        if (mappings < 0)
            throw new InvalidObjectException("Illegal mappings count: " +
                                             mappings);
        else if (mappings > 0) { // (if zero, use defaults)
            // Size the table using given load factor only if within
            // range of 0.25...4.0
            float lf = Math.min(Math.max(0.25f, loadFactor), 4.0f);
            float fc = (float)mappings / lf + 1.0f;
            int cap = ((fc < DEFAULT_INITIAL_CAPACITY) ?
                       DEFAULT_INITIAL_CAPACITY :
                       (fc >= MAXIMUM_CAPACITY) ?
                       MAXIMUM_CAPACITY :
                       tableSizeFor((int)fc));
            float ft = (float)cap * lf;
            threshold = ((cap < MAXIMUM_CAPACITY && ft < MAXIMUM_CAPACITY) ?
                         (int)ft : Integer.MAX_VALUE);
            @SuppressWarnings({"rawtypes","unchecked"})
                Node<K,V>[] tab = (Node<K,V>[])new Node[cap];
            table = tab;
    
            // Read the keys and values, and put the mappings in the HashMap
            for (int i = 0; i < mappings; i++) {
                @SuppressWarnings("unchecked")
                    K key = (K) s.readObject();
                @SuppressWarnings("unchecked")
                    V value = (V) s.readObject();
                putVal(hash(key), key, value, false, false);
            }
        }
    }
    

    I suspect it ignores the number of buckets to prevent a denial-of-service attack in which an attacker crafts a serialization stream with an unrealistically high (or low) number of buckets, which would cause an OutOfMemoryError (or excessive CPU load due to hash collisions). That would be a cheap way to mount a denial-of-service attack against any application that accepts serialized data from untrusted sources (CVE-2012-2739 describes such an issue).

  • 2021-01-01 04:03

    This question was interesting, so I wrote my own test case to verify it. I found no difference in speed between lookups against a map built in memory and lookups against one loaded from a serialized file. The program is available at the end of the post for anyone interested in running it.

    • The methods were monitored using JProfiler.
    • The serialized file is comparable in size to yours: ~230 MB.
    • Lookups in memory cost 1210 ms without any serialization


    • After serializing the map and reading it back in, the cost of lookups remained essentially the same (1224 ms).


    • The profiler was tweaked to add minimal overhead in both scenarios.
    • This was measured on Java(TM) SE Runtime Environment (build 1.6.0_25-b06) / 4 CPUs running at 1.7 GHz / 4 GB RAM at 800 MHz

    Measuring is tricky. I too noticed the 8-second lookup time that you described, but guess what else I noticed when that happened:

    GC activity


    Your measurements are probably picking that up too. If you isolate the measurement of Map.get() alone, you will see that the results are comparable.
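
    As a minimal sketch of what I mean by isolating the measurement (my own addition, a helper that could be dropped into the program below): encourage a collection first so leftover deserialization garbage is not charged to the lookups, and time nothing but the get() calls.

    private static long timeLookups(Map<Long, Integer> map, long[] keys)
    {
        System.gc(); // ask the VM to collect leftover deserialization garbage before timing
        long checksum = 0; // consume the results so the JIT cannot discard the loop
        final long start = System.nanoTime();
        for (long key : keys)
        {
            final Integer value = map.get(key);
            if (value != null)
            {
                checksum += value;
            }
        }
        final long elapsedMs = (System.nanoTime() - start) / 1000000L;
        System.out.println("lookups: " + elapsedMs + " ms (checksum " + checksum + ")");
        return elapsedMs;
    }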


    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Random;

    public class GenericTest
    {
        public static void main(String... args)
        {
            // Call the methods below as needed for a live vs. serialize/deserialize run
        }
    
        private static Map<Long, Integer> generateHashMap()
        {
            Map<Long, Integer> map = new HashMap<Long, Integer>();
            final Random random = new Random();
            for(int counter = 0 ; counter < 10000000 ; counter++)
            {
                final int value = random.nextInt();
                final long key = random.nextLong();
                map.put(key, value);
            }
            return map;
        }
    
        private static void lookupItems(int n, Map<Long, Integer> map)
        {
            final Random random = new Random();
            for(int counter = 0 ; counter < n ; counter++)
            {
                final long key = random.nextLong();
                final Integer value = map.get(key);
            }
        }
    
        private static void serialize(Map<Long, Integer> map)
        {
            try
            {
                File file = new File("temp/omaha.ser");
                FileOutputStream f = new FileOutputStream(file);
                ObjectOutputStream s = new ObjectOutputStream(new BufferedOutputStream(f));
                s.writeObject(map);
                s.close();
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }
        }
    
        private static Map<Long, Integer> deserialize()
        {
            try
            {
                File file = new File("temp/omaha.ser");
                FileInputStream f = new FileInputStream(file);
                ObjectInputStream s = new ObjectInputStream(new BufferedInputStream(f));
                @SuppressWarnings("unchecked")
                HashMap<Long, Integer> map = (HashMap<Long, Integer>) s.readObject();
                s.close();
                return map;
            }
            catch (Exception e)
            {
                throw new RuntimeException(e);
            }
        }
    }
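
    For completeness, here is one possible way to fill in main for a live vs. serialize/deserialize comparison. This is my own sketch of how the methods above could be wired together, not the exact driver behind the numbers quoted earlier.

    public static void main(String... args)
    {
        // Build the map once and time lookups against it directly.
        final Map<Long, Integer> liveMap = generateHashMap();
        long start = System.nanoTime();
        lookupItems(300000000, liveMap);
        System.out.println("live map:         " + (System.nanoTime() - start) / 1000000L + " ms");

        // Round-trip the map through serialization, then time the same workload again.
        serialize(liveMap);
        final Map<Long, Integer> loadedMap = deserialize();
        start = System.nanoTime();
        lookupItems(300000000, loadedMap);
        System.out.println("deserialized map: " + (System.nanoTime() - start) / 1000000L + " ms");
    }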
    