Hashmap slower after deserialization - Why?

a 夏天 提交于 2019-12-18 13:41:12

问题


I have a pretty large Hashmap (~250MB). Creating it takes about 50-55 seconds, so I decided to serialize it and save it to a file. Reading from the file takes about 16-17 seconds now.

The only problem is that lookups seems to be slower this way. I always thought that the hashmap is read from the file into the memory, so the performance should be the same compared to the case when I create the hashmap myself, right? Here is the code I am using to read the hashmap into a file:

File file = new File("omaha.ser");
FileInputStream f = new FileInputStream(file);
ObjectInputStream s = new ObjectInputStream(new BufferedInputStream(f));
omahaMap = (HashMap<Long, Integer>) s.readObject();
s.close();

300 million lookups take about 3.1 seconds when I create the hashmap myself, and about 8.5 seconds when I read the same hashmap from file. Does anybody have an idea why? Am I overlooking something obvious?

EDIT:

I "measured" the time by just taking the time with System.nanotime(), so no proper benchmark method used. Here is the code:

public class HandEvaluationTest
{
    public static void Test()
    {

        HandEvaluation.populate5Card();
        HandEvaluation.populate9CardOmaha();


        Card[] player1cards = {new Card("4s"), new Card("2s"), new Card("8h"), new Card("4d")};
        Card[] player2cards = {new Card("As"), new Card("9s"), new Card("6c"), new Card("2h")};
        Card[] player3cards = {new Card("9h"), new Card("7h"), new Card("Kc"), new Card("Kh")};
        Card[] table = {new Card("2d"), new Card("2c"), new Card("3c"), new Card("5c"), new Card("4h")};


        int j=0, k=0, l=0;
        long startTime = System.nanoTime();
        for(int p=0; p<100000000; p++)    {
           j = HandEvaluation.handEval9Hash(player1cards, table);
            k = HandEvaluation.handEval9Hash(player2cards, table);
            l = HandEvaluation.handEval9Hash(player3cards, table);

        }
        long estimatedTime = System.nanoTime() - startTime;
        System.out.println("Time needed: " + estimatedTime*Math.pow(10,-6) + "ms");
        System.out.println("Handstrength Player 1: " + j);
        System.out.println("Handstrength Player 2: " + k);
        System.out.println("Handstrength Player 3: " + l);
    }
}

The big hashmap work is done in HandEvaluation.populate9CardOmaha(). The 5-card one is small. The code for the big one:

 public static void populate9CardOmaha()
        {

            //Check if the hashmap is already there- then just read it and exit
            File hashmap = new File("omaha.ser");
            if(hashmap.exists())
            {
                try
                {
                    File file = new File("omaha.ser");
                    FileInputStream f = new FileInputStream(file);
                    ObjectInputStream s = new ObjectInputStream(new BufferedInputStream(f));
                    omahaMap = (HashMap<Long, Integer>) s.readObject();
                    s.close();
                }
                catch(IOException ioex) {ioex.printStackTrace();}
                catch(ClassNotFoundException cnfex)
                {
                    System.out.println("Class not found");
                    cnfex.printStackTrace();
                    return;
                }
                return;
            }

    // if it's not there, populate it yourself
    ... Code for populating hashmap ...
    // and then save it to file
          (

            try
            {
                File file = new File("omaha.ser");
                FileOutputStream f = new FileOutputStream(file);
                ObjectOutputStream s = new ObjectOutputStream(new BufferedOutputStream(f));
                s.writeObject(omahaMap);
                s.close();
            }
            catch(IOException ioex) {ioex.printStackTrace();}
        }

When i am populating it myself (= file is not here), lookups in the HandEvaluationTest.Test() take about 8 seconds instead of 3. Maybe it's just my very naive way of measuring the time elapsed?


回答1:


This question was interesting, so I wrote my own test case to verify it. I found no difference in speed for a live lookup Vs one that was loaded from a serialized file. The program is available at the end of the post for anyone interested in running it.

  • The methods were monitored using JProfiler.
  • The serialized file is comparable to yours. ~ 230 MB.
  • Lookups in memory cost 1210 ms without any serialization

  • After serializing the map and reading them again, the cost of lookups remained the same (well almost - 1224 ms)

  • The profiler was tweaked to add minimal overhead in both scenarios.
  • This was measured on Java(TM) SE Runtime Environment (build 1.6.0_25-b06) / 4 CPUs running at 1.7 Ghz / 4GB Ram 800 Mhz

Measuring is tricky. I myself noticed the 8 second lookup time that you described, but guess what else I noticed when that happened.

GC activity

Your measurements are probably picking that up too. If you isolate the measurements of Map.get() alone you'll see that the results are comparable.


public class GenericTest
{
    public static void main(String... args)
    {
        // Call the methods as you please for a live Vs ser <-> de_ser run
    }

    private static Map<Long, Integer> generateHashMap()
    {
        Map<Long, Integer> map = new HashMap<Long, Integer>();
        final Random random = new Random();
        for(int counter = 0 ; counter < 10000000 ; counter++)
        {
            final int value = random.nextInt();
            final long key = random.nextLong();
            map.put(key, value);
        }
        return map;
    }

    private static void lookupItems(int n, Map<Long, Integer> map)
    {
        final Random random = new Random();
        for(int counter = 0 ; counter < n ; counter++)
        {
            final long key = random.nextLong();
            final Integer value = map.get(key);
        }
    }

    private static void serialize(Map<Long, Integer> map)
    {
        try
        {
            File file = new File("temp/omaha.ser");
            FileOutputStream f = new FileOutputStream(file);
            ObjectOutputStream s = new ObjectOutputStream(new BufferedOutputStream(f));
            s.writeObject(map);
            s.close();
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }

    private static Map<Long, Integer> deserialize()
    {
        try
        {
            File file = new File("temp/omaha.ser");
            FileInputStream f = new FileInputStream(file);
            ObjectInputStream s = new ObjectInputStream(new BufferedInputStream(f));
            HashMap<Long, Integer> map = (HashMap<Long, Integer>) s.readObject();
            s.close();
            return map;
        }
        catch (Exception e)
        {
            throw new RuntimeException(e);
        }
    }
}



回答2:


300 million lookups take about 3.1 seconds when I create the hashmap myself, and about 8.5 seconds when I read the same hashmap from file. Does anybody have an idea why? Am I overlooking something obvious?

One possible cause is that the reconstructed HashMap may not have the same capacity (number of buckets) as the original one, which might increase the frequency of hash collisions or (if the size is increased) decrease locality of main memory access (resulting in more cache misses). To verify, use a debugger to inspect the length of map.table before and after reconstruction. If this is indeed the case, try copying the data into a new HashMap with an appropriate loadFactor.

As for why serialization does not maintain capacity: HashMap customizes its serialization format (it makes no sense to serialize null for every empty table element) by providing writeObject and readObject methods, and ignores the capacity it finds in the input stream:

/**
 * Reconstitute the {@code HashMap} instance from a stream (i.e.,
 * deserialize it).
 */
private void readObject(java.io.ObjectInputStream s)
    throws IOException, ClassNotFoundException {
    // Read in the threshold (ignored), loadfactor, and any hidden stuff
    s.defaultReadObject();
    reinitialize();
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new InvalidObjectException("Illegal load factor: " +
                                         loadFactor);
    s.readInt();                // Read and ignore number of buckets
    int mappings = s.readInt(); // Read number of mappings (size)
    if (mappings < 0)
        throw new InvalidObjectException("Illegal mappings count: " +
                                         mappings);
    else if (mappings > 0) { // (if zero, use defaults)
        // Size the table using given load factor only if within
        // range of 0.25...4.0
        float lf = Math.min(Math.max(0.25f, loadFactor), 4.0f);
        float fc = (float)mappings / lf + 1.0f;
        int cap = ((fc < DEFAULT_INITIAL_CAPACITY) ?
                   DEFAULT_INITIAL_CAPACITY :
                   (fc >= MAXIMUM_CAPACITY) ?
                   MAXIMUM_CAPACITY :
                   tableSizeFor((int)fc));
        float ft = (float)cap * lf;
        threshold = ((cap < MAXIMUM_CAPACITY && ft < MAXIMUM_CAPACITY) ?
                     (int)ft : Integer.MAX_VALUE);
        @SuppressWarnings({"rawtypes","unchecked"})
            Node<K,V>[] tab = (Node<K,V>[])new Node[cap];
        table = tab;

        // Read the keys and values, and put the mappings in the HashMap
        for (int i = 0; i < mappings; i++) {
            @SuppressWarnings("unchecked")
                K key = (K) s.readObject();
            @SuppressWarnings("unchecked")
                V value = (V) s.readObject();
            putVal(hash(key), key, value, false, false);
        }
    }
}

I suspect it ignores the number of buckets to prevent a denial of service attack where an attacker crafts a serialization stream, and gives an unrealistically high (or low) number of buckets, which would cause an OutOfMemoryError (or excessive CPU load due to hash collisions), which would be a cheap way to do a denial of service attack against any application accepting serialized data from unstrusted sources (CVE-2012-2739 describes such an issue).



来源:https://stackoverflow.com/questions/28525956/hashmap-slower-after-deserialization-why

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!