Why are HashSets of structs with nullable values incredibly slow?

自闭症患者 2020-12-24 10:41

I investigated performance degradation and tracked it down to slow HashSets.
I have structs with nullable values that are used as a primary key, for example a NullableLongWrapper struct that wraps a single long? value and is stored in a HashSet.

2 Answers
  • 2020-12-24 10:50

    This is due to how the default struct GetHashCode() behaves. If it finds reference types, it tries to get the hash from the first non-reference-type field. In your case such a field WAS found: Nullable<> is also a struct, so it just picked up its private boolean field (a single byte, padded for alignment), which is the same for every element that has a value.

  • 2020-12-24 11:02

    This happens because every element of _nullableWrappers returns the same hash code from GetHashCode(), so all elements collide in a single bucket and lookups degenerate from O(1) to O(N).

    You can verify this by printing out all the hash codes.
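
    One way to run that check, as a minimal sketch. The struct below is assumed to match the original one (no GetHashCode() override), and the element count and the Program class are arbitrary choices for illustration. The number of distinct hash codes depends on the runtime version; a count of 1 would confirm that every element lands in the same bucket.

    ```csharp
    using System;
    using System.Linq;

    // Assumed shape of the original struct: no GetHashCode() override.
    public struct NullableLongWrapper
    {
        private readonly long? _value;

        public NullableLongWrapper(long? value)
        {
            _value = value;
        }

        public long? Value => _value;
    }

    public static class Program
    {
        public static void Main()
        {
            // Hypothetical test data standing in for _nullableWrappers.
            var nullableWrappers = Enumerable.Range(0, 10000)
                .Select(i => new NullableLongWrapper(i))
                .ToArray();

            // Count how many different hash codes the default implementation produces.
            int distinct = nullableWrappers
                .Select(w => w.GetHashCode())
                .Distinct()
                .Count();

            Console.WriteLine($"{nullableWrappers.Length} elements, {distinct} distinct hash codes");
        }
    }
    ```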

    If you modify your struct like so:

    public struct NullableLongWrapper
    {
        private readonly long? _value;
    
        public NullableLongWrapper(long? value)
        {
            _value = value;
        }
    
        public override int GetHashCode()
        {
            return _value.GetHashCode();
        }
    
        public long? Value => _value;
    }
    

    it works much more quickly.
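
    Overriding GetHashCode() alone still leaves Equals() on the default ValueType implementation. A minimal sketch of a fuller version, assuming you also want typed equality: the IEquatable<T> implementation and the Program class are additions for illustration, not part of the original code.

    ```csharp
    using System;
    using System.Collections.Generic;

    public struct NullableLongWrapper : IEquatable<NullableLongWrapper>
    {
        private readonly long? _value;

        public NullableLongWrapper(long? value)
        {
            _value = value;
        }

        public long? Value => _value;

        // Typed equality: HashSet<T> picks this up through EqualityComparer<T>.Default.
        public bool Equals(NullableLongWrapper other) => _value == other._value;

        public override bool Equals(object obj) =>
            obj is NullableLongWrapper other && Equals(other);

        // Nullable<long>.GetHashCode() returns 0 for null, otherwise the hash of the wrapped long.
        public override int GetHashCode() => _value.GetHashCode();
    }

    public static class Program
    {
        public static void Main()
        {
            var set = new HashSet<NullableLongWrapper>();
            for (long i = 0; i < 1000; i++)
            {
                set.Add(new NullableLongWrapper(i));
            }
            set.Add(new NullableLongWrapper(null)); // the null key is a valid, distinct element
            set.Add(new NullableLongWrapper(5));    // duplicate of an existing element, ignored

            Console.WriteLine(set.Count); // 1001
        }
    }
    ```

    Keeping Equals() and GetHashCode() consistent with each other is what the hash set relies on: equal values must always produce equal hash codes.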

    Now, the obvious question is WHY the hash code of every NullableLongWrapper is the same.

    The answer to that is discussed in this thread. However, it doesn't quite answer the question, since Hans' answer revolves around the struct having TWO fields from which to choose when computing the hash code - but in this code, there's only one field to choose from - and it's a value type (a struct).

    However, the moral of this story is: Never rely on the default GetHashCode() for value types!


    Addendum

    I thought that perhaps what was happening was related to Hans' answer in the thread I linked (maybe it was taking the value of the first field, the bool, in the Nullable<T> struct), and my experiments indicate that it may be related, but it's complicated:

    Consider this code and its output:

    using System;
    
    public class Program
    {
        static void Main()
        {
            var a = new Test {A = 0, B = 0};
            var b = new Test {A = 1, B = 0};
            var c = new Test {A = 0, B = 1};
            var d = new Test {A = 0, B = 2};
            var e = new Test {A = 0, B = 3};
    
            Console.WriteLine(a.GetHashCode());
            Console.WriteLine(b.GetHashCode());
            Console.WriteLine(c.GetHashCode());
            Console.WriteLine(d.GetHashCode());
            Console.WriteLine(e.GetHashCode());
        }
    }
    
    public struct Test
    {
        public int A;
        public int B;
    }
    
    Output:
    
    346948956
    346948957
    346948957
    346948958
    346948959
    

    Note how the second and third hash codes (for 1/0 and 0/1) are the same, but the others are all different. I find this strange because clearly changing A changes the hash code, as does changing B, but given two values X and Y, the same hash code is generated for A=X, B=Y and A=Y, B=X.

    (That sounds like some XOR stuff is happening behind the scenes, but that's a guess.)
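
    To make that guess concrete, here is a small model, not the runtime's actual code. It assumes the hash is a type-dependent constant XORed with each 32-bit field; the Seed constant is simply the value printed above for A = 0, B = 0, and the XorModel class is hypothetical. Under that assumption the model reproduces all five values, including the collision between 1/0 and 0/1, because XOR is symmetric in its operands.

    ```csharp
    using System;

    public static class XorModel
    {
        // Hypothetical seed: the value printed above for A = 0, B = 0.
        private const int Seed = 346948956;

        // Model only: XOR the two 32-bit fields into the seed.
        public static int Hash(int a, int b) => Seed ^ a ^ b;

        public static void Main()
        {
            Console.WriteLine(Hash(0, 0)); // 346948956
            Console.WriteLine(Hash(1, 0)); // 346948957
            Console.WriteLine(Hash(0, 1)); // 346948957
            Console.WriteLine(Hash(0, 2)); // 346948958
            Console.WriteLine(Hash(0, 3)); // 346948959
        }
    }
    ```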

    Incidentally, this behaviour where BOTH fields can be shown to contribute to the hash code proves that the comment in the reference source for ValueType.GetHashCode() is inaccurate or wrong:

    Action: Our algorithm for returning the hashcode is a little bit complex. We look for the first non-static field and get it's hashcode. If the type has no non-static fields, we return the hashcode of the type. We can't take the hashcode of a static member because if that member is of the same type as the original type, we'll end up in an infinite loop.

    If that comment were true, then four of the five hash codes in the example above would be the same, since A has the same value, 0, in all of them. (That assumes A is the first field, but you get the same results if you swap the values around: both fields clearly contribute to the hash code.)

    Then I tried changing the first field to be a bool:

    using System;
    
    public class Program
    {
        static void Main()
        {
            var a = new Test {A = false, B = 0};
            var b = new Test {A = true,  B = 0};
            var c = new Test {A = false, B = 1};
            var d = new Test {A = false, B = 2};
            var e = new Test {A = false, B = 3};
    
            Console.WriteLine(a.GetHashCode());
            Console.WriteLine(b.GetHashCode());
            Console.WriteLine(c.GetHashCode());
            Console.WriteLine(d.GetHashCode());
            Console.WriteLine(e.GetHashCode());
        }
    }
    
    public struct Test
    {
        public bool A;
        public int  B;
    }
    
    Output:
    
    346948956
    346948956
    346948956
    346948956
    346948956
    

    Wow! So making the first field a bool makes all the hash codes come out the same, regardless of the values of ANY of the fields!

    This still looks like some kind of bug to me.

    The bug has been fixed in .NET 4, but only for Nullable. Custom types still yield the bad behavior. source
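
    For custom types like the bool/int struct above, the practical workaround is the same as for the wrapper: supply your own GetHashCode(). A minimal sketch, assuming a simple multiply-and-XOR combination is good enough for your keys; the multiplier 397 is an arbitrary odd constant chosen for illustration, and the IEquatable<T> implementation is an addition to the original struct.

    ```csharp
    using System;

    public struct Test : IEquatable<Test>
    {
        public bool A;
        public int B;

        public bool Equals(Test other) => A == other.A && B == other.B;

        public override bool Equals(object obj) => obj is Test other && Equals(other);

        public override int GetHashCode()
        {
            unchecked
            {
                // Both fields contribute; 397 is an arbitrary odd multiplier.
                return ((A ? 1 : 0) * 397) ^ B;
            }
        }
    }

    public static class Program
    {
        public static void Main()
        {
            Console.WriteLine(new Test { A = false, B = 0 }.GetHashCode()); // 0
            Console.WriteLine(new Test { A = true,  B = 0 }.GetHashCode()); // 397
            Console.WriteLine(new Test { A = false, B = 1 }.GetHashCode()); // 1
            Console.WriteLine(new Test { A = false, B = 2 }.GetHashCode()); // 2
            Console.WriteLine(new Test { A = false, B = 3 }.GetHashCode()); // 3
        }
    }
    ```

    Because the hash is computed explicitly from the field values, the result no longer depends on the struct's layout or on the runtime version.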
