Need help understanding unexpected behavior using LINQ Join with HashSet<T>

筅森魡賤 submitted on 2019-12-02 09:33:57

Your bug is almost certainly somewhere in the vast amount of code you did not show in the question. My advice is that you simplify your program down to the simplest possible program that produces the bug. In doing so, either you will find your bug, or you will produce a program that is so simple that you can post all of it in your question and then we can analyze it.

"Assuming the behavior is correct, I would greatly appreciate someone explaining the underlying logic behind this (unexpected) Join/HashSet behavior."

Since I do not know what the unexpected behaviour is, I cannot say why it happens. I can however say precisely what Join does, and perhaps that will help.

Join takes the following:

  • An "outer" collection -- the receiver of the Join.
  • An "inner" collection -- the first argument of the extension method after the receiver.
  • Two key extractors: one extracts a key from each element of the outer collection, the other from each element of the inner collection.
  • A projection that takes a member of the outer collection and a member of the inner collection whose keys match, and produces the result for that match.
  • A comparison operation that compares two keys for equality.
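To make those five parts concrete, here is a small sketch of a Join call with each argument labelled. The customer and order data are invented purely for illustration, and the final sort is there only because HashSet<T> makes no promise about enumeration order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class JoinPartsDemo
{
    public static List<string> Run()
    {
        // "Outer" collection: the receiver of the Join call (hypothetical sample data).
        var customers = new HashSet<(int Id, string Name)>
        {
            (1, "Ann"),
            (2, "Bob"),
        };

        // "Inner" collection: the first argument after the receiver.
        var orders = new List<(int CustomerId, string Item)>
        {
            (1, "book"),
            (1, "pen"),
            (3, "lamp"),
        };

        return customers
            .Join(
                orders,
                customer => customer.Id,                                 // outer key extractor
                order => order.CustomerId,                               // inner key extractor
                (customer, order) => customer.Name + ": " + order.Item,  // projection
                EqualityComparer<int>.Default)                           // key comparison
            .OrderBy(line => line, StringComparer.Ordinal)               // fix the order for display
            .ToList();
    }

    public static void Main()
    {
        foreach (string line in Run())
            Console.WriteLine(line);
    }
}
```

Ann has two matching orders, so she appears twice; Bob has none and the "lamp" order has no customer, so neither appears in the result.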

Here's how Join works. (This is logically what happens; the actual implementation details are somewhat optimized.)

First, we iterate over the "inner" collection, exactly once.

For each element of the inner collection, we extract its key, and we form a multi-dictionary that maps from the key to the set of all elements in the inner collection where the key selector produced that key. Keys are compared for equality using the supplied comparison.

Thus, we now have a lookup from TKey to IEnumerable<TInner>.

Second, we iterate over the "outer" collection, exactly once.

For each element of the outer collection, we extract its key, and do a lookup in the multi-dictionary for that key, again, using the supplied key comparison.

We then do a nested loop on each matching element of the inner collection, call the projection on the outer/inner pair, and yield the result.

That is, Join behaves like this pseudocode implementation:

static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>
  (this IEnumerable<TOuter> outer,
  IEnumerable<TInner> inner,
  Func<TOuter, TKey> outerKeySelector,
  Func<TInner, TKey> innerKeySelector,
  Func<TOuter, TInner, TResult> resultSelector,
  IEqualityComparer<TKey> comparer)
{
  // Pass 1: index every inner element by its key.
  // SomeMultiDictionary is a placeholder for any key -> many-values map;
  // it is not a real framework type.
  var lookup = new SomeMultiDictionary<TKey, TInner>(comparer);
  foreach (TInner innerItem in inner)
  {
    TKey innerKey = innerKeySelector(innerItem);
    lookup.Add(innerKey, innerItem);
  }
  // Pass 2: for each outer element, yield one result per matching inner element.
  foreach (TOuter outerItem in outer)
  {
    TKey outerKey = outerKeySelector(outerItem);
    foreach (TInner innerItem in lookup[outerKey])
    {
      TResult result = resultSelector(outerItem, innerItem);
      yield return result;
    }
  }
}
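
If you want something you can actually compile and step through, the following is a minimal sketch of the same two-pass idea. It assumes Enumerable.ToLookup as a stand-in for SomeMultiDictionary, and SketchJoin is a made-up name. It is not the framework's source and does not try to reproduce edge cases such as null-key handling.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class JoinSketch
{
    // Hypothetical helper: a simplified model of Join, not the real implementation.
    public static IEnumerable<TResult> SketchJoin<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
        Func<TOuter, TInner, TResult> resultSelector,
        IEqualityComparer<TKey> comparer)
    {
        // One pass over inner: ToLookup plays the multi-dictionary role here.
        ILookup<TKey, TInner> lookup = inner.ToLookup(innerKeySelector, comparer);

        // One pass over outer.
        foreach (TOuter outerItem in outer)
        {
            // Indexing an ILookup with a key it does not contain yields an empty sequence.
            foreach (TInner innerItem in lookup[outerKeySelector(outerItem)])
                yield return resultSelector(outerItem, innerItem);
        }
    }

    // Compares the sketch against the built-in Join on a small sample.
    public static bool MatchesBuiltIn()
    {
        var outer = new HashSet<string> { "a", "B", "c" };
        var inner = new[] { "A", "b", "b", "d" };

        var expected = outer.Join(inner, o => o, i => i, (o, i) => o + i,
                                  StringComparer.OrdinalIgnoreCase);
        var actual = outer.SketchJoin(inner, o => o, i => i, (o, i) => o + i,
                                      StringComparer.OrdinalIgnoreCase);
        return expected.SequenceEqual(actual);
    }

    public static void Main()
    {
        Console.WriteLine(MatchesBuiltIn());
    }
}
```

If the sketch and the built-in Join disagree on your real data and your real comparer, that is a strong hint about where to look.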

Some suggestions:

  • Replace all your GetHashCode implementations so that they return 0, and run all your tests. They should pass! It is always legal to return zero from GetHashCode. Doing so will almost certainly wreck your performance, but it must not wreck your correctness. If you are in a situation where you require a particular non-zero value of GetHashCode, then you have a bug.
  • Test your key comparison to ensure that it is a valid comparison. It must obey the three rules of equality: (1) reflexivity: a thing always equals itself, (2) symmetry: A equals B exactly when B equals A, (3) transitivity: if A equals B and B equals C then A must equal C. If these rules are not met then Join can behave weirdly.
  • Replace your Join with a SelectMany and a Where. That is:

    from o in outer
    join i in inner on getOuterKey(o) equals getInnerKey(i)
    select getResult(o, i)

can be rewritten as

    from o in outer
    from i in inner
    where keyEquality(getOuterKey(o), getInnerKey(i))
    select getResult(o, i)

That query is slower than the join version, but it is logically exactly the same. Again, run your tests. Do you get the same result? If not, you have a bug somewhere in your logic.
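
That cross-check is easy to automate. Here is a minimal sketch, assuming a helper I am calling SameResults (a made-up name) that runs both forms over the same snapshot of the data and compares the output sequences. It uses the comparer's Equals for the where clause, and it ignores edge cases such as null keys.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class JoinCrossCheck
{
    // Hypothetical helper: true when Join and the nested-loop query agree.
    public static bool SameResults<TOuter, TInner, TKey, TResult>(
        IEnumerable<TOuter> outer,
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> getOuterKey,
        Func<TInner, TKey> getInnerKey,
        Func<TOuter, TInner, TResult> getResult,
        IEqualityComparer<TKey> keyComparer)
    {
        // Snapshot both inputs so the two queries enumerate them in the same order.
        List<TOuter> outerList = outer.ToList();
        List<TInner> innerList = inner.ToList();

        var joined = outerList.Join(innerList, getOuterKey, getInnerKey, getResult, keyComparer);

        // The slow form: every outer/inner pair, filtered by key equality only.
        // GetHashCode is never called on this path.
        var nested =
            from o in outerList
            from i in innerList
            where keyComparer.Equals(getOuterKey(o), getInnerKey(i))
            select getResult(o, i);

        return joined.SequenceEqual(nested);
    }

    public static bool Demo()
    {
        var outer = new HashSet<int> { 1, 2, 3 };
        var inner = new List<int> { 2, 3, 3, 4 };
        return SameResults(outer, inner, o => o, i => i, (o, i) => (o, i),
                           EqualityComparer<int>.Default);
    }

    public static void Main()
    {
        Console.WriteLine(Demo());
    }
}
```

Because the nested form never touches GetHashCode, a disagreement between the two usually points at a hash code that is inconsistent with Equals.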

Again, I cannot emphasize strongly enough that your attitude that "Join is probably broken when given a hash table" is what is holding you back from finding your bug. Join isn't broken. That code hasn't changed in a decade, it is very simple, and it was right when we wrote it the first time. Much more likely is that your complicated and arcane key comparison logic is broken somewhere.
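
To illustrate the kind of comparison bug I mean, here is a contrived sketch; both comparers are invented for this example. The first has an Equals that ignores case and a GetHashCode that does not, so equal keys get different hash codes and Join should find no matches. The second returns zero from GetHashCode, as in the first suggestion above, and the match comes back.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ComparerDemo
{
    // Deliberately broken: Equals ignores case, GetHashCode is case-sensitive.
    private sealed class InconsistentComparer : IEqualityComparer<string>
    {
        public bool Equals(string x, string y) =>
            string.Equals(x, y, StringComparison.OrdinalIgnoreCase);

        // Sum of character codes: "alpha" and "ALPHA" hash differently.
        public int GetHashCode(string s) => s.Sum(c => (int)c);
    }

    // Slow but legal: every key lands in the same bucket, so only Equals decides.
    private sealed class ZeroHashComparer : IEqualityComparer<string>
    {
        public bool Equals(string x, string y) =>
            string.Equals(x, y, StringComparison.OrdinalIgnoreCase);

        public int GetHashCode(string s) => 0;
    }

    // Joins one outer key against one inner key that differs only by case.
    public static int CountMatches(IEqualityComparer<string> comparer)
    {
        var outer = new HashSet<string> { "alpha" };
        var inner = new[] { "ALPHA" };
        return outer.Join(inner, o => o, i => i, (o, i) => o, comparer).Count();
    }

    public static int WithInconsistentHash() => CountMatches(new InconsistentComparer());

    public static int WithZeroHash() => CountMatches(new ZeroHashComparer());

    public static void Main()
    {
        Console.WriteLine(WithInconsistentHash());
        Console.WriteLine(WithZeroHash());
    }
}
```

Join is doing exactly what it was asked to do in both cases; the only thing that changed is the comparer.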
