Bad implementation of Enumerable.Single?

佛祖请我去吃肉 2020-12-15 03:19

I came across this implementation in Enumerable.cs by reflector.

public static TSource Single<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)


        
7 answers
  • 2020-12-15 03:42

    Yes, I do find it slightly strange, especially because the overload that doesn't take a predicate (i.e. works on just the sequence) does seem to have the quick-throw 'optimization'.


    In the BCL's defence, however, I would say that the InvalidOperationException that Single throws is a boneheaded exception that shouldn't normally be used for control-flow. It's not necessary for such cases to be optimized by the library.

    Code that uses Single where zero or multiple matches are a perfectly valid possibility, such as:

    try
    {
        var item = source.Single(predicate);
        DoSomething(item);
    }
    catch (InvalidOperationException)
    {
        DoSomethingElseUnexceptional();
    }
    

    should be refactored to code that doesn't use the exception for control-flow, such as (only a sample; this can be implemented more efficiently):

    var firstTwo = source.Where(predicate).Take(2).ToArray();
    
    if(firstTwo.Length == 1) 
    {
        // Note that this won't fail. If it does, this code has a bug.
        DoSomething(firstTwo.Single()); 
    }
    else
    {
        DoSomethingElseUnexceptional();
    }
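
    As a sketch of the more efficient variant hinted at above: a single pass that stops at the second match and allocates no array. The name TrySingle and its out-parameter shape are assumptions made for illustration, not something the BCL is claimed to provide.

    ```csharp
    using System;
    using System.Collections.Generic;

    public static class SingleSketch
    {
        // Hypothetical helper (not a BCL method): returns true and sets `result`
        // only when exactly one element matches. Stops enumerating at the second match.
        public static bool TrySingle<T>(this IEnumerable<T> source, Func<T, bool> predicate, out T result)
        {
            if (source == null) throw new ArgumentNullException(nameof(source));
            if (predicate == null) throw new ArgumentNullException(nameof(predicate));

            result = default(T);
            bool found = false;
            foreach (T item in source)
            {
                if (!predicate(item)) continue;
                if (found)
                {
                    // Second match: not a single-element result, so bail out early.
                    result = default(T);
                    return false;
                }
                result = item;
                found = true;
            }
            return found; // false when there was no match at all
        }
    }
    ```

    With such a helper the call site becomes: if (source.TrySingle(predicate, out var item)) DoSomething(item); else DoSomethingElseUnexceptional();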
    

    In other words, we should leave the use of Single to cases when we expect the sequence to contain exactly one match. It should behave identically to First but with the additional run-time assertion that the sequence doesn't contain multiple matches. Like any other assertion, failure, i.e. cases when Single throws, should be used to represent bugs in the program (either in the method running the query or in the arguments passed to it by the caller).

    This leaves us with two cases:

    1. The assertion holds: There is a single match. In this case, we want Single to consume the entire sequence anyway to assert our claim. There's no benefit to the 'optimization'. In fact, one could argue that the sample implementation of the 'optimization' provided by the OP will actually be slower because of the check on every iteration of the loop.
    2. The assertion fails: There are zero or multiple matches. In this case, we do throw later than we could, but this isn't such a big deal since the exception is boneheaded: it is indicative of a bug that must be fixed.

    To sum up, if the 'poor implementation' is biting you performance-wise in production, either:

    1. You are using Single incorrectly.
    2. You have a bug in your program. Once the bug is fixed, this particular performance problem will go away.

    EDIT: Clarified my point.

    EDIT: Here's a valid use of Single, where failure indicates bugs in the calling code (bad argument):

    public static User GetUserById(this IEnumerable<User> users, string id)
    {
        if (users == null)
            throw new ArgumentNullException("users");

        // Perfectly fine if documented that a failure in the query
        // is treated as an exceptional circumstance. Caller's job
        // to guarantee pre-condition.
        return users.Single(user => user.Id == id);
    }
    
  • 2020-12-15 03:49

    When considering this implementation we must remember that this is the BCL: general code that is supposed to work well enough in all sorts of scenarios.

    First, take these scenarios:

    1. Iterate over 10 numbers, where the first and second elements are equal
    2. Iterate over 1,000,000 numbers, where the first and third elements are equal

    The original algorithm will work well enough for 10 items, but for 1M items it wastes a great many cycles. So in cases where we know that there are two or more matches early in the sequence, the proposed optimization would have a nice effect.

    Then, look at these scenarios:

    1. Iterate over 10 numbers, where the first and last elements are equal
    2. Iterate over 1,000,000 numbers, where the first and last elements are equal

    In these scenarios the algorithm is still required to inspect every item in the lists. There is no shortcut. The original algorithm performs well enough; it fulfills the contract. Changing the algorithm by introducing an if on each iteration will actually decrease performance. For 10 items it will be negligible, but for 1M items it will be a big hit.

    IMO, the original implementation is the correct one, since it is good enough for most scenarios. Knowing the implementation of Single is good though, because it enables us to make smart decisions based on what we know about the sequences we use it on. If performance measurements in one particular scenario show that Single is causing a bottleneck, then we can implement our own variant that works better in that particular scenario.

    Update: as CodeInChaos and Eamon have correctly pointed out, the if test introduced in the optimization is indeed not performed on each item, only within the predicate match block. I have in my example completely overlooked the fact that the proposed changes will not affect the overall performance of the implementation.

    I agree that introducing the optimization would probably benefit all scenarios. It is good to see though that eventually, the decision to implement the optimization is made on the basis of performance measurements.

  • 2020-12-15 03:54

    It seems very clear to me.

    Single is intended for the case where the caller knows that the enumeration contains exactly one match, since in any other case an expensive exception is thrown.

    For this use case, the overload that takes a predicate must iterate over the whole enumeration. It is slightly faster to do so without an additional condition on every loop.

    In my view the current implementation is correct: it is optimized for the expected use case of an enumeration that contains exactly one matching element.

  • 2020-12-15 04:00

    I think it's a premature optimization "bug".

    Why this is NOT reasonable behavior due to side effects

    Some have argued that due to side effects, it should be expected that the entire list is evaluated. After all, in the correct case (the sequence indeed has just 1 element) it is completely enumerated, and for consistency with this normal case it's nicer to enumerate the entire sequence in all cases.

    Although that's a reasonable argument, it flies in the face of the general practice throughout the LINQ libraries: they use lazy evaluation everywhere. It's not general practice to fully enumerate sequences except where absolutely necessary; indeed, several methods prefer using IList.Count when available over any iteration at all - even when that iteration may have side effects.

    Further, .Single() without predicate does not exhibit this behavior: that terminates as soon as possible. If the argument were that .Single() should respect side-effects of enumeration, you'd expect all overloads to do so equivalently.

    Why the case for speed doesn't hold

    Peter Lillevold made the interesting observation that it may be faster to do...

    foreach(var elem in elems)
        if(pred(elem)) {
            retval=elem;
            count++;
        }
    if(count!=1)...
    

    than

    foreach(var elem in elems)
        if(pred(elem)) {
            retval=elem;
            count++;
            if(count>1) ...
        }
    if(count==0)...
    

    After all, the second version, which would exit the iteration as soon as the first conflict is detected, would require an extra test in the loop - a test which in the "correct" case is purely ballast. Neat theory, right?

    Except that's not borne out by the numbers; for example on my machine (YMMV) Enumerable.Range(0,100000000).Where(x=>x==123).Single() is actually faster than Enumerable.Range(0,100000000).Single(x=>x==123)!

    It's possibly a JITter quirk of this precise expression on this machine - I'm not claiming that Where followed by predicateless Single is always faster.

    But whatever the case, the fail-fast solution is very unlikely to be significantly slower. After all, even in the normal case, we're dealing with a cheap branch: a branch that is never taken and thus easy on the branch predictor. And of course, the branch is only ever encountered when pred holds - that's once per call in the normal case. That cost is simply negligible compared to the cost of the delegate call pred and its implementation, plus the cost of the interface methods .MoveNext() and .get_Current() and their implementations.

    It's simply extremely unlikely that you'll notice the performance degradation caused by one predictable branch in comparison to all that other abstraction penalty - not to mention the fact that most sequences and predicates actually do something themselves.
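
    To make the two loop shapes concrete, here is a sketch that writes both out in full and instruments the fail-fast one. The names SingleDeferred, SingleFailFast and ExtraChecks are made up for this illustration, and the exception messages are placeholders rather than the BCL's.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class LoopShapes
    {
        // Instrumentation only: how many times the extra `count > 1` test ran.
        public static int ExtraChecks;

        // First shape: count every match, validate once after the loop.
        public static T SingleDeferred<T>(IEnumerable<T> elems, Func<T, bool> pred)
        {
            T retval = default(T);
            long count = 0;
            foreach (T elem in elems)
            {
                if (pred(elem))
                {
                    retval = elem;
                    count++;
                }
            }
            if (count != 1) throw new InvalidOperationException("expected exactly one match");
            return retval;
        }

        // Second shape: same loop, but give up as soon as a second match appears.
        public static T SingleFailFast<T>(IEnumerable<T> elems, Func<T, bool> pred)
        {
            T retval = default(T);
            long count = 0;
            foreach (T elem in elems)
            {
                if (pred(elem))
                {
                    retval = elem;
                    count++;
                    ExtraChecks++; // the extra test is only reached inside the match branch
                    if (count > 1) throw new InvalidOperationException("more than one match");
                }
            }
            if (count == 0) throw new InvalidOperationException("no match");
            return retval;
        }
    }
    ```

    On a sequence with exactly one match, e.g. LoopShapes.SingleFailFast(Enumerable.Range(0, 1000000), x => x == 123), ExtraChecks ends at 1: the extra test runs once per match, not once per element.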

  • 2020-12-15 04:02

    That does appear to be a bad implementation, in my opinion.

    Just to illustrate the potential severity of the problem:

    var oneMillion = Enumerable.Range(1, 1000000)
                               .Select(x => { Console.WriteLine(x); return x; });
    
    int firstEven = oneMillion.Single(x => x % 2 == 0);
    

    The above will output all the integers from 1 to 1000000 before throwing an exception.

    It's a head-scratcher for sure.
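
    One way around it, as a hedged sketch: filter with Where and then call the overload without a predicate, which another answer here notes stops as soon as it can. The class and method names below are invented for the example, and a counter replaces Console.WriteLine so the effect can be checked.

    ```csharp
    using System;
    using System.Linq;

    public static class PullCount
    {
        // Returns how many source elements were pulled before Single() gave up.
        public static int PulledBeforeThrow()
        {
            int pulled = 0;
            var oneMillion = Enumerable.Range(1, 1000000)
                                       .Select(x => { pulled++; return x; });
            try
            {
                // Filter first, then use the overload that takes no predicate.
                oneMillion.Where(x => x % 2 == 0).Single();
            }
            catch (InvalidOperationException)
            {
                // Expected: both 2 and 4 match, so the sequence is not single.
            }
            return pulled;
        }
    }
    ```

    PulledBeforeThrow() returns 4: the elements 1 through 4 are pulled, and the exception is thrown as soon as the second even number shows up.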

  • 2020-12-15 04:05

    I only found this question after filing a report at https://connect.microsoft.com/VisualStudio/feedback/details/810457/public-static-tsource-single-tsource-this-ienumerable-tsource-source-func-tsource-bool-predicate-doesnt-throw-immediately-on-second-matching-result#

    The side-effect argument doesn't hold water, because:

    1. Having side-effects isn't really functional, and they're called Func for a reason.
    2. If you do want side-effects, it makes no more sense to claim the version that has the side-effects throughout the whole sequence is desirable than it does to claim so for the version that throws immediately.
    3. It does not match the behaviour of First or the other overload of Single.
    4. It does not match at least some other implementations of Single, e.g. Linq2SQL uses TOP 2 to ensure that only the two matching cases needed to test for more than one match are returned.
    5. We can construct cases where we should expect a program to halt, but it does not halt.
    6. We can construct cases where OverflowException is thrown, which is not documented behaviour, and hence clearly a bug.
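
    Points 4 and 5 can be sketched together. Where(...).Take(2) is a rough in-memory analogue of TOP 2, and unlike a loop that insists on reaching the end, it halts on an endless sequence once two matches have been seen. Naturals and MatchCountUpToTwo are hypothetical names for this illustration.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class Top2Sketch
    {
        // Hypothetical endless source: 0, 1, 2, ... (wraps around on overflow, never ends).
        public static IEnumerable<int> Naturals()
        {
            int i = 0;
            while (true)
            {
                yield return i;
                unchecked { i++; }
            }
        }

        // 0 = no match, 1 = exactly one match, 2 = more than one match.
        // Enumeration stops once the second match has been produced.
        public static int MatchCountUpToTwo<T>(IEnumerable<T> source, Func<T, bool> predicate)
        {
            return source.Where(predicate).Take(2).Count();
        }
    }
    ```

    Top2Sketch.MatchCountUpToTwo(Top2Sketch.Naturals(), x => x % 2 == 0) returns 2 after pulling only a handful of elements. With fewer than two matches in an endless sequence it would still never return, which no implementation can avoid.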

    Most importantly of all, if we expected the sequence to have only one matching element and it does not, then something has clearly gone wrong. Apart from the general principle that the only thing you should do upon detecting an error state is clean up before throwing (and this implementation delays that), the case of a sequence having more than one matching element is going to overlap with the case of a sequence having more elements in total than expected - perhaps because the sequence has a bug that is causing it to loop unexpectedly. So it is precisely in one possible set of bugs that should trigger the exception that the exception is most delayed.

    Edit:

    Peter Lillevold's mention of a repeated test may be a reason why the author chose to take the approach they did, as an optimisation for the non-exceptional case. If so it was needless though, even aside from Eamon Nerbonne showing it wouldn't improve much. There's no need to have a repeated test in the initial loop, as we can just change what we're testing for upon the first match:

    public static TSource Single<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
    {
      if(source == null)
        throw new ArgumentNullException("source");
      if(predicate == null)
        throw new ArgumentNullException("predicate");
      using(IEnumerator<TSource> en = source.GetEnumerator())
      {
        while(en.MoveNext())
        {
          TSource cur = en.Current;
          if(predicate(cur))
          {
            while(en.MoveNext())
              if(predicate(en.Current))
                throw new InvalidOperationException("Sequence contains more than one matching element");
            return cur;
          }
        }
      }
      throw new InvalidOperationException("Sequence contains no matching element");
    }
    