I came across this implementation in Enumerable.cs by reflector.
public static TSource Single(this IEnumerable source, Func<
Yes, I do find it slightly strange especially because the overload that doesn't take a predicate (i.e. works on just the sequence) does seem to have the quick-throw 'optimization'.
In the BCL's defence, however, I would say that the InvalidOperationException that Single throws is a boneheaded exception that shouldn't normally be used for control flow. It's not necessary for such cases to be optimized by the library.
Code that uses Single where zero or multiple matches is a perfectly valid possibility, such as:
try
{
    var item = source.Single(predicate);
    DoSomething(item);
}
catch(InvalidOperationException)
{
    DoSomethingElseUnexceptional();
}
should be refactored to code that doesn't use the exception for control-flow, such as (only a sample; this can be implemented more efficiently):
var firstTwo = source.Where(predicate).Take(2).ToArray();
if(firstTwo.Length == 1)
{
    // Note that this won't fail. If it does, this code has a bug.
    DoSomething(firstTwo.Single());
}
else
{
    DoSomethingElseUnexceptional();
}
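One way the "more efficient" version might look is a single pass that stops at the second match and allocates no array. This is only a sketch: TrySingle is a hypothetical helper name of mine, not something the BCL provides.

```csharp
using System;
using System.Collections.Generic;

public static class SingleSketch
{
    // Hypothetical helper (not part of the BCL): returns true and sets
    // 'result' only when exactly one element satisfies the predicate.
    public static bool TrySingle<T>(this IEnumerable<T> source, Func<T, bool> predicate, out T result)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (predicate == null) throw new ArgumentNullException("predicate");

        result = default(T);
        bool found = false;
        foreach (T item in source)
        {
            if (!predicate(item)) continue;
            if (found)
            {
                // Second match: stop enumerating and report failure.
                result = default(T);
                return false;
            }
            result = item;
            found = true;
        }
        // True only if exactly one match was seen; false for zero matches.
        return found;
    }
}

public static class TrySingleDemo
{
    public static void Main()
    {
        int[] numbers = { 1, 2, 3, 4 };
        int value;
        if (numbers.TrySingle(n => n > 3, out value))
            Console.WriteLine("Exactly one match: " + value);
        else
            Console.WriteLine("Zero or multiple matches");
    }
}
```

The caller branches on the bool instead of catching InvalidOperationException, so the "unexceptional" path never involves an exception at all.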
In other words, we should leave the use of Single to cases when we expect the sequence to contain only one match. It should behave identically to First but with the additional run-time assertion that the sequence doesn't contain multiple matches. Like any other assertion, failure, i.e. cases when Single throws, should be used to represent bugs in the program (either in the method running the query or in the arguments passed to it by the caller).
This leaves us with two cases:
Single to consume the entire sequence anyway to assert our claim. There's no benefit to the 'optimization'. In fact, one could argue that the sample implementation of the 'optimization' provided by the OP will actually be slower because of the check on every iteration of the loop.
To sum up, if the 'poor implementation' is biting you performance-wise in production, either:
Single incorrectly.
EDIT: Clarified my point.
EDIT: Here's a valid use of Single, where failure indicates bugs in the calling code (bad argument):
public static User GetUserById(this IEnumerable<User> users, string id)
{
    if(users == null)
        throw new ArgumentNullException("users");
    // Perfectly fine if documented that a failure in the query
    // is treated as an exceptional circumstance. Caller's job
    // to guarantee pre-condition.
    return users.Single(user => user.Id == id);
}
When considering this implementation we must remember that this is the BCL: general code that is supposed to work well enough in all sorts of scenarios.
First, take these scenarios:
The original algorithm will work well enough for 10 items, but with 1M items it will waste a lot of cycles. So in these cases, where we know that there are two or more matches early in the sequence, the proposed optimization would have a nice effect.
Then, look at these scenarios:
In these scenarios the algorithm is still required to inspect every item in the lists. There is no shortcut. The original algorithm will perform well enough; it fulfills the contract. Changing the algorithm by introducing an if on each iteration will actually decrease performance. For 10 items it will be negligible, but for 1M it will be a big hit.
IMO, the original implementation is the correct one, since it is good enough for most scenarios. Knowing the implementation of Single is good though, because it enables us to make smart decisions based on what we know about the sequences we use it on. If performance measurements in one particular scenario show that Single is causing a bottleneck, well: then we can implement our own variant that works better in that particular scenario.
Update: as CodeInChaos and Eamon have correctly pointed out, the if test introduced in the optimization is indeed not performed on each item, only within the predicate match block. I have in my example completely overlooked the fact that the proposed changes will not affect the overall performance of the implementation.
I agree that introducing the optimization would probably benefit all scenarios. It is good to see though that eventually, the decision to implement the optimization is made on the basis of performance measurements.
It seems very clear to me.
Single is intended for the case where the caller knows that the enumeration contains exactly one match, since in any other case an expensive exception is thrown.
For this use case, the overload that takes a predicate must iterate over the whole enumeration. It is slightly faster to do so without an additional condition on every loop.
In my view the current implementation is correct: it is optimized for the expected use case of an enumeration that contains exactly one matching element.
I think it's a premature optimization "bug".
Some have argued that due to side effects, it should be expected that the entire list is evaluated. After all, in the correct case (the sequence indeed has just 1 element) it is completely enumerated, and for consistency with this normal case it's nicer to enumerate the entire sequence in all cases.
Although that's a reasonable argument, it flies in the face of the general practice throughout the LINQ libraries: they use lazy evaluation everywhere. It's not general practice to fully enumerate sequences except where absolutely necessary; indeed, several methods prefer using IList.Count when available over any iteration at all - even when that iteration may have side effects.
Further, .Single() without a predicate does not exhibit this behavior: it terminates as soon as possible. If the argument were that .Single() should respect side-effects of enumeration, you'd expect all overloads to do so equivalently.
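A rough way to see the difference between the overloads is to count how many elements each one pulls from the source before it throws. The sketch below assumes nothing beyond System.Linq; Counted and ElementsConsumed are hypothetical helper names of mine. What the first line prints depends on the runtime's implementation of Single(predicate), so I'm not quoting a number for it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EnumerationCountDemo
{
    // Hypothetical helper: passes elements through unchanged and counts
    // how many were actually pulled from the underlying sequence.
    public static IEnumerable<T> Counted<T>(IEnumerable<T> source, int[] counter)
    {
        foreach (T item in source)
        {
            counter[0]++;
            yield return item;
        }
    }

    // Runs the query over 1..size and reports how many elements it consumed.
    public static int ElementsConsumed(Func<IEnumerable<int>, int> query, int size)
    {
        int[] counter = new int[1];
        try
        {
            query(Counted(Enumerable.Range(1, size), counter));
        }
        catch (InvalidOperationException)
        {
            // Expected here: the sequence has more than one even number.
        }
        return counter[0];
    }

    public static void Main()
    {
        // Overload with a predicate.
        Console.WriteLine(ElementsConsumed(s => s.Single(x => x % 2 == 0), 1000000));
        // Where followed by the predicateless overload.
        Console.WriteLine(ElementsConsumed(s => s.Where(x => x % 2 == 0).Single(), 1000000));
    }
}
```

The second query stops after the fourth element (the second even number), because the predicateless Single only needs to see that a second item exists.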
Peter Lillevold made the interesting observation that it may be faster to do...
foreach(var elem in elems)
    if(pred(elem)) {
        retval=elem;
        count++;
    }
if(count!=1)...
than
foreach(var elem in elems)
    if(pred(elem)) {
        retval=elem;
        count++;
        if(count>1) ...
    }
if(count==0)...
After all, the second version, which would exit the iteration as soon as the first conflict is detected, would require an extra test in the loop - a test which in the "correct" case is purely ballast. Neat theory, right?
Except that's not borne out by the numbers; for example, on my machine (YMMV) Enumerable.Range(0,100000000).Where(x=>x==123).Single() is actually faster than Enumerable.Range(0,100000000).Single(x=>x==123)!
It's possibly a JITter quirk of this precise expression on this machine - I'm not claiming that Where followed by predicateless Single is always faster.
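If you want to try this comparison yourself, a crude Stopwatch harness along these lines should do. It is only a sketch: TimeMs is a hypothetical helper, each query runs once with no warm-up, and the numbers will vary with the runtime, JIT and machine.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class SingleTimingSketch
{
    // Hypothetical helper: runs the query once and returns the elapsed milliseconds.
    public static long TimeMs(Func<int> query, out int result)
    {
        Stopwatch sw = Stopwatch.StartNew();
        result = query();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        const int n = 100000000;
        int a, b;

        // Where followed by the predicateless overload.
        long viaWhere = TimeMs(() => Enumerable.Range(0, n).Where(x => x == 123).Single(), out a);
        // The overload that takes the predicate directly.
        long viaPredicate = TimeMs(() => Enumerable.Range(0, n).Single(x => x == 123), out b);

        Console.WriteLine("Where + Single(): {0} ms (result {1})", viaWhere, a);
        Console.WriteLine("Single(predicate): {0} ms (result {1})", viaPredicate, b);
    }
}
```

Build in release mode and run it several times before reading anything into the difference.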
But whatever the case, the fail-fast solution is very unlikely to be significantly slower. After all, even in the normal case, we're dealing with a cheap branch: a branch that is never taken and thus easy on the branch predictor. And of course, the branch is only ever encountered when pred holds - that's once per call in the normal case. That cost is simply negligible compared to the cost of the delegate call pred and its implementation, plus the cost of the interface methods .MoveNext() and .get_Current() and their implementations.
It's simply extremely unlikely that you'll notice the performance degradation caused by one predictable branch in comparison to all that other abstraction penalty - not to mention the fact that most sequences and predicates actually do something themselves.
That does appear to be a bad implementation, in my opinion.
Just to illustrate the potential severity of the problem:
var oneMillion = Enumerable.Range(1, 1000000)
    .Select(x => { Console.WriteLine(x); return x; });
int firstEven = oneMillion.Single(x => x % 2 == 0);
The above will output all the integers from 1 to 1000000 before throwing an exception.
It's a head-scratcher for sure.
I only found this question after filing a report at https://connect.microsoft.com/VisualStudio/feedback/details/810457/public-static-tsource-single-tsource-this-ienumerable-tsource-source-func-tsource-bool-predicate-doesnt-throw-immediately-on-second-matching-result#
The side-effect argument doesn't hold water, because:
Func for a reason.
First or the other overload of Single.
Single, e.g. Linq2SQL uses TOP 2 to ensure that only the two matching cases needed to test for more than one match are returned.
OverflowException is thrown, which is not documented behaviour, and hence clearly a bug.
Most importantly of all, if we're in a condition where we expected the sequence to have only one matching element, and yet we're not, then something has clearly gone wrong. Apart from the general principle that the only thing you should do upon detecting an error state is clean-up (and this implementation delays that) before throwing, the case of a sequence having more than one matching element is going to overlap with the case of a sequence having more elements in total than expected - perhaps because the sequence has a bug that is causing it to loop unexpectedly. So it's precisely in one possible set of bugs that should trigger the exception that the exception is most delayed.
Edit:
Peter Lillevold's mention of a repeated test may be a reason why the author chose to take the approach they did, as an optimisation for the non-exceptional case. If so it was needless though, even aside from Eamon Nerbonne showing it wouldn't improve much. There's no need to have a repeated test in the initial loop, as we can just change what we're testing for upon the first match:
public static TSource Single<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)
{
    if(source == null)
        throw new ArgumentNullException("source");
    if(predicate == null)
        throw new ArgumentNullException("predicate");
    using(IEnumerator<TSource> en = source.GetEnumerator())
    {
        while(en.MoveNext())
        {
            TSource cur = en.Current;
            if(predicate(cur))
            {
                while(en.MoveNext())
                    if(predicate(en.Current))
                        throw new InvalidOperationException("Sequence contains more than one matching element");
                return cur;
            }
        }
    }
    throw new InvalidOperationException("Sequence contains no matching element");
}
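To illustrate the "sequence loops unexpectedly" case, here is a small sketch that runs the same logic against a source that never ends. The method is a copy of the one above, renamed SingleFailFast (a hypothetical name) so it doesn't clash with Enumerable.Single; Endless is likewise just an invented stand-in for a buggy sequence.

```csharp
using System;
using System.Collections.Generic;

public static class FailFastSingleDemo
{
    // Same logic as the method above, under a hypothetical name.
    public static T SingleFailFast<T>(this IEnumerable<T> source, Func<T, bool> predicate)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (predicate == null) throw new ArgumentNullException("predicate");
        using (IEnumerator<T> en = source.GetEnumerator())
        {
            while (en.MoveNext())
            {
                T cur = en.Current;
                if (predicate(cur))
                {
                    // After the first match, only look for a conflicting second match.
                    while (en.MoveNext())
                        if (predicate(en.Current))
                            throw new InvalidOperationException("Sequence contains more than one matching element");
                    return cur;
                }
            }
        }
        throw new InvalidOperationException("Sequence contains no matching element");
    }

    // Stand-in for a buggy source that loops forever.
    public static IEnumerable<int> Endless()
    {
        while (true) yield return 1;
    }

    public static void Main()
    {
        try
        {
            Endless().SingleFailFast(x => x == 1);
        }
        catch (InvalidOperationException ex)
        {
            // Reached after only two elements have been read.
            Console.WriteLine(ex.Message);
        }
    }
}
```

This halts and prints "Sequence contains more than one matching element", whereas an implementation that counts every match before deciding would keep enumerating the endless source.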