Does it make sense to you to use for every normal foreach a parallel.foreach loop ?
When should I start using parallel.foreach, only iterating 1,000,000 items?
The short answer is no, you should not just use Parallel.ForEach or related constructs on each loop that you can.
Parallel has some overhead, which is not justified in loops with few, fast iterations. Also, break is significantly more complex inside these loops.
Parallel.ForEach is a request to schedule the loop as the task scheduler sees fit, based on number of iterations in the loop, number of CPU cores on the hardware and current load on that hardware. Actual parallel execution is not always guaranteed, and is less likely if there are fewer cores, the number of iterations is low and/or the current load is high.
See also Does Parallel.ForEach limits the number of active threads? and Does Parallel.For use one Task per iteration?
The long answer:
We can classify loops by how they fall on two axes:
A third factor is if the tasks vary in duration very much – for instance if you are calculating points on the Mandelbrot set, some points are quick to calculate, some take much longer.
When there are few, fast iterations it's probably not worth using parallelisation in any way, most likely it will end up slower due to the overheads. Even if parallelisation does speed up a particular small, fast loop, it's unlikely to be of interest: the gains will be small and it's not a performance bottleneck in your application so optimise for readability not performance.
Where a loop has very few, slow iterations and you want more control, you may consider using Tasks to handle them, along the lines of:
var tasks = new List(actions.Length);
foreach(var action in actions)
{
tasks.Add(Task.Factory.StartNew(action));
}
Task.WaitAll(tasks.ToArray());
Where there are many iterations, Parallel.ForEach is in its element.
The Microsoft documentation states that
When a parallel loop runs, the TPL partitions the data source so that the loop can operate on multiple parts concurrently. Behind the scenes, the Task Scheduler partitions the task based on system resources and workload. When possible, the scheduler redistributes work among multiple threads and processors if the workload becomes unbalanced.
This partitioning and dynamic re-scheduling is going to be harder to do effectively as the number of loop iterations decreases, and is more necessary if the iterations vary in duration and in the presence of other tasks running on the same machine.
I ran some code.
The test results below show a machine with nothing else running on it, and no other threads from the .Net Thread Pool in use. This is not typical (in fact in a web-server scenario it is wildly unrealistic). In practice, you may not see any parallelisation with a small number of iterations.
The test code is:
namespace ParallelTests
{
class Program
{
private static int Fibonacci(int x)
{
if (x <= 1)
{
return 1;
}
return Fibonacci(x - 1) + Fibonacci(x - 2);
}
private static void DummyWork()
{
var result = Fibonacci(10);
// inspect the result so it is no optimised away.
// We know that the exception is never thrown. The compiler does not.
if (result > 300)
{
throw new Exception("failed to to it");
}
}
private const int TotalWorkItems = 2000000;
private static void SerialWork(int outerWorkItems)
{
int innerLoopLimit = TotalWorkItems / outerWorkItems;
for (int index1 = 0; index1 < outerWorkItems; index1++)
{
InnerLoop(innerLoopLimit);
}
}
private static void InnerLoop(int innerLoopLimit)
{
for (int index2 = 0; index2 < innerLoopLimit; index2++)
{
DummyWork();
}
}
private static void ParallelWork(int outerWorkItems)
{
int innerLoopLimit = TotalWorkItems / outerWorkItems;
var outerRange = Enumerable.Range(0, outerWorkItems);
Parallel.ForEach(outerRange, index1 =>
{
InnerLoop(innerLoopLimit);
});
}
private static void TimeOperation(string desc, Action operation)
{
Stopwatch timer = new Stopwatch();
timer.Start();
operation();
timer.Stop();
string message = string.Format("{0} took {1:mm}:{1:ss}.{1:ff}", desc, timer.Elapsed);
Console.WriteLine(message);
}
static void Main(string[] args)
{
TimeOperation("serial work: 1", () => Program.SerialWork(1));
TimeOperation("serial work: 2", () => Program.SerialWork(2));
TimeOperation("serial work: 3", () => Program.SerialWork(3));
TimeOperation("serial work: 4", () => Program.SerialWork(4));
TimeOperation("serial work: 8", () => Program.SerialWork(8));
TimeOperation("serial work: 16", () => Program.SerialWork(16));
TimeOperation("serial work: 32", () => Program.SerialWork(32));
TimeOperation("serial work: 1k", () => Program.SerialWork(1000));
TimeOperation("serial work: 10k", () => Program.SerialWork(10000));
TimeOperation("serial work: 100k", () => Program.SerialWork(100000));
TimeOperation("parallel work: 1", () => Program.ParallelWork(1));
TimeOperation("parallel work: 2", () => Program.ParallelWork(2));
TimeOperation("parallel work: 3", () => Program.ParallelWork(3));
TimeOperation("parallel work: 4", () => Program.ParallelWork(4));
TimeOperation("parallel work: 8", () => Program.ParallelWork(8));
TimeOperation("parallel work: 16", () => Program.ParallelWork(16));
TimeOperation("parallel work: 32", () => Program.ParallelWork(32));
TimeOperation("parallel work: 64", () => Program.ParallelWork(64));
TimeOperation("parallel work: 1k", () => Program.ParallelWork(1000));
TimeOperation("parallel work: 10k", () => Program.ParallelWork(10000));
TimeOperation("parallel work: 100k", () => Program.ParallelWork(100000));
Console.WriteLine("done");
Console.ReadLine();
}
}
}
the results on a 4-core Windows 7 machine are:
serial work: 1 took 00:02.31
serial work: 2 took 00:02.27
serial work: 3 took 00:02.28
serial work: 4 took 00:02.28
serial work: 8 took 00:02.28
serial work: 16 took 00:02.27
serial work: 32 took 00:02.27
serial work: 1k took 00:02.27
serial work: 10k took 00:02.28
serial work: 100k took 00:02.28
parallel work: 1 took 00:02.33
parallel work: 2 took 00:01.14
parallel work: 3 took 00:00.96
parallel work: 4 took 00:00.78
parallel work: 8 took 00:00.84
parallel work: 16 took 00:00.86
parallel work: 32 took 00:00.82
parallel work: 64 took 00:00.80
parallel work: 1k took 00:00.77
parallel work: 10k took 00:00.78
parallel work: 100k took 00:00.77
done
Running code Compiled in .Net 4 and .Net 4.5 give much the same results.
The serial work runs are all the same. It doesn't matter how you slice it, it runs in about 2.28 seconds.
The parallel work with 1 iteration is slightly longer than no parallelism at all. 2 items is shorter, so is 3 and with 4 or more iterations is all about 0.8 seconds.
It is using all cores, but not with 100% efficiency. If the serial work was divided 4 ways with no overhead it would complete in 0.57 seconds (2.28 / 4 = 0.57).
In other scenarios I saw no speed-up at all with parallel 2-3 iterations. You do not have fine-grained control over that with Parallel.ForEach and the algorithm may decide to "partition " them into just 1 chunk and run it on 1 core if the machine is busy.