Parallel.For partitioning

Posted by 会有一股神秘感。 on 2021-02-18 19:42:56

Question


How is partitioning done for something like

Parallel.For(0, buffer.Length, i => buffer[i] = 0);

My assumption was that for an n-core machine, the work would be partitioned n ways and n threads would carry out the payload. For example, with buffer.Length = 100 and n = 4, each thread would get the blocks 0-24, 25-49, 50-74 and 75-99. (The 100-element array is just to illustrate the partitioning; please consider an array of millions of items.)

Is this a fair assumption? Please discuss.
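As a quick way to test that assumption, one could count how many indices each worker thread ends up handling; a minimal sketch (the dictionary is only bookkeeping for observation, and the exact split will vary from run to run):

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class ObservePartitioning
{
    static void Main()
    {
        // Count how many loop indices each thread processes.
        var hits = new ConcurrentDictionary<int, int>();

        Parallel.For(0, 100, i =>
            hits.AddOrUpdate(Thread.CurrentThread.ManagedThreadId, 1, (_, n) => n + 1));

        foreach (var kv in hits)
            Console.WriteLine($"thread {kv.Key} processed {kv.Value} indices");
    }
}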

I noticed that Array.Clear(...) would perform much faster in this specific scenario. How do you rationalize this?
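For reference, a minimal sketch of that comparison (the buffer size is arbitrary and the Stopwatch timings are only indicative):

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class ClearBenchmark
{
    static void Main()
    {
        var buffer = new byte[100_000_000]; // a large array, as suggested above

        var sw = Stopwatch.StartNew();
        Parallel.For(0, buffer.Length, i => buffer[i] = 0);
        Console.WriteLine($"Parallel.For: {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        Array.Clear(buffer, 0, buffer.Length);
        Console.WriteLine($"Array.Clear:  {sw.ElapsedMilliseconds} ms");
    }
}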


Answer 1:


First, for the easy part. A 100-element array is so small that it easily fits in a core's cache. Besides, clearing an array is equivalent to setting a memory area to zeros, something that is available as a CPU command and is therefore as fast as you can make it.

In fact, SSE commands and parallel-optimized memory controllers mean that the chipset can probably clear memory in parallel using only a single CPU command.

On the other hand, Parallel.For introduces some overhead: it has to partition the data, create the appropriate tasks to work on it, collect the results and return the final result. Below Parallel.For, the runtime has to copy the data to each core, handle memory synchronization, collect the results, etc. In your example, this overhead can be significantly larger than the actual time needed to zero the memory locations.

In fact, for small sizes it is quite possible that 99.999% of the overhead is memory synchronization, as each core tries to access the same memory page. Remember, memory locking is at the page level, and you can fit 2K 16-bit ints in a 4K memory page.

As for how PLINQ schedules tasks - there are many different partitioning schemes used, depending on the operators you use. Check Partitioning in LINQ for a nice intro. In any case, the partitioner will try to determine whether there is any benefit to be gained from partitioning and may not partition the data at all.

In your case, the partitioner will probably use range partitioning. Your payload uses only a few CPU cycles, so all you are measuring is the overhead of partitioning, creating tasks, managing synchronization and collecting the results.
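One common way to reduce that per-element overhead, assuming the range-partitioning behaviour described above, is to hand the loop whole index ranges via Partitioner.Create, so the delegate cost is paid once per range rather than once per element; a sketch:

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class RangeExample
{
    static void Main()
    {
        var buffer = new byte[100_000_000];

        // Each task receives a (fromInclusive, toExclusive) range and runs
        // a plain inner loop, so the delegate is invoked once per range.
        Parallel.ForEach(Partitioner.Create(0, buffer.Length), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                buffer[i] = 0;
        });
    }
}

This keeps the parallel machinery but amortizes its cost over thousands of elements at a time instead of one.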

A better benchmark would be to run some aggregations on a large array, e.g. counts, averages and the like.
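For example, a sum and an average over a large array give the partitioner real work to amortize its overhead against (a minimal sketch; the array contents here are arbitrary):

using System;
using System.Linq;

class AggregateExample
{
    static void Main()
    {
        var data = Enumerable.Range(0, 50_000_000)
                             .Select(i => i % 1000)
                             .ToArray();

        // Reductions give each worker thread meaningful per-element work
        // and a per-thread partial result to merge at the end.
        long sum = data.AsParallel().Sum(x => (long)x);
        double avg = data.AsParallel().Average();

        Console.WriteLine($"sum = {sum}, avg = {avg}");
    }
}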




Answer 2:


The optimisation of PFX/PLINQ is complex. However, here is the basic picture...

Input-Side Optimisation:

PLINQ has three partitioning strategies for assigning input elements to threads:

Strategy             Element allocation   Relative performance
Chunk partitioning   Dynamic              Average
Range partitioning   Static               Poor to excellent
Hash partitioning    Static               Poor

For query operators that require comparing elements (GroupBy, Join, GroupJoin, etc.) PLINQ always chooses hash partitioning, which is relatively inefficient because it must pre-calculate the hash code of every element (so that elements with identical hash codes can be processed on the same thread).
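For instance, a sketch of a query that forces hash partitioning (the word list is arbitrary):

using System;
using System.Linq;

class HashPartitionExample
{
    static void Main()
    {
        string[] words = { "alpha", "beta", "gamma", "delta", "epsilon" };

        // GroupBy is a comparing operator, so PLINQ hash-partitions the input:
        // each element's key is hashed so equal keys meet on the same thread.
        var byLength = words.AsParallel()
                            .GroupBy(w => w.Length);

        foreach (var g in byLength)
            Console.WriteLine($"{g.Key}: {string.Join(", ", g)}");
    }
}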

For all other query operators, you can choose either range or chunk partitioning. By default, if the input sequence is indexable (if it is an array or implements IList<T>), PLINQ will choose range partitioning; otherwise it will choose chunk partitioning.

Range partitioning is faster with long sequences for which every element takes a similar amount of CPU time. Otherwise, chunk partitioning is faster.
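If the per-element cost does vary, one can opt an indexable source back into chunk (load-balanced) partitioning by wrapping it in a partitioner; a sketch, where GetData and Process are hypothetical placeholders:

using System;
using System.Collections.Concurrent;
using System.Linq;

class ChunkVsRange
{
    static void Main()
    {
        int[] data = GetData(); // indexable, so PLINQ would default to range partitioning

        // loadBalance: true requests chunk (load-balanced) partitioning instead,
        // which helps when per-element cost varies widely.
        var results = Partitioner.Create(data, loadBalance: true)
                                 .AsParallel()
                                 .Select(Process)
                                 .ToArray();

        Console.WriteLine(results.Length);
    }

    static int[] GetData() => new int[1000]; // hypothetical input
    static int Process(int x) => x * x;      // stand-in workload
}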

How they work:

Chunk partitioning works by having each worker thread periodically grab small 'chunks' of elements from the input sequence to process. PLINQ starts by allocating very small chunks and then increases the size as the query progresses; this ensures that small sequences are effectively parallelized and large sequences don't cause excessive 'round-tripping'. If a worker thread happens to finish its job quickly, it will end up getting more chunks. This system keeps every thread equally busy and the machine's cores 'balanced'. The downside of this method is that fetching elements from a shared input sequence requires locking, and this can add overhead.

Range partitioning bypasses the normal input-side enumeration and pre-allocates an equal number of elements to each worker thread, avoiding contention on the input sequence. If a thread finishes early with this method, it will sit idle until the other threads have finished.

Parallel For and ForEach:

By default, for For/ForEach loops, PLINQ will use range partitioning.

I hope this helps.



Source: https://stackoverflow.com/questions/17785049/parallel-for-partitioning
