Simple parallelisation for hashset

Submitted by 让人想犯罪 on 2021-01-28 19:41:57

Question


I have 2 loops (nested) and am trying to do a simple parallelisation.

pseudocode:

for item1 in data1 (~100 million rows)
    for item2 in data2 (~100 rows)
        result = process(item1, item2) // couple of if conditions
        hashset.add(result) // in case of a duplicate, I also decide which one to retain

process(item1, item2), to be precise, has 4 if conditions based on values in item1 and item2 (time taken is less than 50ms).

data1 size is Nx17
data2 size is Nx17
result size is 1x17 (result is joined into a string before it is added to the hashset)

max output size: unknown, but I would like to be ready for at least 500 million, which means the hashset would be holding 500 million items. (How to handle so much data in a hashset would be another question, I guess.)

Should I just use a concurrent hashset to make it thread-safe and go with Parallel.ForEach, or should I go with the Task concept?

Please provide some code samples based on your opinion.


Answer 1:


The answer depends a lot on the cost of process(data1, data2). If it is a CPU-intensive operation, then you can surely benefit from Parallel.ForEach. Of course, you should use a concurrent dictionary, or lock around your hash set. You should benchmark to see what works best for you. If process has too little impact on performance, then you will probably gain nothing from the parallelization: the locking on the hash set will kill it all.
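The lock-based variant could be sketched like this. This is a minimal sketch with made-up integer data and a stand-in Process function (the real process logic and item types come from the question, not from here):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class LockedParallelDemo
{
    // Stand-in for the question's process(item1, item2); assumed CPU-bound.
    static int Process(int item1, int item2) => unchecked(item1 * item2);

    static void Main()
    {
        int[] data1 = { 1, 2, 3, 4, 5 };   // stands in for ~100 million rows
        int[] data2 = { 10, 20, 30 };      // stands in for ~100 rows

        var results = new HashSet<int>();
        var gate = new object();

        Parallel.ForEach(data1, item1 =>
        {
            foreach (int item2 in data2)
            {
                int result = Process(item1, item2);
                lock (gate) // serialize access to the non-thread-safe HashSet
                {
                    results.Add(result);
                }
            }
        });

        Console.WriteLine(results.Count); // prints 11 for this sample data
    }
}
```

Note that with a cheap Process like the one above, the per-add lock dominates and the parallel version can be slower than the sequential one, which is exactly the caveat in the paragraph above.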

You should also try to see whether enumerating data2 in the outer loop is faster. It might give you another benefit: you can have a separate hash set for each instance of data2 and then merge the results into one hash set at the end. This avoids locks.
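The merge idea can also be expressed per worker thread with the localInit/localFinally overload of Parallel.ForEach: each worker fills a private HashSet and the sets are merged once at the end, so the hot path needs no lock. A sketch, again with made-up data and a stand-in Process:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class MergeDemo
{
    // Stand-in for the question's process(item1, item2).
    static int Process(int item1, int item2) => unchecked(item1 * item2);

    static void Main()
    {
        int[] data1 = { 1, 2, 3, 4, 5 };
        int[] data2 = { 10, 20, 30 };

        var merged = new HashSet<int>();
        var gate = new object();

        Parallel.ForEach(
            data1,
            () => new HashSet<int>(),            // one private set per worker
            (item1, state, local) =>
            {
                foreach (int item2 in data2)
                    local.Add(Process(item1, item2));
                return local;                    // no locking on the hot path
            },
            local =>
            {
                lock (gate) merged.UnionWith(local); // merge once per worker
            });

        Console.WriteLine(merged.Count); // prints 11 for this sample data
    }
}
```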

Again, you need to do your tests, there is no universal answer here.




Answer 2:


My suggestion is to separate the processing of the data from the saving of the results to the HashSet, because the first is parallelizable but the second is not. You could achieve this separation with the producer-consumer pattern, using a BlockingCollection and threads (or tasks). But I'll show a solution using a more specialized tool, the TPL Dataflow library. I'll assume that the data are two arrays of integers, and the processing function can produce up to 500,000,000 different results:

var data1 = Enumerable.Range(1, 100_000_000).ToArray();
var data2 = Enumerable.Range(1, 100).ToArray();

static int Process(int item1, int item2)
{
    return unchecked(item1 * item2) % 500_000_000;
}

The dataflow pipeline will have two blocks. The first block is a TransformBlock that accepts an item from the data1 array, processes it with all items of the data2 array, and returns a batch of the results (as an int array).

var processBlock = new TransformBlock<int, int[]>(item1 =>
{
    int[] batch = new int[data2.Length];
    for (int j = 0; j < data2.Length; j++)
    {
        batch[j] = Process(item1, data2[j]);
    }
    return batch;
}, new ExecutionDataflowBlockOptions()
{
    BoundedCapacity = 100,
    MaxDegreeOfParallelism = 3 // Configurable
});

The second block is an ActionBlock that receives the processed batches from the first block, and adds the individual results to the HashSet.

var results = new HashSet<int>();
var saveBlock = new ActionBlock<int[]>(batch =>
{
    for (int i = 0; i < batch.Length; i++)
    {
        results.Add(batch[i]);
    }
}, new ExecutionDataflowBlockOptions()
{
    BoundedCapacity = 100,
    MaxDegreeOfParallelism = 1 // Mandatory
});

The line below links the two blocks together, so that the data will flow automatically from the first block to the second:

processBlock.LinkTo(saveBlock,
    new DataflowLinkOptions() { PropagateCompletion = true });

The last step is to feed the first block with the items of the data1 array, and wait for the completion of the whole operation.

for (int i = 0; i < data1.Length; i++)
{
    processBlock.SendAsync(data1[i]).Wait();
}
processBlock.Complete();
saveBlock.Completion.Wait();

The HashSet now contains the results.

A note about the BoundedCapacity option: it controls the flow of the data, so that a fast block upstream does not flood a slow block downstream with data. Configuring this option properly increases the memory and CPU efficiency of the pipeline.

The TPL Dataflow library is built into .NET Core, and is available as a NuGet package (System.Threading.Tasks.Dataflow) for .NET Framework.
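For comparison, the BlockingCollection-based producer-consumer approach mentioned at the start of this answer could be sketched as below. The bounded queue plays the role of BoundedCapacity, parallel producers do the processing, and a single consumer is the only thread that touches the HashSet (same assumed Process function as above):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class BlockingCollectionDemo
{
    // Same assumed processing function as in the Dataflow example.
    static int Process(int item1, int item2)
        => unchecked(item1 * item2) % 500_000_000;

    static void Main()
    {
        int[] data1 = Enumerable.Range(1, 1000).ToArray(); // scaled down
        int[] data2 = Enumerable.Range(1, 100).ToArray();

        // Bounded queue throttles fast producers, like BoundedCapacity.
        using var queue = new BlockingCollection<int[]>(boundedCapacity: 100);

        // Producers: parallel processing, one batch per data1 item.
        var producer = Task.Run(() =>
        {
            Parallel.ForEach(data1,
                new ParallelOptions { MaxDegreeOfParallelism = 3 },
                item1 => queue.Add(
                    data2.Select(item2 => Process(item1, item2)).ToArray()));
            queue.CompleteAdding(); // signals the consumer that no more batches come
        });

        // Single consumer: the only thread that mutates the HashSet.
        var results = new HashSet<int>();
        foreach (int[] batch in queue.GetConsumingEnumerable())
            foreach (int r in batch)
                results.Add(r);

        producer.Wait();
        Console.WriteLine(results.Count);
    }
}
```

This needs more manual wiring than the Dataflow pipeline (completion signaling, error propagation), which is why the answer prefers the more specialized tool.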



Source: https://stackoverflow.com/questions/61456516/simple-parallelisation-for-hashset
