Question
I would like to ask a rather general question (though I'm particularly interested in how to achieve it in C#).
I have a huge file which I want to read in chunks, process the chunks in parallel on several threads to make the processing faster, and then write the processed data to another file in the same order as the original chunks were read (i.e. the first chunk read from the input file must be processed and saved first to the output file, the second chunk processed and saved second, and so on).
I was thinking of implementing producer-consumer somehow (i.e. reading from the original file continuously in chunks and feeding a queue from which a set of worker threads would read and process the data), but I have no idea how to write the processed data from these threads to the output file while keeping the order of the data. Even if I put the processed blocks produced by the threads into another queue, from which they could be consumed and written to the output file, I would still have no control over the order in which the threads finish (and thus no way to write the blocks to the output file in the correct order).
Any suggestions?
I'm new to this stuff so even theoretical hints would mean a lot to me.
Answer 1:
Although this question is a little open-ended and shows no code, there are various approaches to this problem, and they all depend on your exact requirements and limitations.
First and foremost, though: if the bottleneck you are trying to solve is IO, parallelizing anything is unlikely to help.
However, if you need to retain order after processing CPU-bound work in parallel, there are various TPL mechanisms which maintain ordering, such as:
- PLINQ, which has ParallelEnumerable.AsOrdered
- TPL Dataflow blocks, which have parallel options with DataflowBlockOptions.EnsureOrdered
- Reactive Extensions (Rx), which I believe has similar ordering operators
The easiest approach (assuming the data can't be read and written in discrete blocks) would be to read file chunks (a buffer) synchronously, process the data in parallel using the order-preserving functionality, and write them back to the file in batches, as sketched below. You would obviously have to experiment with the amount of file data you read and write at a time (the buffer size) to see what works for your situation.
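For instance, here is a minimal sketch of that approach using PLINQ's AsOrdered (not from the original answer; the file names and the ProcessLine transform are hypothetical):

using System;
using System.IO;
using System.Linq;

var processedLines = File.ReadLines("input.txt")     // lazy, synchronous read
    .AsParallel()
    .AsOrdered()                                     // preserve the source order
    .WithDegreeOfParallelism(Environment.ProcessorCount)
    .Select(line => ProcessLine(line));              // CPU-bound work per line

File.WriteAllLines("output.txt", processedLines);    // results arrive in order

// Hypothetical placeholder for the actual CPU-bound transformation
static string ProcessLine(string line) => line.ToUpperInvariant();

AsOrdered adds some buffering overhead compared to unordered PLINQ, but it guarantees that the output sequence matches the input sequence regardless of which thread finishes first.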
It's worth noting that you can also achieve fully asynchronous read/write IO, but that would likely rely on the file having a fixed-size record structure (so the blocks are mutually exclusive and can be handled independently).
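To illustrate that point (again, a sketch only, not from the answer): assuming .NET 6+ for File.OpenHandle and the RandomAccess API, a fixed record length, and a hypothetical Process transform that preserves each record's length, every record's output offset is computable up front, so the completion order of the parallel workers doesn't matter:

using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

const int RecordSize = 4096; // assumed fixed record length

using var source = File.OpenHandle("input.dat");
using var target = File.OpenHandle("output.dat", FileMode.Create, FileAccess.Write);

long length = RandomAccess.GetLength(source);
int recordCount = (int)((length + RecordSize - 1) / RecordSize);

await Parallel.ForEachAsync(Enumerable.Range(0, recordCount), async (i, ct) =>
{
    long offset = (long)i * RecordSize;
    byte[] buffer = new byte[RecordSize];
    int read = await RandomAccess.ReadAsync(source, buffer, offset, ct);
    byte[] processed = Process(buffer.AsMemory(0, read)); // hypothetical transform
    await RandomAccess.WriteAsync(target, processed, offset, ct); // positional write
});

// Hypothetical stand-in; it must return a block of the same length,
// otherwise the positional writes would not line up.
static byte[] Process(ReadOnlyMemory<byte> record) => record.ToArray();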
Answer 2:
Here is a method you could use to process a file in chunks using parallelism, and write the processed chunks to another file, keeping the original order. This method uses the TPL Dataflow library, available as the System.Threading.Tasks.Dataflow NuGet package. You don't need to install this package if you are targeting .NET Core, since TPL Dataflow is embedded in that platform. Another dependency is the System.Interactive package, which includes the Buffer method used to chunkify the file's lines.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;
//...
public static async Task ProcessFile(string sourcePath, Encoding sourceEncoding,
    string targetPath, Encoding targetEncoding,
    Func<string, string> lineTransformation,
    int degreeOfParallelism, int chunkSize)
{
    using StreamWriter writer = new StreamWriter(targetPath, false, targetEncoding);
    var cts = new CancellationTokenSource();
    var processingBlock = new TransformBlock<IList<string>, IList<string>>(chunk =>
    {
        return chunk.Select(line => lineTransformation(line)).ToArray();
    }, new ExecutionDataflowBlockOptions()
    {
        MaxDegreeOfParallelism = degreeOfParallelism,
        BoundedCapacity = 100, // prevent excessive buffering
        EnsureOrdered = true, // this is the default, but let's be explicit
        CancellationToken = cts.Token, // have a way to abort the processing
    });
    var writerBlock = new ActionBlock<IList<string>>(chunk =>
    {
        foreach (var line in chunk)
        {
            writer.WriteLine(line);
        }
    }); // The default options are OK for this block

    // Link the blocks and propagate completion
    processingBlock.LinkTo(writerBlock,
        new DataflowLinkOptions() { PropagateCompletion = true });

    // In case the writer block fails, the processing block must be canceled
    OnFaultedCancel(writerBlock, cts);
    static async void OnFaultedCancel(IDataflowBlock block, CancellationTokenSource cts)
    {
        try
        {
            await block.Completion.ConfigureAwait(false);
        }
        catch
        {
            cts.Cancel();
        }
    }

    // Feed the processing block with chunks from the source file
    await Task.Run(async () =>
    {
        try
        {
            var chunks = File.ReadLines(sourcePath, sourceEncoding)
                .Buffer(chunkSize);
            foreach (var chunk in chunks)
            {
                var sent = await processingBlock.SendAsync(chunk, cts.Token)
                    .ConfigureAwait(false);
                if (!sent) break; // Happens in case of a processing failure
            }
            processingBlock.Complete();
        }
        catch (OperationCanceledException)
        {
            processingBlock.Complete(); // Cancellation is not an error
        }
        catch (Exception ex)
        {
            // Reading error
            // Propagate by completing the processing block in a faulted state
            ((IDataflowBlock)processingBlock).Fault(ex);
        }
    }).ConfigureAwait(false);

    // All possible exceptions have been propagated to the writer block
    await writerBlock.Completion.ConfigureAwait(false);
}
This method uses the new C# 8 syntax for the using statement. If you are using an earlier C# version you'll have to add curly brackets and indentation. It also uses a static local function (also C# 8 syntax) that you may have to move to the outer scope.
In case of an exception in the lineTransformation function, the output file will remain incomplete, and it may also contain processed lines that come after the faulted line. So in case of an exception, make sure not to use the output file. You could also include conditional File.Delete logic inside the ProcessFile method if you want.
Below is a usage example of this method. It asynchronously transforms a large text file to uppercase:
await ProcessFile("Source.txt", Encoding.UTF8, "Target.txt", Encoding.UTF8, line =>
{
    return line.ToUpper();
}, degreeOfParallelism: 3, chunkSize: 100);
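And if you want to discard a partial output on failure, one way to apply the conditional File.Delete suggestion from the caller's side (a sketch, not part of the original answer):

try
{
    await ProcessFile("Source.txt", Encoding.UTF8, "Target.txt", Encoding.UTF8,
        line => line.ToUpper(), degreeOfParallelism: 3, chunkSize: 100);
}
catch
{
    // The output file may be incomplete, or contain processed lines that
    // come after the faulted line, so don't leave it behind.
    if (File.Exists("Target.txt")) File.Delete("Target.txt");
    throw;
}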
A known flaw of the ProcessFile method is that it fakes asynchrony by wrapping the synchronous File.ReadLines method in a Task.Run. Unfortunately there is currently no efficient built-in method for reading the lines of a text file asynchronously, in either .NET Framework or .NET Core.
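For reference, a genuinely asynchronous line reader can be sketched with StreamReader.ReadLineAsync (a real API), at the cost of one await per line, which is relatively expensive for files with many short lines; this sketch is not part of the original answer:

using System.Collections.Generic;
using System.IO;
using System.Text;

static async IAsyncEnumerable<string> ReadLinesAsync(string path, Encoding encoding)
{
    using var reader = new StreamReader(path, encoding);
    // ReadLineAsync returns null at end-of-file, which ends the loop
    while (await reader.ReadLineAsync().ConfigureAwait(false) is string line)
    {
        yield return line;
    }
}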
Answer 3:
You should use Microsoft's Reactive Framework (aka Rx) - NuGet System.Reactive - and add using System.Reactive.Linq; then you can do this:
IDisposable subscription =
    File
        .ReadLines("Huge File.txt")
        .ToObservable()
        .Buffer(200)
        .Select((lines, index) => new { lines, index })
        .SelectMany(lis => Observable.Start(() => new { lis.index, output = ProcessChunk(lis.lines) }))
        .ToArray()
        .Select(xs => xs.OrderBy(x => x.index).SelectMany(x => x.output))
        .Subscribe(xs =>
        {
            File.WriteAllLines("Output File.txt", xs.ToArray());
        });
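The snippet assumes a ProcessChunk method containing the CPU-bound per-batch work, which the answer leaves to the reader. A trivial hypothetical stand-in:

// Hypothetical placeholder: receives a buffered batch of lines and
// returns the processed lines (here just uppercased).
static IEnumerable<string> ProcessChunk(IList<string> lines)
{
    return lines.Select(line => line.ToUpperInvariant()).ToList();
}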
That's processing lines 200 at a time, in parallel.
Keep in mind that IO is so much slower than CPU processing that unless ProcessChunk is very CPU-intensive, any multithreading approach may not improve performance - in fact it might slow things down.
Source: https://stackoverflow.com/questions/60403335/how-to-process-data-from-a-file-in-parallel-in-several-threads-and-write-them-in