Dataflow with splitting work to small jobs and then group again

后端 未结 3 1082
终归单人心
终归单人心 2020-12-31 09:36

I need to do this kind of work:

  1. Get Page object from database
  2. For each page get all images and process them (IO bound, for example, upload to CDN)
3条回答
  •  萌比男神i
    2020-12-31 09:49

    The .NET platform has a nice interface that can represent parent-child relationships, the IGrouping interface. It is simply an IEnumerable that also has a Key property. The key can be anything, and in this case in could be the Page that needs to be processed. The contents of the grouping could be the Images that belong to each page, and need to be uploaded. This leads to the idea of a dataflow block that can process IGrouping objects, by processing independently each TInput, then aggegate the results per grouping, and finally output them as IGrouping objects. Below is an implementation of this idea:

    public static TransformBlock, IGrouping>
        CreateTransformGroupingBlock(
            Func> transform,
            ExecutionDataflowBlockOptions dataflowBlockOptions = null)
    {
        if (transform == null) throw new ArgumentNullException(nameof(transform));
        dataflowBlockOptions ??= new ExecutionDataflowBlockOptions();
    
        var actionBlock = new ActionBlock>>(taskTask =>
        {
            // An exception thrown by the following line would cause buggy behavior.
            // According to the documentation it should never fail.
            taskTask.RunSynchronously();
            return taskTask.Unwrap();
        }, dataflowBlockOptions);
    
        var completionCTS = new CancellationTokenSource();
        _ = actionBlock.Completion
            .ContinueWith(_ => completionCTS.Cancel(), TaskScheduler.Default);
    
        var transformBlock = new TransformBlock,
            IGrouping>(async grouping =>
        {
            if (grouping == null) throw new InvalidOperationException("Null grouping.");
            var tasks = new List>();
            foreach (var item in grouping)
            {
                // Create a cold task that will be either executed by the actionBlock,
                // or will be canceled by the completionCTS. This should eliminate
                // any possibility that an awaited task will remain cold forever.
                var taskTask = new Task>(() => transform(grouping.Key, item),
                    completionCTS.Token);
                var accepted = await actionBlock.SendAsync(taskTask);
                if (!accepted)
                {
                    // The actionBlock has failed.
                    // Skip the rest of the items. Pending tasks should still be awaited.
                    tasks.Add(Task.FromCanceled(new CancellationToken(true)));
                    break;
                }
                tasks.Add(taskTask.Unwrap());
            }
            TOutput[] results = await Task.WhenAll(tasks);
            return results.GroupBy(_ => grouping.Key).Single(); // Convert to IGrouping
        }, dataflowBlockOptions);
    
        // Cleanup
        _ = transformBlock.Completion
            .ContinueWith(_ => actionBlock.Complete(), TaskScheduler.Default);
        _ = Task.WhenAll(actionBlock.Completion, transformBlock.Completion)
            .ContinueWith(_ => completionCTS.Dispose(), TaskScheduler.Default);
    
        return transformBlock;
    }
    
    // Overload with synchronous lambda
    public static TransformBlock, IGrouping>
        CreateTransformGroupingBlock(
            Func transform,
            ExecutionDataflowBlockOptions dataflowBlockOptions = null)
    {
        if (transform == null) throw new ArgumentNullException(nameof(transform));
        return CreateTransformGroupingBlock(
            (key, item) => Task.FromResult(transform(key, item)), dataflowBlockOptions);
    }
    

    This implementation consists of two blocks, a TransformBlock that processes the groupings and an internal ActionBlock that processes the individual items. Both are configured with the same user-supplied options. The TransformBlock sends to the ActionBlock the items to be processed one by one, then waits for the results, and finally constructs the output IGrouping with the following tricky line:

    return results.GroupBy(_ => grouping.Key).Single(); // Convert to IGrouping
    

    This compensates for the fact that currently there is no publicly available class that implements the IGrouping interface, in the .NET platform. The GroupBy+Single combo does the trick, but it has the limitation that it doesn't allow the creation of empty IGroupings. In case this is an issue, creating a class that implements this interface is always an option. Implementing one is quite straightforward (here is an example).

    Usage example of the CreateTransformGroupingBlock method:

    var processPages = new TransformBlock>(page =>
    {
        Image[] images = GetImagesFromDB(page);
        return images.GroupBy(_ => page).Single(); // Convert to IGrouping
    }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 });
    
    var uploadImages = CreateTransformGroupingBlock(async (page, image) =>
    {
        await UploadImage(image);
        return image;
    }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 });
    
    var savePages = new ActionBlock>(grouping =>
    {
        var page = grouping.Key;
        foreach (var image in grouping) SaveImageToDB(image, page);
    }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });
    
    processPages.LinkTo(uploadImages);
    uploadImages.LinkTo(savePages);
    

    The type of the uploadImages variable is TransformBlock, IGrouping>. In this example the types TInput and TOutput are the same, because the images need not to be transformed.

提交回复
热议问题