Question
I have millions of log files generated every day, and I need to read all of them and merge them into a single file so that another app can process it.
I'm looking for the fastest way to do this. Currently I'm using threads, tasks and Parallel like this:
Parallel.For(0, files.Length, new ParallelOptions { MaxDegreeOfParallelism = 100 }, i =>
{
    ReadFiles(files[i]);
});

void ReadFiles(string file)
{
    try
    {
        var txt = File.ReadAllText(file);
        filesTxt.Add(txt);
    }
    catch { }
    GlobalCls.ThreadNo--;
}
or
foreach (var file in files)
{
    //Int64 index = i;
    //var file = files[index];
    while (Process.GetCurrentProcess().Threads.Count > 100)
    {
        Thread.Sleep(100);
        Application.DoEvents();
    }
    new Thread(() => ReadFiles(file)).Start();
    GlobalCls.ThreadNo++;
    // Task.Run(() => ReadFiles(file));
}
The problem is that after reading a few thousand files, the reading gets slower and slower!
Any idea why? And what is the fastest approach to reading millions of small files? Thank you.
Answer 1:
When it comes to IO operations, CPU parallelism is useless. Your IO device (disk, network, whatever) is your bottleneck. By reading from the device concurrently, you risk lowering your performance even further.
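To illustrate the point, here is a minimal sketch of a purely sequential, stream-based merge with no parallelism at all. The source folder and target file paths are placeholders, not taken from the question:

// A minimal sketch of a sequential, stream-based merge (no parallelism).
// The source folder and target file paths below are placeholders.
using (var output = new FileStream(@"C:\SourceFolder\AllLogs.log", FileMode.Create,
    FileAccess.Write, FileShare.None, bufferSize: 1 << 16))
{
    var files = Directory.EnumerateFiles(@"C:\SourceFolder", "*.log",
        SearchOption.AllDirectories);
    foreach (var file in files)
    {
        using (var input = File.OpenRead(file))
        {
            input.CopyTo(output); // Copy each small file straight into the target
        }
    }
}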
Answer 2:
It seems that you are loading the contents of all the files into memory before writing them back to the single file. This could explain why the process becomes slower over time.
A way to optimize the process is to separate the reading part from the writing part, and do them in parallel. This is called the producer-consumer pattern. It can be implemented with the Parallel class, or with threads, or with tasks, but I will demonstrate instead an implementation based on the powerful TPL Dataflow library, which is particularly suited for jobs like this.
private static async Task MergeFiles(IEnumerable<string> sourceFilePaths,
    string targetFilePath, CancellationToken cancellationToken = default,
    IProgress<int> progress = null)
{
    var readerBlock = new TransformBlock<string, string>(async filePath =>
    {
        return File.ReadAllText(filePath); // Read the small file
    }, new ExecutionDataflowBlockOptions()
    {
        MaxDegreeOfParallelism = 2, // Reading is parallelizable
        BoundedCapacity = 100, // No more than 100 file-paths buffered
        CancellationToken = cancellationToken, // Cancel at any time
    });

    StreamWriter streamWriter = null;

    int filesProcessed = 0;
    var writerBlock = new ActionBlock<string>(text =>
    {
        streamWriter.Write(text); // Append to the target file
        filesProcessed++;
        if (filesProcessed % 10 == 0) progress?.Report(filesProcessed);
    }, new ExecutionDataflowBlockOptions()
    {
        MaxDegreeOfParallelism = 1, // We can't parallelize the writer
        BoundedCapacity = 100, // No more than 100 file-contents buffered
        CancellationToken = cancellationToken, // Cancel at any time
    });

    readerBlock.LinkTo(writerBlock,
        new DataflowLinkOptions() { PropagateCompletion = true });

    // This is a tricky part. We use BoundedCapacity, so we must propagate manually
    // a possible failure of the writer to the reader, otherwise a deadlock may occur.
    PropagateFailure(writerBlock, readerBlock);

    // Open the output stream
    using (streamWriter = new StreamWriter(targetFilePath))
    {
        // Feed the reader with the file paths
        foreach (var filePath in sourceFilePaths)
        {
            var accepted = await readerBlock.SendAsync(filePath,
                cancellationToken); // Cancel at any time
            if (!accepted) break; // This will happen if the reader fails
        }
        readerBlock.Complete();

        await writerBlock.Completion;
    }

    async void PropagateFailure(IDataflowBlock block1, IDataflowBlock block2)
    {
        try { await block1.Completion.ConfigureAwait(false); }
        catch (Exception ex)
        {
            if (block1.Completion.IsCanceled) return; // On cancellation do nothing
            block2.Fault(ex);
        }
    }
}
Usage example:
var cts = new CancellationTokenSource();
var progress = new Progress<int>(value =>
{
    // Safe to update the UI
    Console.WriteLine($"Files processed: {value:#,0}");
});
var sourceFilePaths = Directory.EnumerateFiles(@"C:\SourceFolder", "*.log",
    SearchOption.AllDirectories); // Include subdirectories
await MergeFiles(sourceFilePaths, @"C:\AllLogs.log", cts.Token, progress);
The BoundedCapacity is used to keep the memory usage under control.
If the disk drive is SSD, you can try reading with a MaxDegreeOfParallelism larger than 2.
For best performance you could consider writing to a different disk drive than the drive containing the source files.
The TPL Dataflow library is available as a package for .NET Framework, and is built-in for .NET Core.
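For reference, these are the using directives that the MergeFiles method above depends on; on .NET Framework the dataflow types come from the System.Threading.Tasks.Dataflow NuGet package:

using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // TransformBlock, ActionBlock, etc.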
Answer 3:
Perhaps you can just use PowerShell to concatenate the files, such as in this answer.
Another alternative is to write a program that uses the FileSystemWatcher class to watch for new files and append them as they are created.
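A rough sketch of that idea, with placeholder paths and without the retry logic a real watcher would need for files that are still being written:

// Watch a folder for new .log files and append each one to a single target file.
// The paths are placeholders; a real implementation should retry, because a newly
// created file may still be locked by the process that is writing it.
var watcher = new FileSystemWatcher(@"C:\SourceFolder", "*.log")
{
    IncludeSubdirectories = true
};
watcher.Created += (sender, e) =>
{
    var text = File.ReadAllText(e.FullPath);
    File.AppendAllText(@"C:\AllLogs.log", text);
};
watcher.EnableRaisingEvents = true; // Start watching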
Source: https://stackoverflow.com/questions/58151529/reading-millions-of-small-files-with-c-sharp