PowerShell Speed: How to speed up ForEach-Object MD5/hash check


Question


I'm running the following MD5 check on 500 million files to check for duplicates. The script takes forever to run, and I was wondering how to speed it up. Could I use a try/catch block instead of Contains, so that an error is thrown when the hash already exists? What would you all recommend?

$folder = Read-Host -Prompt 'Enter a folder path'

$hash = @{}        # MD5 hash -> first file path seen with that hash
$lineCheck = 0

Get-ChildItem $folder -Recurse | Where-Object {! $_.PSIsContainer} | ForEach-Object {
    $lineCheck++
    Write-Host $lineCheck
    $tempMD5 = (Get-FileHash -LiteralPath $_.FullName -Algorithm MD5).Hash

    if (! $hash.Contains($tempMD5)) {
        $hash.Add($tempMD5, $_.FullName)
    }
    else {
        # Hash seen before: this file is a duplicate
        Remove-Item -LiteralPath $_.FullName
    }
}

Answer 1:


As suggested in the comments, you might consider hashing a file only once another file with the same length has been found. That way you never invoke the expensive hash method for any file whose length is unique.

Note that the Write-Host command is quite expensive by itself, so I would not display every iteration (Write-Host $lineCheck) but only, e.g., when a match is found.
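
A rough sketch of that overhead (illustrative only; absolute timings depend on the host):

# Same loop body with and without a per-iteration Write-Host
(Measure-Command { 1..1000 | ForEach-Object { Write-Host $_ } }).TotalMilliseconds
(Measure-Command { 1..1000 | ForEach-Object { $_ } | Out-Null }).TotalMilliseconds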

$Folder = Read-Host -Prompt 'Enter a folder path'

$FilesBySize = @{}   # file length -> list of paths with that length
$FilesByHash = @{}   # MD5 hash    -> list of paths with that hash

Function MatchHash([String]$FullName) {
    $Hash = (Get-FileHash -LiteralPath $FullName -Algorithm MD5).Hash
    $Found = $FilesByHash.Contains($Hash)
    If ($Found) {$Null = $FilesByHash[$Hash].Add($FullName)}
    Else {$FilesByHash[$Hash] = [System.Collections.ArrayList]@($FullName)}
    $Found
}

Get-ChildItem $Folder -Recurse | Where-Object -Not PSIsContainer | ForEach-Object {
    $Files = $FilesBySize[$_.Length]
    If ($Files) {
        # A second file with this length: hash the first one (deferred), then this one
        If ($Files.Count -eq 1) {$Null = MatchHash $Files[0]}
        If ($Files.Count -ge 1) {If (MatchHash $_.FullName) {Write-Host 'Found match:' $_.FullName}}
        $Null = $FilesBySize[$_.Length].Add($_.FullName)
    } Else {
        # First file with this length: just record it, no hashing yet
        $FilesBySize[$_.Length] = [System.Collections.ArrayList]@($_.FullName)
    }
}

Display the found duplicates:

ForEach($Hash in $FilesByHash.GetEnumerator()) {
    If ($Hash.Value.Count -gt 1) {
        Write-Host 'Hash:' $Hash.Name
        ForEach ($File in $Hash.Value) {
            Write-Host 'File:' $File
        }
    }
}



Answer 2:


I'd guess that the slowest part of your code is the Get-FileHash invocation, since everything else is either not computationally intensive or limited by your hardware (disk IOPS).
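
You can verify that assumption before optimizing. A minimal sketch, assuming $folder is set as in the question:

# Time hashing a small sample of files to confirm where the time goes
$sample = Get-ChildItem $folder -Recurse -File | Select-Object -First 100
(Measure-Command {
    $sample | ForEach-Object { Get-FileHash -LiteralPath $_.FullName -Algorithm MD5 }
}).TotalSeconds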

You could try replacing it with an invocation of a native tool that has a more optimized MD5 implementation and see if that helps.
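
For example, certutil.exe ships with Windows and can compute MD5 hashes. A minimal sketch of swapping it in, assuming it runs inside the same ForEach-Object block and that certutil's usual output format (header line, hash line, status line) applies:

# Sketch: replace Get-FileHash with the built-in certutil.exe
# Older certutil versions put spaces between the hex bytes, so strip all whitespace
$out = certutil.exe -hashfile $_.FullName MD5
$tempMD5 = ($out[1] -replace '\s', '').ToUpperInvariant()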

Could I use a try/catch block instead of Contains, so that an error is thrown when the hash already exists?

Exceptions are slow and using them for flow control is not recommended:

  • DA0007: Avoid using exceptions for control flow

While the use of exception handlers to catch errors and other events that disrupt program execution is a good practice, the use of exception handlers as part of the regular program execution logic can be expensive and should be avoided.

  • https://stackoverflow.com/a/162027/4424236

There is the definitive answer to this from the guy who implemented them, Chris Brumme. He wrote an excellent blog article about the subject (warning: it's very long) (warning 2: it's very well written; if you're a techie you'll read it to the end and then have to make up your hours after work :) )

The executive summary: they are slow. They are implemented as Win32 SEH exceptions, so some will even pass the ring 0 CPU boundary!
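
A rough PowerShell illustration of the gap (numbers vary by machine; the relative difference is the point):

# Lookup-based flow control: test the hashtable before acting
$seen = @{ 'abc' = $true }
(Measure-Command {
    foreach ($i in 1..10000) { if ($seen.Contains('abc')) { } }
}).TotalMilliseconds

# Exception-based flow control: throw and catch on every "duplicate"
(Measure-Command {
    foreach ($i in 1..10000) { try { throw 'already exists' } catch { } }
}).TotalMilliseconds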




Answer 3:


I know this is a PowerShell question, but you can make good use of parallelization in C#. You also mentioned in one of the comments that C# could be an alternative, so I thought it wouldn't hurt to post a possible implementation of how it could be done.

You could first create a method to calculate the MD5 checksum for a file:

private static string CalculateMD5(string filename)
{
    using var md5 = MD5.Create();
    using var stream = File.OpenRead(filename);
    var hash = md5.ComputeHash(stream);
    return BitConverter.ToString(hash).Replace("-", string.Empty).ToLowerInvariant();
}

Then you could make a method that queries all file hashes in parallel using ParallelEnumerable.AsParallel():

private static IEnumerable<FileHash> FindFileHashes(string directoryPath)
{
    var allFiles = Directory
        .EnumerateFiles(directoryPath, "*", SearchOption.AllDirectories);

    var hashedFiles = allFiles
        .AsParallel()
        .Select(filename => new FileHash { 
            FileName = filename, 
            Hash = CalculateMD5(filename) 
        });

    return hashedFiles;
}

Then you can simply use the above method to delete duplicate files:

private static void DeleteDuplicateFiles(string directoryPath)
{
    var fileHashes = new HashSet<string>();

    foreach (var fileHash in FindFileHashes(directoryPath))
    {
        if (!fileHashes.Contains(fileHash.Hash))
        {
            Console.WriteLine($"Found - File : {fileHash.FileName} Hash : {fileHash.Hash}");
            fileHashes.Add(fileHash.Hash);
            continue;
        }

        Console.WriteLine($"Deleting - File : {fileHash.FileName} Hash : {fileHash.Hash}");
        File.Delete(fileHash.FileName);
    }
}

Full Program:

using System;
using System.Collections.Generic;
using System.Linq;
using System.IO;
using System.Security.Cryptography;

namespace Test
{
    internal class FileHash
    {
        public string FileName { get; set; }
        public string Hash { get; set; }
    }

    public class Program
    {
        public static void Main()
        { 
            var path = @"C:\Path\To\Files";
            if (Directory.Exists(path))
            {
                Console.WriteLine($"Deleting duplicate files at {path}");
                DeleteDuplicateFiles(path);
            }
        }

        private static void DeleteDuplicateFiles(string directoryPath)
        {
            var fileHashes = new HashSet<string>();

            foreach (var fileHash in FindFileHashes(directoryPath))
            {
                if (!fileHashes.Contains(fileHash.Hash))
                {
                    Console.WriteLine($"Found - File : {fileHash.FileName} Hash : {fileHash.Hash}");
                    fileHashes.Add(fileHash.Hash);
                    continue;
                }

                Console.WriteLine($"Deleting - File : {fileHash.FileName} Hash : {fileHash.Hash}");
                File.Delete(fileHash.FileName);
            }
        }

        private static IEnumerable<FileHash> FindFileHashes(string directoryPath)
        {
            var allFiles = Directory
                .EnumerateFiles(directoryPath, "*", SearchOption.AllDirectories);

            var hashedFiles = allFiles
                .AsParallel()
                .Select(filename => new FileHash { 
                    FileName = filename, 
                    Hash = CalculateMD5(filename) 
                });

            return hashedFiles;
        }

        private static string CalculateMD5(string filename)
        {
            using var md5 = MD5.Create();
            using var stream = File.OpenRead(filename);
            var hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", string.Empty).ToLowerInvariant();
        }
    }
}



Answer 4:


If you're trying to find duplicates, the fastest way is to use a dedicated tool such as jdupes or fdupes. These are incredibly performant and written in C.
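
A usage sketch, assuming jdupes is installed and supports the common fdupes-style flags (-r recurse, -d delete, -N no prompt); check your version's help first:

jdupes -r C:\Path\To\Files        # list duplicate sets recursively
jdupes -r -d -N C:\Path\To\Files  # keep one file per set, delete the rest without prompting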



Source: https://stackoverflow.com/questions/59914704/powershell-speed-how-to-speed-up-foreach-object-md5-hash-check
