How can I make this PowerShell script parse large files faster?

名媛妹妹  2020-12-01 12:58

I have the following PowerShell script that parses some very large files for ETL purposes. For starters, my test file is ~30 MB. Larger files around 200 MB are

3 Answers
  •  天涯浪人  2020-12-01 13:33

    Your script reads one line at a time (slow!) and stores almost the entire file in memory (big!).
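
    (The original script isn't shown in the question excerpt above, but a hypothetical sketch of the pattern being described, assuming a line-by-line read with results accumulated in an array, looks something like this:)

    # Hypothetical reconstruction, NOT the OP's actual code: Get-Content
    # without -ReadCount sends one string down the pipeline per line, and
    # += on an array copies the whole array on every append.
    $results = @()
    foreach ($line in (Get-Content $path\$infile)) {
        if ($line -match '^\|.+\|.+\|.+') {
            $results += ($line -replace '^\|(.+)\|$', '$1')
        }
    }
    $results | Set-Content $path\$outfile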

    Try this (not tested extensively):

    # Source directory and input/output file names
    $path = "E:\Documents\Projects\ESPS\Dev\DataFiles\DimProductionOrderOperation"
    $infile = "14SEP11_ProdOrderOperations.txt"
    $outfile = "PROCESSED_14SEP11_ProdOrderOperations.txt"
    
    # Lines to read per batch; raise for speed, lower for less memory
    $batch = 1000
    
    # Data lines: at least three pipe-delimited fields
    [regex]$match_regex = '^\|.+\|.+\|.+'
    # Captures everything between the leading and trailing pipes
    [regex]$replace_regex = '^\|(.+)\|$'
    
    # -List stops at the first match, so this returns just the header line
    $header_line = (Select-String -Path $path\$infile -Pattern $match_regex -List).Line
    
    # Escape the header text so it can be excluded as a literal match below
    [regex]$header_regex = [regex]::Escape($header_line)
    
    # Start the output file with the header, minus the outer pipes
    $header_line.Trim('|') | Set-Content $path\$outfile
    
    # -ReadCount emits arrays of $batch lines, and the operators below act
    # on each whole array at once: keep data lines, drop the repeated
    # header, strip the outer pipes, then append the batch to the output.
    Get-Content $path\$infile -ReadCount $batch |
        ForEach {
            $_ -match $match_regex -notmatch $header_regex -replace $replace_regex, '$1' |
                Out-File $path\$outfile -Append
        }
    

    That's a compromise between memory usage and speed. The -match and -replace operators work on arrays, so you can filter and replace an entire array at once without having to foreach through every record. -ReadCount causes the file to be read in chunks of $batch records, so you're basically reading in 1000 records at a time, doing the match and replace on that batch, then appending the result to your output file. Then it goes back for the next 1000 records. Increasing $batch should speed it up, but it will use more memory; adjust it to suit your resources.
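
    To make the array behavior concrete, here's a minimal sketch (hypothetical sample data, not from the question) of how the chained operators filter and transform a whole batch in one pass:

    # A batch as -ReadCount would deliver it: an array of strings
    $lines = '|Plant|Order|Op|', 'garbage line', '|1000|12345|0010|'
    
    # Against an array, -match RETURNS the matching elements (it filters)
    $lines -match '^\|.+\|.+\|.+'
    # -> |Plant|Order|Op|
    #    |1000|12345|0010|
    
    # Operators chain left to right: filter, then strip the outer pipes
    $lines -match '^\|.+\|.+\|.+' -replace '^\|(.+)\|$', '$1'
    # -> Plant|Order|Op
    #    1000|12345|0010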
