Powershell question - Looking for fastest method to loop through 500k objects looking for a match in another 500k object array

≯℡__Kan透↙ submitted on 2021-02-11 15:33:08

Question


I have two large .csv files that I've imported using the Import-Csv cmdlet. I've done a lot of searching and experimenting, and am finally posting to ask for some help to make this easier.

I need to walk through the first array, which will have anywhere from 80k to 500k rows. Each object in these arrays has multiple properties, and for each one I need to find the corresponding entry in a second array of the same size, matching on one of those properties.

I'm importing them as [System.Collections.ArrayList] and I've tried storing them in hashtables too. I have even tried to muck around with LINQ, which was mentioned in several other posts.

Any chance anyone can offer advice or insight how to make this run faster? It feels like I'm looking in one haystack for matching hay in a different stack.

$ImportTime1 = Measure-Command {
    [System.Collections.ArrayList]$fileList1 = Import-csv file1.csv
    [System.Collections.ArrayList]$fileSorted1 = ($fileList1 | Sort-Object -property 'Property1' -Unique -Descending)
    Remove-Variable fileList1
}

$ImportTime2 = Measure-Command {
    [System.Collections.ArrayList]$fileList2 = Import-csv file2.csv
    [System.Collections.ArrayList]$fileSorted2 = ($fileList2 | Sort-Object -property 'Property1' -Unique -Descending)
    Remove-Variable fileList2
}

$fileSorted1.foreach({
     $variable1 = $_
     # Scans the whole second list for every row of the first list (the slow part)
     $target = $fileSorted2.where({$_.Property1 -eq $variable1.Property1})
     ###do some other stuff
})

Answer 1:


This may be of use: https://powershell.org/forums/topic/comparing-two-multi-dimensional-arrays/

Use the updated solution from comment #27359, with the change suggested by Max Kozlov in comment #27380 applied:

Function RJ-CombinedCompare() {
    [CmdletBinding()]
    PARAM(
        [Parameter(Mandatory=$True)]$List1,
        [Parameter(Mandatory=$True)]$L1Match,
        [Parameter(Mandatory=$True)]$List2,
        [Parameter(Mandatory=$True)]$L2Match
    )
    $hash = @{}
    foreach ($data in $List1) {$hash[$data.$L1Match] += ,[pscustomobject]@{Owner=1;Value=$($data)}}
    foreach ($data in $List2) {$hash[$data.$L2Match] += ,[pscustomobject]@{Owner=2;Value=$($data)}}
    foreach ($kv in $hash.GetEnumerator()) {
        $m1, $m2 = $kv.Value.where({$_.Owner -eq 1}, 'Split')
        [PSCustomObject]@{
            MatchValue = $kv.Key
            L1Matches = $m1.Count
            L2Matches = $m2.Count
            L1MatchObject = $L1Match
            L2MatchObject = $L2Match
            List1 = $m1.Value
            List2 = $m2.Value
        }
    }
}

$fileList1 = Import-csv file1.csv
$fileList2 = Import-csv file2.csv

# Replace the quoted placeholders with the names of the columns to match on
$newList = RJ-CombinedCompare -List1 $fileList1 -L1Match 'yourcolumnhere' -List2 $fileList2 -L2Match 'yourothercolumnhere'

foreach ($item in $newList) {
    # your logic here
}

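Loading both lists into the hashtable takes a single pass over each list, and every match is then found by key rather than by scanning the other list, so the total work grows linearly with the number of rows instead of quadratically. Iterating through the result is fast as well.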
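If you only need the one matching row from the second file for each row in the first, a plain hashtable index may be enough. The following is a minimal sketch, not the function above: the sample rows and the Name and Size columns are made up for illustration and stand in for the Import-Csv output, and Property1 is assumed to be the column both files share.

```powershell
# Hypothetical sample rows standing in for Import-Csv output.
$fileList1 = @(
    [pscustomobject]@{ Property1 = 'A'; Name = 'one' }
    [pscustomobject]@{ Property1 = 'B'; Name = 'two' }
    [pscustomobject]@{ Property1 = 'C'; Name = 'three' }
)
$fileList2 = @(
    [pscustomobject]@{ Property1 = 'B'; Size = 20 }
    [pscustomobject]@{ Property1 = 'C'; Size = 30 }
    [pscustomobject]@{ Property1 = 'D'; Size = 40 }
)

# Build the index once: a single pass over the second list.
# @{} compares string keys case-insensitively; if a key repeats, the last row wins.
$index = @{}
foreach ($row in $fileList2) {
    $index[$row.Property1] = $row
}

# Single pass over the first list; each lookup is a hash probe, not a scan.
$matched = foreach ($row in $fileList1) {
    $target = $index[$row.Property1]
    if ($null -ne $target) {
        [pscustomobject]@{
            Key  = $row.Property1
            Name = $row.Name
            Size = $target.Size
        }
    }
}

$matched | Format-Table
```

Unlike RJ-CombinedCompare, this sketch keeps only one row per key from the second list, so it fits best when the match column is unique there (as it would be after Sort-Object -Unique in the question).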



Source: https://stackoverflow.com/questions/59889903/powershell-question-looking-for-fastest-method-to-loop-through-500k-objects-lo
