PowerShell for Matching and Replacing Partially Matching Patterns

Submitted by 被刻印的时光 ゝ on 2021-02-08 06:39:38

Question


I've been going crazy all week, unable to solve this issue. I have a dictionary word file that will eventually hold a few million words; for now, let's assume it's just a text file, "Words.txt", which contains:

App
Apple
Application
Bar
Bat
Batter
Cap
Capital
Candy

What I need is to match each string against the rest of the file and write only the first hit to the output. The list will be alphabetical.

For example, the desired output from the words above would be:

App - due to pattern "App" being seen first and skips "Apple" and "Application"
Bar - due to pattern "Bar", unique
Bat - due to pattern "Bat" being seen first and skips "Batter"
Cap - due to pattern "Cap" being seen first and skips "Capital"
Candy - due to pattern "Candy", unique

What I absolutely cannot figure out is how to ignore matches that occur after the initial hit and move on to a 'new' pattern. It would be OK if the other redundant patterns were overwritten or just skipped; it doesn't matter how.

I have a script to match patterns, but I don't know how to end up with the desired output :( Any help?


$Words = "C:\Words.txt"

# Read all words into memory
[System.Collections.ArrayList]$WordList = Get-Content $Words

$i = 0
# Loop over the word list (not over $Words, which only holds the file path)
foreach ($item in $WordList)
{
    $r = 0  # reset the inner index for every outer word
    foreach ($item2 in $WordList)
    {
        # Print every word that starts with the current word
        if ($item2 -like "$item*")
        {
            Write-Host "Match $i $item $r $item2"
        }

        $r++
    }
    $i++
}

Answer 1:


It's sufficient to process the lines one by one and compare them to the most recent unique prefix:

$prefix = '' # initialize the prefix pattern
foreach ($line in [IO.File]::ReadLines('C:\Words.txt')) {
  if ($line -like $prefix) { continue } # same prefix, skip
  $line               # output new unique prefix
  $prefix = "$line*"  # save new prefix pattern
}
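
To see the loop above at work on the question's sample words, here is a minimal in-memory sketch. The function name Get-UniquePrefix is hypothetical, and the sketch swaps -like for String.StartsWith, on the assumption that words might contain wildcard characters such as [ or ? that -like would interpret; it also skips empty lines, which would otherwise act as a prefix of everything.

```powershell
# Hypothetical helper name; a variant of the loop above, not the answer's exact code.
function Get-UniquePrefix {
  param([string[]] $Words)

  $prefix = $null   # most recent unique prefix, none yet
  foreach ($word in $Words) {
    # Assumption: empty entries carry no information, so ignore them.
    if ([string]::IsNullOrEmpty($word)) { continue }
    # Literal, case-insensitive prefix test (no wildcard interpretation).
    if ($null -ne $prefix -and
        $word.StartsWith($prefix, [StringComparison]::OrdinalIgnoreCase)) { continue }
    $word            # output new unique prefix
    $prefix = $word  # remember it for the following words
  }
}

$sample = 'App', 'Apple', 'Application', 'Bar', 'Bat', 'Batter', 'Cap', 'Capital', 'Candy'
$result = Get-UniquePrefix -Words $sample
$result   # App, Bar, Bat, Cap, Candy (one per line)
```

Because only the most recent prefix is remembered, this relies on the input being ordered so that a word comes right before the words it prefixes, as in the sample.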

Note: Since you mention the input file being large, I'm using System.IO.File.ReadLines rather than Get-Content to read the file, for superior performance.

Note: Your sample input path is a full path anyway, but be sure to always pass full paths to .NET methods, because .NET's working directory usually differs from PowerShell's.
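
If you do start from a relative path, one way to follow this advice is to resolve it with Convert-Path before handing it to .NET. The following is a small sketch; the temp directory and the file name Words.txt inside it are assumptions chosen only so that the example runs anywhere.

```powershell
# Assumption: work in the temp directory so the sketch does not touch real data.
Push-Location -LiteralPath ([IO.Path]::GetTempPath())
try {
  # Create a tiny sample file via a relative path (resolved by PowerShell).
  'App', 'Apple' | Set-Content -LiteralPath 'Words.txt'

  # Resolve the relative path against PowerShell's current location.
  $fullPath = Convert-Path -LiteralPath 'Words.txt'

  # The .NET call now receives a full path, independent of .NET's working directory.
  $count = 0
  foreach ($line in [IO.File]::ReadLines($fullPath)) { $count++ }
}
finally {
  Pop-Location
}
```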

If you wrap the foreach loop in & { ... }, you can pipe the result in streaming fashion (line by line, without collecting all results in memory first) to Set-Content.
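
A sketch of that wrapping could look as follows. The input and output file names in the temp directory are hypothetical, and the sample file is created only to make the example self-contained.

```powershell
# Hypothetical paths in the temp directory, so the sketch runs anywhere.
$inPath  = Join-Path ([IO.Path]::GetTempPath()) 'Words.txt'
$outPath = Join-Path ([IO.Path]::GetTempPath()) 'UniqueWords.txt'
'App', 'Apple', 'Application', 'Bar', 'Bat', 'Batter', 'Cap', 'Capital', 'Candy' |
  Set-Content -LiteralPath $inPath

# & { ... } turns the foreach statement into a command whose output streams down the pipeline.
& {
  $prefix = ''
  foreach ($line in [IO.File]::ReadLines($inPath)) {
    if ($line -like $prefix) { continue }  # same prefix, skip
    $line                                  # output new unique prefix
    $prefix = "$line*"                     # save new prefix pattern
  }
} | Set-Content -LiteralPath $outPath
```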

However, using a .NET type for saving as well will perform much better - see the bottom section of this answer.
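
As a rough illustration of saving with a .NET type, the sketch below writes through System.IO.StreamWriter. It is not the code from the referenced answer; the file names in the temp directory are hypothetical, and the sample input is created only so the example is self-contained.

```powershell
# Hypothetical paths in the temp directory, so the sketch runs anywhere.
$inPath  = Join-Path ([IO.Path]::GetTempPath()) 'Words.txt'
$outPath = Join-Path ([IO.Path]::GetTempPath()) 'UniqueWords.txt'
'App', 'Apple', 'Application', 'Bar', 'Bat', 'Batter', 'Cap', 'Capital', 'Candy' |
  Set-Content -LiteralPath $inPath

# StreamWriter writes UTF-8 without a BOM by default and overwrites an existing file.
$writer = New-Object System.IO.StreamWriter $outPath
try {
  $prefix = ''
  foreach ($line in [IO.File]::ReadLines($inPath)) {
    if ($line -like $prefix) { continue }  # same prefix, skip
    $writer.WriteLine($line)               # write new unique prefix
    $prefix = "$line*"                     # save new prefix pattern
  }
}
finally {
  $writer.Dispose()                        # flush and release the file handle
}
```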



Source: https://stackoverflow.com/questions/61304912/powershell-for-matching-and-replacing-partially-matching-patterns
