Advanced pattern matching in PowerShell

Submitted by 半腔热情 on 2020-05-27 05:07:32

Question


Hope you can help me with something. Thanks to @mklement0, I've gotten a great script matching the most basic, initial pattern for words in alphabetical order. However, what's missing is a full-text (substring) search and select. Here is an example of the current script's input, a small sample of words from a Words.txt file:

App
Apple
Apply
Sword
Swords
Word
Words

Becomes:

App
Sword
Word

This is great, as it really narrows each group down to a basic pattern per line! However, because it works line by line, there is still a pattern that can be narrowed down further, namely "Word" (capitalization is not important), so ideally the output should be:

App
Word

And "Sword" is removed as it falls in more basic pattern prefixed as "Word".

Would you have any suggestions on how to achieve this? Keep in mind this will be a dictionary list of about 250k words, so I do not know what I am looking for ahead of time.

CODE (from a related post, handles prefix matching only):

$outFile = [IO.File]::CreateText('C:\Temp\Results.txt')     # Output file location
$prefix = ''                   # Initialize the prefix pattern

foreach ($line in [IO.File]::ReadLines('C:\Temp\Words.txt')) {  # Input file name
  if ($line -like $prefix) {
    continue                   # Same prefix, skip
  }

  $line                        # Visual output of the new unique prefix
  $prefix = "$line*"           # Save the new prefix pattern
  $outFile.WriteLine($line)    # Write to the configured output file
}
$outFile.Close()               # Flush and close the output file
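For illustration, here is a hypothetical demo (not part of the script above) of the gap between prefix matching and the desired substring matching, using the -like operator directly:

'Sword' -like 'App*'    # False -> no earlier word is a prefix, so "Sword" survives
'Sword' -like 'Word*'   # False -> "Word" is not a prefix of "Sword" either
'Sword' -like '*Word*'  # True  -> but "Word" is a substring, which should remove "Sword"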

Answer 1:


You can try a two-step approach:

  • Step 1: Find the list of unique prefixes in the alphabetically sorted word list. This is done by reading the lines sequentially, and therefore only requires holding the unique prefixes, not the entire word list, in memory.

  • Step 2: Sort the resulting prefixes by length in ascending order and iterate over them, checking in each iteration whether the word at hand already contains one of the result-list words as a substring.

    • The result list starts out empty, and whenever the word at hand has no substring in the result list, it is appended to the list.

    • The result list is implemented as a regular expression with alternation (|), to enable matching against all already-found unique words in a single operation, as illustrated below.
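For instance, here is a hypothetical illustration of the alternation test used in the script below:

'Word'  -match 'App'        # False -> 'Word' is new; the regex grows to 'App|Word'
'Sword' -match 'App|Word'   # True  -> 'Sword' is already represented, so it is skipped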

You'll have to see if the performance is good enough; for best performance, .NET types are used directly as much as possible.

# Read the input file and build the list of unique prefixes, assuming
# alphabetical sorting.
$inFilePath = 'C:\Temp\Words.txt' # Be sure to use a full path.
$prefix = ''                      # Initialize the prefix pattern.
$uniquePrefixWords = 
  foreach ($word in [IO.File]::ReadLines($inFilePath)) {
    if ($word -like $prefix) { continue }
    $word
    $prefix = "$word*"
  }

# Sort the prefixes by length in ascending order (shorter ones first).
# Note: This is a more time- and space-efficient alternative to:
#    $uniquePrefixWords = $uniquePrefixWords | Sort-Object -Property Length
[Array]::Sort($uniquePrefixWords.ForEach('Length'), $uniquePrefixWords)

# Build the result list of unique shortest words with the help of a regex.
# Skip later - and therefore longer - words if they are already represented
# in the result list of words by a substring.
$regexUniqueWords = ''; $first = $true
foreach ($word in $uniquePrefixWords) {
  if ($first) { # first word
    $regexUniqueWords = $word
    $first = $false
  } elseif ($word -notmatch $regexUniqueWords) {
    # New unique word found: add it to the regex as an alternation (|)
    $regexUniqueWords += '|' + $word
  }
}

# The regex now contains all unique words, separated by "|".
# Split it into an array of individual words, sort the array again...
$resultWords = $regexUniqueWords.Split('|')
[Array]::Sort($resultWords)

# ... and write it to the output file.
$outFilePath = 'C:\Temp\Results.txt' # Be sure to use a full path.
[IO.File]::WriteAllLines($outFilePath, $resultWords)
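As a quick sanity check, here is a hypothetical run using the sample list from the question (paths are placeholders):

# Create the sample input from the question
Set-Content -Path C:\Temp\Words.txt -Value 'App','Apple','Apply','Sword','Swords','Word','Words'

# ... run the script above ...

Get-Content C:\Temp\Results.txt
# Expected output:
# App
# Word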



Answer 2:


Reducing arbitrary substrings is a bit more complicated than prefix matching, as we can no longer rely on alphabetical sorting.

Instead, you could sort by length and then use a hash set to keep track of patterns that aren't already satisfied by a shorter one:

function Reduce-Wildcard
{
    param(
        [string[]]$Strings,
        [switch]$SkipSort
    )

    # Create a set containing all patterns; this also removes duplicates
    $Patterns = [System.Collections.Generic.HashSet[string]]::new($Strings, [StringComparer]::CurrentCultureIgnoreCase)

    # Now that we only have unique terms, sort them by length
    $Strings = $Patterns |Sort-Object -Property Length

    # Start from the shortest possible pattern
    for ($i = 0; $i -lt ($Strings.Count - 1); $i++) {
        $current = $Strings[$i]
        if(-not $Patterns.Contains($current)){
            # Check that we haven't eliminated current string already
            continue
        }

        # There's no reason to search for this substring 
        # in any of the shorter strings
        $j = $i + 1
        do {
            $next = $Strings[$j]

            if($Patterns.Contains($next)){
                # Do we have a substring match?
                if($next -like "*$current*"){
                    # Eliminate the superstring
                    [void]$Patterns.Remove($next)
                }
            }

            $j++
        } while ($j -lt $Strings.Count)
    }

    # Return the substrings we have left
    return $Patterns
}

Then use it like this:

$strings = [IO.File]::ReadLines('C:\Temp\Words.txt')

$reducedSet = Reduce-Wildcard -Strings $strings

Now, this is definitely not the most space-efficient way of reducing your patterns, but the good news is that you can easily divide-and-conquer a large set of inputs by merging and reducing the intermediate results:

Reduce-Wildcard @(
    Reduce-Wildcard -Strings @('App','Apple')
    Reduce-Wildcard -Strings @('Sword', 'Words')
    Reduce-Wildcard -Strings @('Swords', 'Word')
)
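For the sample words from the question, this merge-and-reduce again yields App and Word.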

Or, in case of multiple files, you can chain successive reductions like this:

$patterns = @()
Get-ChildItem dictionaries\*.txt |ForEach-Object {
  $patterns = Reduce-Wildcard -Strings @(
    $_ |Get-Content
    $patterns
  )
}



Answer 3:


My two cents:

Using -like or regex matching might get expensive in the long run: because they are used in the inner loop of the selection, the number of invocations grows quadratically with the size of the word list. Besides, the pattern of a -like or regex operation might need to be escaped (especially for regex, where e.g. a dot . has a special meaning). I suspect that this question has something to do with checking for password complexity.

Presuming that it doesn't matter whether the output list is in lowercase, I would use the String.Contains() method. Otherwise, if the case of the output does matter, you might prepare a hash table like $List[$Word.ToLower()] = $Word and use it to restore the actual casing at the end, as sketched below.
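Here is a minimal sketch of that case-restoring idea (hypothetical variable names; it keeps the first original-cased spelling seen for each lowercased word):

# Build the lookup before lowercasing $Words for the reduction below
$List = @{}
ForEach ($Word in $Words) {
    $Key = $Word.ToLower()
    If (!$List.ContainsKey($Key)) { $List[$Key] = $Word }
}
# After the reduction, map the lowercase results back to their original casing:
# $Result | ForEach-Object { $List[$_] }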

# Remove empty words, sort by word length, and change everything to lowercase,
# knowing that .Contains() is case-sensitive (and therefore presumably a little faster)
$Words = $Words | Where-Object {$_} | Sort-Object Length | ForEach-Object {$_.ToLower()}
# Start with a list of the shortest words (all words that share the minimum length)
$Result = [System.Collections.ArrayList]@($Words | Where-Object Length -Eq $Words[0].Length)
# Add each word to the list if it doesn't contain any of the already-listed words
ForEach($Word in $Words) {
    If (!$Result.Where({$Word.Contains($_)},'First')) { $Null = $Result.Add($Word) }
}

2020-04-23: updated the script with the suggestion from @Mathias:

You may want to use Where({$Word.Contains($_)},'First') to avoid comparing against all of $Result every time

which is about twice as fast.
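For completeness, a hedged usage sketch around the snippet above (paths are placeholders; $Result holds the reduced list):

# Load the dictionary, then run the reduction snippet above
$Words = [IO.File]::ReadLines('C:\Temp\Words.txt')

# ... the reduction above populates $Result ...

# Persist the reduced list; for the sample input this writes: app, word
[IO.File]::WriteAllLines('C:\Temp\Results.txt', [string[]]$Result)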



Source: https://stackoverflow.com/questions/61367263/advanced-pattern-matching-in-powershell
