Optimize Word document keyword search

鱼传尺愫 2020-12-18 16:28

I'm trying to search for keywords across a large number of MS Word documents, and return the results to a file. I've got a working script, but I wasn't aware of the scale

1 Answer
  • 2020-12-18 17:20

    Right now you're doing this (pseudocode):

    foreach $Keyword {
        create Word Application
        foreach $File {
            load Word Document from $File
            find $Keyword
        }
    }
    

    That means that if you have 100 keywords and 10 documents, you're opening and closing 100 instances of Word and loading a Word document 1,000 times before you're done.

    Do this instead:

    create Word Application
    foreach $File {
        load Word Document from $File
        foreach $Keyword {
            find $Keyword
        }
    }
    

    So you only launch one instance of Word and only load each document once.
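
    The reordered loops could look roughly like this with the Word COM object. This is only a sketch: it assumes Microsoft Word is installed on the machine, and the function name Search-WordDocuments and its parameters are hypothetical, not part of the original script.

    ```powershell
    # Sketch only: assumes Microsoft Word is installed.
    # Search-WordDocuments, $Files and $Keywords are hypothetical names.
    function Search-WordDocuments {
        param(
            [string[]]$Files,      # full paths to the Word documents
            [string[]]$Keywords    # plain-text keywords to look for
        )

        # Launch Word once, outside both loops
        $Word = New-Object -ComObject Word.Application
        $Word.Visible = $false
        $Word.DisplayAlerts = 0    # 0 = wdAlertsNone, suppress prompts

        $Results = @{}
        try {
            foreach ($File in $Files) {
                # Load each document once: Open(FileName, ConfirmConversions, ReadOnly)
                $Document = $Word.Documents.Open($File, $false, $true)
                try {
                    $Results[$File] = @(
                        foreach ($Keyword in $Keywords) {
                            # Content returns a fresh Range on every call,
                            # so each search starts at the top of the document
                            if ($Document.Content.Find.Execute($Keyword)) { $Keyword }
                        }
                    )
                }
                finally {
                    # Nothing was modified, so there is nothing to save
                    $Document.Close()
                }
            }
        }
        finally {
            # Shut Word down even if a document failed to open
            $Word.Quit()
            [void][System.Runtime.InteropServices.Marshal]::ReleaseComObject($Word)
        }

        return $Results
    }
    ```

    Calling it as Search-WordDocuments -Files $Paths -Keywords $Keywords would return a hashtable keyed by file path, with the keywords found in each file as the value.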


    As noted in the comments, you can speed up the whole process further by reading the files with the OpenXML SDK instead of launching Word:

    (assuming you've installed OpenXML SDK in its default location)

    # Import the OpenXML library
    Add-Type -Path 'C:\Program Files (x86)\Open XML SDK\V2.5\lib\DocumentFormat.OpenXml.dll'
    
    # Grab the keywords and the documents ($Path is the root folder to search)
    $Keywords  = Get-Content C:\scratch\CompareData.txt
    $Documents = Get-ChildItem -Path $Path -Recurse -Include *.docx
    
    # hashtable to store results per document
    $KeywordMatches = @{}
    
    # store OpenXML word document type in variable as a shorthand
    $WordDoc = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument] -as [type]
    
    foreach($Docx in $Documents)
    {
        # create array to hold matched keywords
        $KeywordMatches[$Docx.FullName] = @()
    
        # open document, wrap content stream in streamreader 
        $Document       = $WordDoc::Open($Docx.FullName, $false)
        $DocumentStream = $Document.MainDocumentPart.GetStream()
        $DocumentReader = New-Object System.IO.StreamReader $DocumentStream
    
        # read the entire main document part (raw XML, markup included)
        $DocumentContent = $DocumentReader.ReadToEnd()
    
        # test for each keyword
        foreach($Keyword in $Keywords)
        {
            $Pattern   = [regex]::Escape($KeyWord)
            $WordFound = $DocumentContent -match $Pattern
            if($WordFound)
            {
                $KeywordMatches[$Docx.FullName] += $Keyword
            }
        }
    
        $DocumentReader.Dispose()
        $Document.Dispose()
    }
    

    Now you can show the number of matched keywords for each document:

    $KeywordMatches.GetEnumerator() | Select-Object @{n="File";E={$_.Key}},@{n="Count";E={$_.Value.Count}}
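
    Since the question asks for the results to end up in a file, one option is to flatten the hashtable into objects and pass them to Export-Csv. The sample data, the Keywords column and the output path below are assumptions for illustration; in the real script $KeywordMatches would already be populated by the loop above.

    ```powershell
    # Hypothetical sample data standing in for the $KeywordMatches built above
    $KeywordMatches = @{
        'C:\docs\a.docx' = @('alpha', 'beta')
        'C:\docs\b.docx' = @()
    }

    # One row per document: path, number of matched keywords, and the keywords themselves
    $Report = $KeywordMatches.GetEnumerator() | ForEach-Object {
        [pscustomobject]@{
            File     = $_.Key
            Count    = $_.Value.Count
            Keywords = $_.Value -join '; '
        }
    }

    # The output location is an assumption; point it wherever the report should go
    $ReportPath = Join-Path ([System.IO.Path]::GetTempPath()) 'KeywordMatches.csv'
    $Report | Sort-Object File | Export-Csv -Path $ReportPath -NoTypeInformation
    ```

    Documents with no matches still get a row with a count of 0, which makes it easy to see that they were actually scanned.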
    