I\'m trying to search for keywords across a large number of MS Word documents, and return the results to a file. I\'ve got a working script, but I wasn\'t aware of the scale
Right now you're doing this (pseudocode):
foreach $Keyword {
create Word Application
foreach $File {
load Word Document from $File
find $Keyword
}
}
That means that if you have a 100 keywords and 10 documents, you're opening and closing a 100 instances of Word and loading in a thousand word documents before you're done.
Do this instead:
create Word Application
foreach $File {
load Word Document from $File
foreach $Keyword {
find $Keyword
}
}
So you only launch one instance of Word and only load each document once.
As noted in the comments, you may optimize the whole process by using the OpenXML SDK, rather than launching Word:
(assuming you've installed OpenXML SDK in its default location)
# Import the OpenXML library
Add-Type -Path 'C:\Program Files (x86)\Open XML SDK\V2.5\lib\DocumentFormat.OpenXml.dll'
# Grab the keywords and file names
$Keywords = Get-Content C:\scratch\CompareData.txt
$Documents = Get-childitem -path $Path -Recurse -Include *.docx
# hashtable to store results per document
$KeywordMatches = @{}
# store OpenXML word document type in variable as a shorthand
$WordDoc = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument] -as [type]
foreach($Docx in $Docs)
{
# create array to hold matched keywords
$KeywordMatches[$Docx.FullName] = @()
# open document, wrap content stream in streamreader
$Document = $WordDoc::Open($Docx.FullName, $false)
$DocumentStream = $Document.MainDocumentPart.GetStream()
$DocumentReader = New-Object System.IO.StreamReader $DocumentStream
# read entire document
$DocumentContent = $DocumentReader.ReadToEnd()
# test for each keyword
foreach($Keyword in $Keywords)
{
$Pattern = [regex]::Escape($KeyWord)
$WordFound = $DocumentContent -match $Pattern
if($WordFound)
{
$KeywordMatches[$Docx.FullName] += $Keyword
}
}
$DocumentReader.Dispose()
$Document.Dispose()
}
Now, you can show the word count for each document:
$KeywordMatches.GetEnumerator() |Select File,@{n="Count";E={$_.Value.Count}}