问题
I have a program to remove duplicates and everything is working properly. It is just freezing with large data sets i.e. 1 to 2.5 million words.
What is wrong with my approach? Is there a better one?
Sub DeleteDuplicateParagraphs()
Dim p1 As Paragraph
Dim p2 As Paragraph
Dim DupCount As Long
Dim StartTime As Double
Dim SecondsElapsed As Double
StartTime = Timer
For Each p1 In ActiveDocument.Paragraphs
If p1.range.Text <> vbCr Then
For Each p2 In ActiveDocument.Paragraphs
If p1.range.Text = p2.range.Text Then
DupCount = DupCount + 1
If p1.range.Text = p2.range.Text And DupCount > 1 Then p2.range.Delete
End If
Next p2
End If
DupCount = 0
Next p1
SecondsElapsed = Round(Timer - StartTime, 2)
MsgBox "This code ran successfully in " & SecondsElapsed & " seconds", vbInformation
DupCount = 0
End Sub
回答1:
Try this (first add a reference to the Microsoft Scripting Runtime to your VBA project):
Sub DeleteDuplicateParagraphs()
Dim p As Paragraph
Dim d As New Scripting.Dictionary
Dim t As Variant
Dim i As Integer
Dim StartTime As Single
StartTime = Timer
' collect duplicates
For Each p In ActiveDocument.Paragraphs
t = p.Range.Text
If t <> vbCr Then
If Not d.Exists(t) Then d.Add t, New Scripting.Dictionary
d(t).Add d(t).Count + 1, p
End If
Next
' eliminate duplicates
Application.ScreenUpdating = False
For Each t In d
For i = 2 To d(t).Count
d(t)(i).Range.Delete
Next
Next
Application.ScreenUpdating = True
MsgBox "This code ran successfully in " & Round(Timer - StartTime, 2) & " seconds", vbInformation
End Sub
This makes use of the fact that the Scripting.Dictionary
is a hash table that is geared towards very quickly associating unique keys with values. It is therefore very good at spotting duplicate keys. Dictionary keys have to be strings, conveniently we can use the paragraph texts for that.
For values we use more dictionary objects, solely for the fact that they work a lot better than VBA's arrays. In them we collect the references to the actual paragraph instances with the same text.
Actually deleting duplicate paragraphs is a very simple matter afterwards.
Note: The duplicate detection part in the above code is very fast. However, if Word becomes unresponsive in large documents then it's in the duplicate removal part, namely because of Word's undo buffer.
The culprit is that the paragraph ranges are deleted one after another, causing Word to build a very large undo buffer. Unfortunately there is no way (that I know of) to either
- delete multiple separate ranges in one step (which would result in only a single entry in the undo buffer), or
- disable the undo buffer altogether from VBA
Calling UndoClear
periodically in the "eliminate duplicates" loop might help, disabling ScreenUpdating
is also not a bad idea:
' eliminate duplicates
Dim x As Integer
Application.ScreenUpdating = False
For Each t In d
x = x + 1
For i = 2 To d(t).Count
d(t)(i).Range.Delete
Next
If x Mod 50 = 0 Then ActiveDocument.UndoClear
Next
ActiveDocument.UndoClear
Application.ScreenUpdating = True
回答2:
First of all, Just wanted to thank you so much for the time and effort you have put in to helping me.
Your idea behind the method is really impressive. I did change the code slightly and would like you to peruse it when you have the time, to see if it is of optimal standard. Again, I truly thank you, the code ran 20 splits faster than the previous and that is not even over a larger data set.
> Sub DeleteDuplicateParagraphs()
>
> Dim p As Paragraph
> Set d = CreateObject("Scripting.Dictionary")
> Dim t As Variant
> Dim i As Integer
> Dim StartTime As Single
>
> StartTime = Timer
>
> ' collect duplicates For Each p In ActiveDocument.Paragraphs
> t = p.range.Text
> If t <> vbCr Then
> If Not d.Exists(t) Then d.Add t, CreateObject("Scripting.Dictionary")
> d(t).Add d(t).Count + 1, p
> End If Next
>
> ' eliminate duplicates For Each t In d
> For i = 2 To d(t).Count
> d(t)(i).range.Delete
> Next Next
>
> MsgBox "This code ran successfully in " & Round(Timer - StartTime,
> 2) & " seconds", vbInformation
>
> End Sub
来源:https://stackoverflow.com/questions/33562468/duplicate-removal-for-vba-word-not-working-effectively