Duplicate removal for VBA Word not working effectively

本小妞迷上赌 提交于 2019-12-23 02:31:10

问题


I have a program to remove duplicates and everything is working properly. It is just freezing with large data sets i.e. 1 to 2.5 million words.

What is wrong with my approach? Is there a better one?

Sub DeleteDuplicateParagraphs()

Dim p1 As Paragraph
Dim p2 As Paragraph
Dim DupCount As Long
Dim StartTime As Double
Dim SecondsElapsed As Double

StartTime = Timer

For Each p1 In ActiveDocument.Paragraphs
  If p1.range.Text <> vbCr Then

    For Each p2 In ActiveDocument.Paragraphs
      If p1.range.Text = p2.range.Text Then
        DupCount = DupCount + 1
        If p1.range.Text = p2.range.Text And DupCount > 1 Then p2.range.Delete
      End If
    Next p2
  End If
  DupCount = 0
Next p1

SecondsElapsed = Round(Timer - StartTime, 2)

MsgBox "This code ran successfully in " & SecondsElapsed & " seconds", vbInformation
DupCount = 0

End Sub

回答1:


Try this (first add a reference to the Microsoft Scripting Runtime to your VBA project):

Sub DeleteDuplicateParagraphs()
  Dim p As Paragraph
  Dim d As New Scripting.Dictionary
  Dim t As Variant
  Dim i As Integer
  Dim StartTime As Single

  StartTime = Timer

  ' collect duplicates
  For Each p In ActiveDocument.Paragraphs
    t = p.Range.Text
    If t <> vbCr Then
      If Not d.Exists(t) Then d.Add t, New Scripting.Dictionary
      d(t).Add d(t).Count + 1, p
    End If
  Next

  ' eliminate duplicates
  Application.ScreenUpdating = False
  For Each t In d
    For i = 2 To d(t).Count
      d(t)(i).Range.Delete
    Next
  Next
  Application.ScreenUpdating = True

  MsgBox "This code ran successfully in " & Round(Timer - StartTime, 2) & " seconds", vbInformation
End Sub

This makes use of the fact that the Scripting.Dictionary is a hash table that is geared towards very quickly associating unique keys with values. It is therefore very good at spotting duplicate keys. Dictionary keys have to be strings, conveniently we can use the paragraph texts for that.

For values we use more dictionary objects, solely for the fact that they work a lot better than VBA's arrays. In them we collect the references to the actual paragraph instances with the same text.

Actually deleting duplicate paragraphs is a very simple matter afterwards.


Note: The duplicate detection part in the above code is very fast. However, if Word becomes unresponsive in large documents then it's in the duplicate removal part, namely because of Word's undo buffer.

The culprit is that the paragraph ranges are deleted one after another, causing Word to build a very large undo buffer. Unfortunately there is no way (that I know of) to either

  • delete multiple separate ranges in one step (which would result in only a single entry in the undo buffer), or
  • disable the undo buffer altogether from VBA

Calling UndoClear periodically in the "eliminate duplicates" loop might help, disabling ScreenUpdating is also not a bad idea:

' eliminate duplicates
Dim x As Integer

Application.ScreenUpdating = False
For Each t In d
  x = x + 1
  For i = 2 To d(t).Count
    d(t)(i).Range.Delete
  Next
  If x Mod 50 = 0 Then ActiveDocument.UndoClear
Next
ActiveDocument.UndoClear
Application.ScreenUpdating = True



回答2:


First of all, Just wanted to thank you so much for the time and effort you have put in to helping me.

Your idea behind the method is really impressive. I did change the code slightly and would like you to peruse it when you have the time, to see if it is of optimal standard. Again, I truly thank you, the code ran 20 splits faster than the previous and that is not even over a larger data set.

   >  Sub DeleteDuplicateParagraphs()
    > 
    >   Dim p As Paragraph 

>     Set d = CreateObject("Scripting.Dictionary")  
>         Dim t As Variant   
>     Dim i As Integer   
>     Dim StartTime As Single
    > 
    >   StartTime = Timer
    > 
    >   ' collect duplicates   For Each p In ActiveDocument.Paragraphs
    >     t = p.range.Text
    >     If t <> vbCr Then
    >       If Not d.Exists(t) Then d.Add t, CreateObject("Scripting.Dictionary")
    >       d(t).Add d(t).Count + 1, p
    >     End If   Next
    > 
    >   ' eliminate duplicates   For Each t In d
    >     For i = 2 To d(t).Count
    >       d(t)(i).range.Delete
    >     Next   Next
    > 
    >   MsgBox "This code ran successfully in " & Round(Timer - StartTime,
    > 2) & " seconds", vbInformation
    > 
    > End Sub


来源:https://stackoverflow.com/questions/33562468/duplicate-removal-for-vba-word-not-working-effectively

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!