Excel Highlight Duplicates and Filter by color alternative

Asked by 遇见更好的自我 on 2020-12-21 07:29

My spreadsheet has about 800,000 rows with 30 columns. Customers are interested in duplicate values in only one column, and they need the entire row returned for each duplicate.
3 Answers
  • 2020-12-21 08:10

    Please disregard this submission if you are a) getting paid by the hour and feel underpaid, b) planning on a nap while the routine processes, or c) both a) and b).

    With any data set approaching 800K rows (and 30 columns), you are going to want to step into the variant array arena. In-memory processing typically takes 5-7% of the time it takes to work directly with worksheet values, which makes it very appropriate for large data blocks.

    Anytime that the word 'duplicates' comes into play, I immediately start thinking about how a Scripting.Dictionary object's unique index on its Keys can benefit. In this solution I used a pair of dictionaries to identify the rows of data with a repeated Circle Score value.

    Twenty-four million cells of data is a lot to store and transfer. Bulk methods beat individual methods every time and the bulkiest method of peeling off the data would be to stuff all 800K rows × 30 columns into a variant array. All processing becomes in-memory and the results are returned to a report worksheet en masse.
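
    The bulk pattern in miniature (a generic sketch, not part of the routine below; the A2:AD800001 extent and the "Report" destination sheet are illustrative assumptions):

        Dim vVALs As Variant
        'one bulk read: the whole data block into a variant array
        vVALs = Worksheets("CircleScores").Range("A2:AD800001").Value2
        '... all processing happens against the in-memory array ...
        'one bulk write: the results back to a report worksheet
        Worksheets("Report").Range("A2").Resize(UBound(vVALs, 1), UBound(vVALs, 2)) = vVALs

    The full routine below follows exactly this shape: one read, in-memory dictionary work, one write.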

    isolateDuplicateCircleScores code

    Sub isolateDuplicateCircleScores()
        Dim d As Long, v As Long, csc As Long, stmp As String
        Dim ky As Variant, itm As Variant, vVALs As Variant, dCSs As Object, dDUPs As Object
        Dim w As Long, vWSs As Variant
        'early binding
        'dim dCSs As new scripting.dictionary, dDUPs As new scripting.dictionary
    
        appTGGL bTGGL:=False
    
        'late binding - not necessary with Early Binding (see footnote ¹)
        Set dCSs = CreateObject("Scripting.Dictionary")
        Set dDUPs = CreateObject("Scripting.Dictionary")
    
        'set to the defaults (not necessary)
        dCSs.CompareMode = vbBinaryCompare
        dDUPs.CompareMode = vbBinaryCompare
    
        'for testing on multiple row number scenarios
        'vWSs = Array("CircleScores_8K", "CircleScores_80K", "CircleScores_800K")
        'for runtime
        vWSs = Array("CircleScores")  '<~~ your source worksheet here
    
        For w = LBound(vWSs) To UBound(vWSs)
            'ThisWorkbook.Save
            Debug.Print vWSs(w)
            Debug.Print Timer
            With Worksheets(vWSs(w))
    
                On Error Resume Next
                Worksheets(vWSs(w) & "_dupes").Delete
                On Error GoTo 0
    
                ReDim vVALs(0)
                dCSs.RemoveAll
                dDUPs.RemoveAll
    
                'prep a new worksheet to receive the duplicates
                .Cells(1, 1).CurrentRegion.Resize(2).Copy
                With Worksheets.Add(after:=Worksheets(.Index))
                    .Name = vWSs(w) & "_dupes"
                    With .Cells(1, 1)
                        .PasteSpecial Paste:=xlPasteAllUsingSourceTheme, Operation:=xlNone
                        .PasteSpecial Paste:=xlPasteColumnWidths, Operation:=xlNone
                        .Value = .Value2
                        .Offset(1, 0).EntireRow.ClearContents
                    End With
                End With
                'finish prep with freeze row 1 and zoom to 80%
                With Application.Windows(1)
                    .SplitColumn = 0
                    .SplitRow = 1
                    .FreezePanes = True
                    .Zoom = 80
                End With
    
                'grab all of the data into a variant array
                ReDim vVALs(0)
                csc = Application.Match("CircleScore", .Rows(1), 0) 'CircleScore column number needed later
                vVALs = .Range(.Cells(2, 1), _
                               .Cells(.Cells(Rows.Count, csc).End(xlUp).Row, _
                                      .Cells(1, Columns.Count).End(xlToLeft).Column)).Value2
                'Debug.Print LBound(vVALs, 1) & ":" & UBound(vVALs, 1)  '1:~800K
                'Debug.Print LBound(vVALs, 2) & ":" & UBound(vVALs, 2)  '1:~30
            End With    'done with the original worksheet
    
            'populate the dDUPs dictionary using the key index in dCSs
            For v = LBound(vVALs, 1) To UBound(vVALs, 1)
                If dCSs.exists(vVALs(v, csc)) Then
                    stmp = vVALs(v, 1)
                    For d = LBound(vVALs, 2) + 1 To UBound(vVALs, 2)
                        stmp = Join(Array(stmp, vVALs(v, d)), ChrW(8203))
                    Next d
                    dDUPs.Add Key:=v, Item:=stmp
                    If Not dDUPs.exists(dCSs.Item(vVALs(v, csc))) Then
                        stmp = vVALs(dCSs.Item(vVALs(v, csc)), 1)
                        For d = LBound(vVALs, 2) + 1 To UBound(vVALs, 2)
                            stmp = Join(Array(stmp, vVALs(dCSs.Item(vVALs(v, csc)), d)), ChrW(8203))
                        Next d
                        dDUPs.Add Key:=dCSs.Item(vVALs(v, csc)), Item:=stmp
                    End If
                Else
                    dCSs.Item(vVALs(v, csc)) = v
                End If
            Next v
    
            'split the dDUPs dictionary items back into a variant array
            d = 1
            ReDim vVALs(1 To dDUPs.Count, 1 To UBound(vVALs, 2))
            For Each ky In dDUPs.keys
                itm = Split(dDUPs.Item(ky), ChrW(8203))
                For v = LBound(itm) To UBound(itm)
                    vVALs(d, v + 1) = itm(v)
                Next v
                d = d + 1
            Next ky
    
            'put the values into the duplicates worksheet
            With Worksheets(vWSs(w) & "_dupes")
                .Cells(2, 1).Resize(UBound(vVALs, 1), UBound(vVALs, 2)) = vVALs
                With .Cells(1, 1).CurrentRegion
                    With .Resize(.Rows.Count - 1, .Columns.Count).Offset(1, 0)
                        .Rows(1).Copy
                        .PasteSpecial Paste:=xlPasteFormats, Operation:=xlNone
                    End With
                    .Cells.Sort Key1:=.Columns(csc), Order1:=xlAscending, _
                                Key2:=.Columns(1), Order2:=xlAscending, _
                                Orientation:=xlTopToBottom, Header:=xlYes
                End With
            End With
    
            Debug.Print Timer
        Next w
    
        dCSs.RemoveAll: Set dCSs = Nothing
        dDUPs.RemoveAll: Set dDUPs = Nothing
    
        appTGGL
    End Sub
    
    Public Sub appTGGL(Optional bTGGL As Boolean = True)
        With Application
            .ScreenUpdating = bTGGL
            .EnableEvents = bTGGL
            .DisplayAlerts = bTGGL
            .AutoRecover.Enabled = bTGGL   'no interruptions with an auto-save
            .Calculation = IIf(bTGGL, xlCalculationAutomatic, xlCalculationManual)
            .CutCopyMode = False
            .StatusBar = vbNullString
        End With
        Debug.Print Timer
    End Sub
    

    Sample Data and Results

            800K rows × 30 columns of random sample data

            ~123K rows × 30 columns of duplicate rows (sorted and formatted in about a minute-and-a-half)
    Timed Results

    tbh, I never got the 32-bit version of Excel on the older laptop to run the 800K pass more than once without restarting Excel. Once restarted, the results were consistent with what is shown. The 64-bit Excel ran repeatedly without a hiccup.

    Large Worksheet Addendum

    When dealing with worksheets containing large data blocks, there are a few general improvements that can limit your wait times. You're using Excel as a medium-sized database tool, so treat the data worksheet as the raw data that it should be.

    • If you are not working with a 64-bit version of Excel then you are wasting time with everything you do. See What version of Office am I using? and Choose the 32-bit or 64-bit version of Office.
    • Save as an Excel Binary Workbook (e.g. .XLSB). The file size is typically 25-35% of the original. Load times are improved and some calculation is quicker (sorry, do not have empirical timed data on the latter). Some operations that crash an .XLSX or .XLSM work fine with an .XLSB.
    • Disable Auto-Save/Auto-Recover in the options for the workbook. ([alt]+F, T, S, [alt]+D, [OK]). There are few things more irritating than waiting for an auto-save to complete when you are trying to do something. Get used to Ctrl+S when YOU want to save.
    • Avoid volatile functions² at all costs, particularly in formulas applied across the full scope of the data. A single TODAY() in a COUNTIF formula filled down the full extent of the rows will have you sitting on your thumbs more often than not.
    • Speaking of formulas, revert all formulas to their result values whenever possible.
    • Merged cells, conditional formatting, data validation and making cells look pretty with formatting and styles all slow you down. Minimize the use of anything that takes away from raw data. It isn't like anyone is actually going to look through 800K rows of data.
    • After removing data, use Home ► Editing ► Clear ► Clear All on the vacant cells. Tapping the Del key only clears the contents and may not reset the Worksheet.UsedRange property; Clear All will facilitate resetting the .UsedRange on the next save.
    • If you have hooped your computer with one or more Excel [Not Responding] scenarios, reboot your machine. Excel never fully recovers from these and simply restarting Excel to start over is slower and more likely to enter the same Not Responding condition later.

    ¹ To convert the late binding of the Scripting.Dictionary to early binding, you must add Microsoft Scripting Runtime under the VBE's Tools ► References.
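
    With that reference set, the CreateObject calls can be dropped and the dictionaries declared and instantiated in one step (the early-bound form sketched in the commented-out lines of the routine):

        'early binding: compile-time checking and IntelliSense on dictionary members
        Dim dCSs As New Scripting.Dictionary, dDUPs As New Scripting.Dictionary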

    ² Volatile functions recalculate whenever anything in the entire workbook changes, not just when something that affects their outcome changes. Examples of volatile functions are INDIRECT, OFFSET, TODAY, NOW, RAND and RANDBETWEEN. Some sub-functions of the CELL and INFO worksheet functions will make them volatile as well.

  • 2020-12-21 08:10


    Try this VBA code (and learn a little Dutch):

    Sub DuplicatesInColumn()
        'builds a list of the duplicate rows found, copied to the right of the data
        Dim LaatsteRij As Long                          'last row
        Dim MatchNr As Long
        Dim iRij As Long, Teller As Long                'row / column counters
        Dim iKolom As Long, ControlKolom As Long

        iKolom = 5                   'number of columns in the sheet; change if not correct
        ControlKolom = 4             'column number in which to find the duplicates; change if not correct

        Sheet1.Activate
        LaatsteRij = Cells(Rows.Count, iKolom).End(xlUp).Row

        For iRij = 1 To LaatsteRij
            If Cells(iRij, ControlKolom) <> "" Then
                'the first match for this value; a later row number means this row is a duplicate
                MatchNr = WorksheetFunction.Match(Cells(iRij, ControlKolom), _
                          Range(Cells(1, ControlKolom), Cells(LaatsteRij, ControlKolom)), 0)
                If iRij <> MatchNr Then
                    For Teller = 1 To iKolom
                        Cells(iRij, iKolom + Teller).Offset(0, 2).Value = Cells(iRij, Teller).Value
                    Next Teller
                End If
            End If
        Next iRij
    End Sub
    
  • 2020-12-21 08:25

    I would create an Is_Duplicated indicator column and use that to filter the duplicated CircleScores:
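
    A minimal sketch of such an indicator formula, assuming the CircleScore values sit in column D with a header in row 1 and the helper column is headed Is_Duplicated:

        =COUNTIF($D$2:$D$800001, D2)>1

    Fill the formula down the helper column and then filter that column for TRUE. Note that COUNTIF over 800K rows is expensive, which is what the sorted alternative below addresses.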


    UPDATE (per comments):

    Alternatively, you can sort the CircleScore column and make the formula a bit less taxing on your system (note: CircleScore must be sorted beforehand):
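
    With the data sorted ascending on CircleScore (again assuming it is in column D), a neighbor comparison avoids scanning the whole column for every row:

        =OR(D2=D1, D2=D3)

    A row is flagged when its CircleScore matches the row directly above or below it, so each formula reads at most three cells instead of the entire 800K-row range.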

    0 讨论(0)