Script to find duplicates in a csv file

前端 未结 5 1173
旧巷少年郎
旧巷少年郎 2021-01-17 17:02

I have a 40 MB csv file with 50,000 records. Its a giant product listing. Each row has close to 20 fields. [Item#, UPC, Desc, etc]

How can I,

a) Find and Pri

5条回答
  •  暗喜
    暗喜 (楼主)
    2021-01-17 17:47

    For the second part: read the file with Text::CSV into a hash keyed on your unique key(s), check whether a value exists for the hash before adding it. Something like this:

    data (doesn't need to be sorted), in this example we need the first two columns to be unique:

    1142,X426,Name1,Thing1
    1142,X426,Name2,Thing2
    1142,X426,Name3,Thing3
    1142,X426,Name4,Thing4
    1144,X427,Name5,Thing5
    1144,X427,Name6,Thing6
    1144,X427,Name7,Thing7
    1144,X427,Name8,Thing8
    

    code:

    use strict;
    use warnings;
    use Text::CSV;
    
    my %data;
    my %dupes;
    my @rows;
    my $csv = Text::CSV->new ()
                            or die "Cannot use CSV: ".Text::CSV->error_diag ();
    
    open my $fh, "<", "data.csv" or die "data.csv: $!";
    while ( my $row = $csv->getline( $fh ) ) {
        # insert row into row list  
        push @rows, $row;
        # join the unique keys with the
        # perl 'multidimensional array emulation' 
        # subscript  character
        my $key = join( $;, @{$row}[0,1] ); 
        # if it was just one field, just use
        # my $key = $row->[$keyfieldindex];
        # if you were checking for full line duplicates (header lines):
        # my $key = join($;, @$row);
        # if %data has an entry for the record, add it to dupes
        if (exists $data{$key}) { # duplicate 
            # if it isn't already duplicated
            # add this row and the original 
            if (not exists $dupes{$key}) {
                push @{$dupes{$key}}, $data{$key};
            }
            # add the duplicate row
            push @{$dupes{$key}}, $row;
        } else {
            $data{ $key } = $row;
        }
    }
    
    $csv->eof or $csv->error_diag();
    close $fh;
    # print out duplicates:
    warn "Duplicate Values:\n";
    warn "-----------------\n";
    foreach my $key (keys %dupes) {
        my @keys = split($;, $key);
        warn "Key: @keys\n";
        foreach my $dupe (@{$dupes{$key}}) {
            warn "\tData: @$dupe\n";
        }
    }
    

    Which prints out something like this:

    Duplicate Values:
    -----------------
    Key: 1142 X426
        Data: 1142 X426 Name1 Thing1
        Data: 1142 X426 Name2 Thing2
        Data: 1142 X426 Name3 Thing3
        Data: 1142 X426 Name4 Thing4
    Key: 1144 X427
        Data: 1144 X427 Name5 Thing5
        Data: 1144 X427 Name6 Thing6
        Data: 1144 X427 Name7 Thing7
        Data: 1144 X427 Name8 Thing8
    

提交回复
热议问题