What's the most efficient way to check for duplicates in an array of data using Perl?

花落未央 2020-12-05 14:26

I need to check whether an array of strings contains duplicates. What's the most time-efficient way to do it?

7 answers
  •  夕颜
    夕颜 (original poster)
    2020-12-05 15:15

    Turning the array into a hash is the fastest way [O(n)], though it's memory-inefficient. Using a for loop is a bit faster than grep, but I'm not sure why.

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
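    my @array = qw(a b a c b);   # sample input (assumed); substitute your own data
    
    my %count;   # how many times each string has been seen
    my %dups;    # strings seen more than once
    for(@array) {
        # post-increment returns the old count, so this is true from the second sighting on
        $dups{$_}++ if $count{$_}++;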
    }
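
    If all you need is a yes/no answer, the same hash idea can stop at the first repeat instead of counting everything. The sketch below is only an illustration: the names has_duplicates and duplicated_values and the sample data are made up here, and the grep variant is included just to show what the grep form of the loop above might look like.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 as soon as any string repeats, else 0.
sub has_duplicates {
    my %seen;
    for my $item (@_) {
        return 1 if $seen{$item}++;   # second sighting: stop early
    }
    return 0;
}

# Hypothetical grep variant: returns each duplicated value once,
# in the order its second occurrence appears.
sub duplicated_values {
    my %count;
    return grep { $count{$_}++ == 1 } @_;
}

my @sample = qw(apple pear apple fig pear apple);   # made-up data
print has_duplicates(@sample) ? "duplicates found\n" : "all unique\n";
print join(",", duplicated_values(@sample)), "\n";
```

    The early return only helps when a duplicate actually exists; for an array with no repeats it still visits every element and stores every string in the hash.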
    

    A memory-efficient way is to sort the array in place and iterate through it looking for equal and adjacent entries.

    # not exactly sort in place, but Perl does a decent job optimizing it
    @array = sort @array;
    
    my $last;
    my %dups;
    for my $entry (@array) {
        $dups{$entry}++ if defined $last and $entry eq $last;
        $last = $entry;
    }
    

    This runs in O(n log n) time because of the sort, but it only needs to store the duplicates rather than a second copy of the data in %count. Worst-case memory usage is still O(n) (when everything is duplicated), but if your array is large and there aren't many duplicates, you'll win.

    Theory aside, benchmarking shows the sort-based approach starts to lose on large arrays (more than about a million elements) with a high percentage of duplicates.
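
    To try the comparison on your own data, something along these lines with the core Benchmark module should work. This is a rough sketch, not the benchmark referred to above: the sub names, the array size, and the random test data are all assumptions, and it measures time only (the sort here works on a temporary copy, so it says nothing about memory).

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up test data: 100_000 strings drawn from 50_000 possible values,
# so many of them repeat. Adjust both numbers to resemble your real input.
my @data = map { "item" . int(rand 50_000) } 1 .. 100_000;

# Hash approach: count every string, record those seen more than once.
sub dups_by_hash {
    my (%count, %dups);
    for (@_) {
        $dups{$_}++ if $count{$_}++;
    }
    return \%dups;
}

# Sort approach: sort a copy, then compare each entry with its predecessor.
sub dups_by_sort {
    my ($last, %dups);
    for my $entry (sort @_) {
        $dups{$entry}++ if defined $last and $entry eq $last;
        $last = $entry;
    }
    return \%dups;
}

# A negative count asks Benchmark to run each sub for at least that many CPU seconds.
cmpthese(-1, {
    hash => sub { dups_by_hash(@data) },
    sort => sub { dups_by_sort(@data) },
});
```

    Results vary with the Perl version, the string lengths, and the share of duplicates, so it is worth re-running with data that resembles yours.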
