Best way to find differences between two large arrays in PHP

后端 未结 2 906
Happy的楠姐
Happy的楠姐 2020-12-25 07:58

I have 2 very large arrays (of size ~2,500,000). I need to find difference between these arrays. By difference I mean I need a resultant array with values that are in array

相关标签:
2条回答
  • 2020-12-25 08:57

    This is the simple algorithm.

    1. Flip 1st array. Values will become keys. So repeated values will be discarded.
    2. Flip 2nd array (optional)
    3. Check for each element in 2nd array if it exists in 1st array.

    As you are working with very large arrays, it'll consume a lot of memory.

    Here is my implementation,

    $a = file("l.a"); // l.a is a file contains 2,500,000 lines
    $b = file("l.b");
    
    function large_array_diff($b, $a){
        // Flipping 
        $at = array_flip($a);
        $bt = array_flip($b); 
        // checking
        $d = array_diff_key($bt, $at);
    
        return array_keys($d);   
    }
    

    I ran it using 4G memory limit. 3G also works. Just tested.

    $ time php -d memory_limit=4G diff_la.php
    

    It took about 11 seconds!.

    real    0m10.612s
    user    0m8.940s
    sys     0m1.460s
    

    UPDATE

    Following code runs 2x faster than large_array_diff function stated above.

    function flip_isset_diff($b, $a) {
        $at = array_flip($a);
        $d = array();
        foreach ($b as $i)
            if (!isset($at[$i])) 
                $d[] = $i;
    
        return $d;
    }
    

    Because it does not call array_flip (1 time), array_diff_key and array_keys. Lots of CPU cycles are saved due to this.

    0 讨论(0)
  • 2020-12-25 08:58

    OK, updated, casting to string for good measure (it makes a LOT of difference if we could use integers instead of strings...):

    <?php
    $start = microtime(true);
    echo 'creating' . PHP_EOL;
    $array1 = array();
    $array2 = array();
    for ($i = 0;$i < 2000000;$i++) {
        $array1[] = (string)rand(100000000, 999999999);
        $array2[] = (string)rand(100000000, 999999999);
    }
    echo (microtime(true) - $start) . PHP_EOL;
    $start = microtime(true);
    echo 'key diff flip' . PHP_EOL;
    $array1f = array_flip($array1);
    $array2f = array_flip($array2);
    echo (microtime(true) - $start) . PHP_EOL;
    $start = microtime(true);
    echo 'key diff' . PHP_EOL;
    $diff = array_diff_key($array1f, $array2f);
    echo (microtime(true) - $start) . PHP_EOL;
    $start = microtime(true);
    echo 'sorting' . PHP_EOL;
    $start = microtime(true);
    sort($array1);
    sort($array2);
    $notin2 = array();
    reset($array2);
    echo (microtime(true) - $start) . PHP_EOL;
    echo 'comparing' . PHP_EOL;
    $start = microtime(true);
    foreach ($array1 as $value) {
        while (current($array2) < $value) {
            if (next($array2) == false) break;
        }
        if (current($array2) != $value) $notin2[] = $value;
    }
    echo (microtime(true) - $start) . PHP_EOL;
    echo 'plain diff' . PHP_EOL;
    $start = microtime(true);
    $diff = array_diff($array1, $array2);
    echo (microtime(true) - $start) . PHP_EOL;
    ?>
    

    Strings

    creating
    8.66658186913
    key diff flip
    3.07359004021
    key diff
    1.48775887489
    sorting
    48.4047858715
    comparing
    9.41756916046
    plain diff
    19.21976614
    
    real    1m37.574s
    user    1m34.970s
    sys     0m1.676s
    

    Integers (removed (string)):

    creating
    4.69975805283
    key diff flip
    2.5843539238
    key diff
    1.4868490696
    sorting
    15.2628200054
    comparing
    5.62516498566
    plain diff
    101.688895941
    
    real    2m18.090s
    user    2m15.112s
    sys     0m1.356s
    

    Much to my surprise..... array_diff hates integers it seems...

    0 讨论(0)
提交回复
热议问题