Best Compression algorithm for a sequence of integers

离开以前 2020-11-29 16:41

I have a large array with a range of integers that are mostly continuous, e.g. 1-100, 110-160, etc. All integers are positive. What would be the best algorithm to compress thi

15 answers
  • 2020-11-29 17:40

    I'd combine the answers given by CesarB and Fernando Miguélez.

    First, store the differences between each value and the previous one. As CesarB pointed out, this will give you a sequence of mostly ones.

    Then, use a Run Length Encoding compression algorithm on this sequence. It will compress very nicely due to the large number of repeated values.
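
    A minimal PHP sketch of that two-step idea, assuming the input is already sorted and is a plain 0-indexed list. The helper names `delta_rle_encode` and `delta_rle_decode` are hypothetical and not taken from any answer, and the `[delta, count]` pair layout is just one possible way to store the runs (PHP 7.1+ for the array destructuring in `foreach`).

    ```php
    <?php
    // Hypothetical helper: delta-encode a sorted integer list, then run-length
    // encode the deltas as [delta, count] pairs. The first value is kept as-is.
    function delta_rle_encode(array $sorted): array
    {
        if ($sorted === []) {
            return ['first' => null, 'runs' => []];
        }
        $runs = [];
        $previous = $sorted[0];
        for ($i = 1, $n = count($sorted); $i < $n; $i++) {
            $delta = $sorted[$i] - $previous;
            $previous = $sorted[$i];
            $last = count($runs) - 1;
            if ($last >= 0 && $runs[$last][0] === $delta) {
                $runs[$last][1]++;        // same delta again: extend the current run
            } else {
                $runs[] = [$delta, 1];    // different delta: start a new run
            }
        }
        return ['first' => $sorted[0], 'runs' => $runs];
    }

    // Hypothetical helper: reverse of delta_rle_encode().
    function delta_rle_decode(array $encoded): array
    {
        if ($encoded['first'] === null) {
            return [];
        }
        $current = $encoded['first'];
        $values = [$current];
        foreach ($encoded['runs'] as [$delta, $count]) {
            for ($i = 0; $i < $count; $i++) {
                $current += $delta;
                $values[] = $current;
            }
        }
        return $values;
    }

    // The ranges from the question: 1-100 and 110-160.
    $input = array_merge(range(1, 100), range(110, 160));
    $encoded = delta_rle_encode($input);
    // $encoded['runs'] is [[1, 99], [10, 1], [1, 50]]
    $decoded = delta_rle_decode($encoded);
    ```

    For the question's example, 151 integers collapse to a start value plus three runs. How well this works in practice depends on how long the continuous stretches really are.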

  • 2020-11-29 17:42

    Compress the string "1-100, 110-160" itself, or store the ranges in some binary representation, and parse it to restore the array.
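
    A rough PHP sketch of the string form, assuming a sorted list of distinct positive integers (as in the question), so that "-" can safely act as the range separator. The helper names `ranges_to_string` and `string_to_ranges` are made up for illustration.

    ```php
    <?php
    // Hypothetical helper: collapse a sorted list of distinct positive integers
    // into a string such as "1-100, 110-160".
    function ranges_to_string(array $sorted): string
    {
        $parts = [];
        $n = count($sorted);
        $i = 0;
        while ($i < $n) {
            $j = $i;
            // Advance $j while the next value continues the consecutive run.
            while ($j + 1 < $n && $sorted[$j + 1] === $sorted[$j] + 1) {
                $j++;
            }
            if ($i === $j) {
                $parts[] = (string) $sorted[$i];               // isolated value
            } else {
                $parts[] = $sorted[$i] . '-' . $sorted[$j];    // low-high range
            }
            $i = $j + 1;
        }
        return implode(', ', $parts);
    }

    // Hypothetical helper: expand the string back into the full list.
    function string_to_ranges(string $text): array
    {
        $values = [];
        if (trim($text) === '') {
            return $values;
        }
        foreach (explode(',', $text) as $part) {
            $bounds = explode('-', trim($part));
            $low = (int) $bounds[0];
            $high = (int) ($bounds[1] ?? $bounds[0]);   // single value: low == high
            for ($v = $low; $v <= $high; $v++) {
                $values[] = $v;
            }
        }
        return $values;
    }

    $input = array_merge(range(1, 100), range(110, 160));
    $text = ranges_to_string($input);       // "1-100, 110-160"
    $restored = string_to_ranges($text);
    ```

    The resulting text could then be passed through a general-purpose compressor if it is still too large; whether that pays off depends on how many ranges there are.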

  • 2020-11-29 17:46

    I know this is an old message thread, but I am including my personal PHP test of the SKIP/TAKE idea I found here. I'm calling mine STEP(+)/SPAN(-). Perhaps someone might find it helpful.

    NOTE: I implemented the ability to allow duplicate integers as well as negative integers even though the original question involved positive, non-duplicated integers. Feel free to tweak it if you want to try and shave a byte or two.

    CODE:

      // $integers_array can contain any integers; no floating point, please. Duplicates okay.
      $integers_array = [118, 68, -9, 82, 67, -36, 15, 27, 26, 138, 45, 121, 72, 63, 73, -35,
                        68, 46, 37, -28, -12, 42, 101, 21, 35, 100, 44, 13, 125, 142, 36, 88,
                        113, -40, 40, -25, 116, -21, 123, -10, 43, 130, 7, 39, 69, 102, 24,
                        75, 64, 127, 109, 38, 41, -23, 21, -21, 101, 138, 51, 4, 93, -29, -13];
    
      // Order from least to greatest... This routine does NOT save original order of integers.
      sort($integers_array, SORT_NUMERIC); 
    
      // Start with the least value... NOTE: This removes the first value from the array.
      $start = $current = array_shift($integers_array);    
    
      // This caps the end of the array, so we can easily get the last step or span value.
      array_push($integers_array, $start - 1);
    
      // Create the compressed array...
      $compressed_array = [$start];
      foreach ($integers_array as $next_value) {
        // Range of $current to $next_value is our "skip" range. I call it a "step" instead.
        $step = $next_value - $current;
        if ($step == 1) {
            // Took a single step; wait to find the end of a series of sequential numbers.
            $current = $next_value;
        } else {
            // Range of $start to $current is our "take" range. I call it a "span" instead.
            $span = $current - $start;
            // If $span is positive, use "negative" to identify these as sequential numbers. 
            if ($span > 0) array_push($compressed_array, -$span);
            // If $step is positive, move forward. If $step is zero, the number is duplicate.
            if ($step >= 0) array_push($compressed_array, $step);
            // In any case, we are resetting our start of potentially sequential numbers.
            $start = $current = $next_value;
        }
      }
    
      // OPTIONAL: The following code attempts to compress things further in a variety of ways.
    
      // A quick check to see what pack size we can use.
      $largest_integer = max(max($compressed_array),-min($compressed_array));
      if ($largest_integer < pow(2,7)) $pack_size = 'c';
      elseif ($largest_integer < pow(2,15)) $pack_size = 's';
      elseif ($largest_integer < pow(2,31)) $pack_size = 'l';
      elseif ($largest_integer < pow(2,63)) $pack_size = 'q';
      else die('Too freaking large, try something else!');
    
      // NOTE: I did not implement the MSB feature mentioned by Marc Gravell.
      // I'm just pre-pending the $pack_size as the first byte, so I know how to unpack it.
      $packed_string = $pack_size;
    
      // Save compressed array to compressed string and binary packed string.
      $compressed_string = '';
      foreach ($compressed_array as $value) {
          $compressed_string .= ($value < 0) ? $value : '+'.$value;
          $packed_string .= pack($pack_size, $value);
      }
    
      // We can possibly compress it more with gzip if there are lots of similar values.      
      $gz_string = gzcompress($packed_string);
    
      // These were all just size tests I left in for you.
      $base64_string = base64_encode($packed_string);
      $gz64_string = base64_encode($gz_string);
      $compressed_string = trim($compressed_string,'+');  // Don't need leading '+'.
      echo "<hr>\nOriginal Array has "
        .count($integers_array)
        .' elements: {not showing, since I modified the original array directly}';
      echo "<br>\nCompressed Array has "
        .count($compressed_array).' elements: '
        .implode(', ',$compressed_array);
      echo "<br>\nCompressed String has "
        .strlen($compressed_string).' characters: '
        .$compressed_string;
      echo "<br>\nPacked String has "
        .strlen($packed_string).' (some probably not printable) characters: '
        .$packed_string;
      echo "<br>\nBase64 String has "
        .strlen($base64_string).' (all printable) characters: '
        .$base64_string;
      echo "<br>\nGZipped String has "
        .strlen($gz_string).' (some probably not printable) characters: '
        .$gz_string;
      echo "<br>\nBase64 of GZipped String has "
        .strlen($gz64_string).' (all printable) characters: '
        .$gz64_string;
    
      // NOTICE: The following code reverses the process, starting from the $compressed_array.
    
      // The first value is always the starting value.
      $current_value = array_shift($compressed_array);
      $uncompressed_array = [$current_value];
      foreach ($compressed_array as $val) {
        if ($val < -1) {
          // For ranges that span more than two values, we have to fill in the values.
          $range = range($current_value + 1, $current_value - $val - 1);
          $uncompressed_array = array_merge($uncompressed_array, $range);
        }
        // Add the step value to the $current_value
        $current_value += abs($val); 
        // Add the newly-determined $current_value to our list. If $val==0, it is a repeat!
        array_push($uncompressed_array, $current_value);      
      }
    
      // Display the uncompressed array.
      echo "<hr>Reconstituted Array has "
        .count($uncompressed_array).' elements: '
        .implode(', ',$uncompressed_array).
        '<hr>';
    

    OUTPUT:

    --------------------------------------------------------------------------------
    Original Array has 63 elements: {not showing, since I modified the original array directly}
    Compressed Array has 53 elements: -40, 4, -1, 6, -1, 3, 2, 2, 0, 8, -1, 2, -1, 13, 3, 6, 2, 6, 0, 3, 2, -1, 8, -11, 5, 12, -1, 3, -1, 0, -1, 3, -1, 2, 7, 6, 5, 7, -1, 0, -1, 7, 4, 3, 2, 3, 2, 2, 2, 3, 8, 0, 4
    Compressed String has 110 characters: -40+4-1+6-1+3+2+2+0+8-1+2-1+13+3+6+2+6+0+3+2-1+8-11+5+12-1+3-1+0-1+3-1+2+7+6+5+7-1+0-1+7+4+3+2+3+2+2+2+3+8+0+4
    Packed String has 54 (some probably not printable) characters: {binary data, not shown}
    Base64 String has 72 (all printable) characters: Y9gE/wb/AwICAAj/Av8NAwYCBgADAv8I9QUM/wP/AP8D/wIHBgUH/wD/BwQDAgMCAgIDCAAE
    GZipped String has 53 (some probably not printable) characters: {binary data, not shown}
    Base64 of GZipped String has 72 (all printable) characters: eJwNyrsNACAMA9HIzq+AKVmaRahNipNecee6UgSsBW0m0gj1iyXKJlRGjcqJ+cA2/8HBDcY=
    --------------------------------------------------------------------------------
    Reconstituted Array has 63 elements: -40, -36, -35, -29, -28, -25, -23, -21, -21, -13, -12, -10, -9, 4, 7, 13, 15, 21, 21, 24, 26, 27, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 51, 63, 64, 67, 68, 68, 69, 72, 73, 75, 82, 88, 93, 100, 101, 101, 102, 109, 113, 116, 118, 121, 123, 125, 127, 130, 138, 138, 142
    --------------------------------------------------------------------------------
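
    The listing above reverses the process starting from `$compressed_array`, not from the packed string. A possible sketch of that missing step, assuming the layout used above (first byte is the `pack()` format code, followed by the packed values); `unpack_compressed` is a hypothetical name, and the sample input is the Base64 string from the output above.

    ```php
    <?php
    // Hypothetical helper: rebuild the STEP/SPAN array from a packed string
    // whose first byte is the pack() format code ('c', 's', 'l' or 'q').
    function unpack_compressed(string $packed): array
    {
        if ($packed === '') {
            throw new InvalidArgumentException('Empty packed string');
        }
        $format = $packed[0];
        if (!in_array($format, ['c', 's', 'l', 'q'], true)) {
            throw new InvalidArgumentException("Unknown pack format: $format");
        }
        // unpack() returns a 1-indexed array; array_values() renumbers it from 0.
        return array_values(unpack($format . '*', substr($packed, 1)));
    }

    // Base64 string taken from the OUTPUT section above.
    $base64_string = 'Y9gE/wb/AwICAAj/Av8NAwYCBgADAv8I9QUM/wP/AP8D/wIHBgUH/wD/BwQDAgMCAgIDCAAE';
    $restored_compressed = unpack_compressed(base64_decode($base64_string));
    // $restored_compressed has 53 elements and starts with -40, 4, -1, 6
    ```

    Note that the `s`, `l` and `q` codes use machine byte order, so a string packed on one machine may not unpack correctly on another with a different byte order; the `c` code used in this run is a single byte and does not have that problem.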
    