I have a large array with a range of integers that are mostly continuous, e.g. 1-100, 110-160, etc. All integers are positive. What would be the best algorithm to compress thi
I'd combine the answers given by CesarB and Fernando Miguélez.
First, store the differences between each value and the previous one. As CesarB pointed out, this will give you a sequence of mostly ones.
Then, use a Run Length Encoding compression algorithm on this sequence. It will compress very nicely due to the large number of repeated values.
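A minimal PHP sketch of the two steps above, assuming the input is already sorted. The helper names `deltaEncode`, `runLengthEncode` and `decode` are made up for illustration, and storing the first value as a difference from zero is just one possible convention:

```php
<?php
// Sketch only: all three function names are hypothetical.

/** Replace each value with its difference from the previous one (the first is taken from 0). */
function deltaEncode(array $sorted): array
{
    $deltas = [];
    $previous = 0;
    foreach ($sorted as $value) {
        $deltas[] = $value - $previous;
        $previous = $value;
    }
    return $deltas;
}

/** Collapse runs of equal values into [value, count] pairs. */
function runLengthEncode(array $values): array
{
    $runs = [];
    foreach ($values as $value) {
        $last = count($runs) - 1;
        if ($last >= 0 && $runs[$last][0] === $value) {
            $runs[$last][1]++;          // same value again: extend the current run
        } else {
            $runs[] = [$value, 1];      // different value: start a new run
        }
    }
    return $runs;
}

/** Undo both steps: expand each run and keep a running sum of the deltas. */
function decode(array $runs): array
{
    $values = [];
    $current = 0;
    foreach ($runs as [$delta, $count]) {
        for ($i = 0; $i < $count; $i++) {
            $current += $delta;
            $values[] = $current;
        }
    }
    return $values;
}

// The example from the question: 1-100 followed by 110-160.
$input = array_merge(range(1, 100), range(110, 160));
$runs  = runLengthEncode(deltaEncode($input));
// $runs is [[1, 100], [10, 1], [1, 50]]: 151 integers stored as three pairs.
```

The fewer gaps the data has, the fewer pairs are needed; data with many irregular gaps would gain much less from this.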
Compress the string "1-100, 110-160", or store it in some binary representation, and parse it to restore the array.
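As a rough illustration of the range-string idea, here is a PHP sketch. The function names `integersToRangeString` and `rangeStringToIntegers` are hypothetical, and the code assumes a non-empty, sorted array of distinct, non-negative integers (a negative value would clash with the `-` separator):

```php
<?php
// Sketch only: both function names are hypothetical.

/** Turn [1, 2, 3, 7] into "1-3, 7". */
function integersToRangeString(array $sorted): string
{
    $parts = [];
    $start = $previous = $sorted[0];
    foreach (array_slice($sorted, 1) as $value) {
        if ($value !== $previous + 1) {
            // Gap found: close the current range and start a new one.
            $parts[] = ($start === $previous) ? (string) $start : $start . '-' . $previous;
            $start = $value;
        }
        $previous = $value;
    }
    // Close the final range.
    $parts[] = ($start === $previous) ? (string) $start : $start . '-' . $previous;
    return implode(', ', $parts);
}

/** Turn "1-3, 7" back into [1, 2, 3, 7]. */
function rangeStringToIntegers(string $text): array
{
    $values = [];
    foreach (explode(',', $text) as $part) {
        $bounds = array_map('intval', explode('-', trim($part)));
        // A single number has one bound, so first and last are the same.
        $values = array_merge($values, range($bounds[0], $bounds[count($bounds) - 1]));
    }
    return $values;
}

$input  = array_merge(range(1, 100), range(110, 160));
$string = integersToRangeString($input);   // "1-100, 110-160"
$output = rangeStringToIntegers($string);  // same 151 integers as $input
```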
I know this is an old message thread, but I am including my personal PHP test of the SKIP/TAKE idea I found here. I'm calling mine STEP(+)/SPAN(-). Perhaps someone might find it helpful.
NOTE: I implemented the ability to allow duplicate integers as well as negative integers even though the original question involved positive, non-duplicated integers. Feel free to tweak it if you want to try and shave a byte or two.
CODE:
// $integers_array can contain any integers; no floating point, please. Duplicates okay.
$integers_array = [118, 68, -9, 82, 67, -36, 15, 27, 26, 138, 45, 121, 72, 63, 73, -35,
68, 46, 37, -28, -12, 42, 101, 21, 35, 100, 44, 13, 125, 142, 36, 88,
113, -40, 40, -25, 116, -21, 123, -10, 43, 130, 7, 39, 69, 102, 24,
75, 64, 127, 109, 38, 41, -23, 21, -21, 101, 138, 51, 4, 93, -29, -13];
// Order from least to greatest... This routine does NOT save original order of integers.
sort($integers_array, SORT_NUMERIC);
// Start with the least value... NOTE: This removes the first value from the array.
$start = $current = array_shift($integers_array);
// This caps the end of the array, so we can easily get the last step or span value.
array_push($integers_array, $start - 1);
// Create the compressed array...
$compressed_array = [$start];
foreach ($integers_array as $next_value) {
    // Range of $current to $next_value is our "skip" range. I call it a "step" instead.
    $step = $next_value - $current;
    if ($step == 1) {
        // Took a single step; wait to find the end of a series of sequential numbers.
        $current = $next_value;
    } else {
        // Range of $start to $current is our "take" range. I call it a "span" instead.
        $span = $current - $start;
        // If $span is positive, use "negative" to identify these as sequential numbers.
        if ($span > 0) array_push($compressed_array, -$span);
        // If $step is positive, move forward. If $step is zero, the number is a duplicate.
        // A negative $step only happens at the cap value appended above, so nothing is stored.
        if ($step >= 0) array_push($compressed_array, $step);
        // In any case, we are resetting our start of potentially sequential numbers.
        $start = $current = $next_value;
    }
}
// OPTIONAL: The following code attempts to compress things further in a variety of ways.
// A quick check to see what pack size we can use.
$largest_integer = max(max($compressed_array),-min($compressed_array));
if ($largest_integer < pow(2,7)) $pack_size = 'c';
elseif ($largest_integer < pow(2,15)) $pack_size = 's';
elseif ($largest_integer < pow(2,31)) $pack_size = 'l';
elseif ($largest_integer < pow(2,63)) $pack_size = 'q';
else die('Too freaking large, try something else!');
// NOTE: I did not implement the MSB feature mentioned by Marc Gravell.
// I'm just pre-pending the $pack_size as the first byte, so I know how to unpack it.
$packed_string = $pack_size;
// Save compressed array to compressed string and binary packed string.
$compressed_string = '';
foreach ($compressed_array as $value) {
    $compressed_string .= ($value < 0) ? $value : '+'.$value;
    $packed_string .= pack($pack_size, $value);
}
// We can possibly compress it more with gzip if there are lots of similar values.
$gz_string = gzcompress($packed_string);
// These were all just size tests I left in for you.
$base64_string = base64_encode($packed_string);
$gz64_string = base64_encode($gz_string);
$compressed_string = trim($compressed_string,'+'); // Don't need leading '+'.
echo "<hr>\nOriginal Array has "
.count($integers_array)
.' elements: {not showing, since I modified the original array directly}';
echo "<br>\nCompressed Array has "
.count($compressed_array).' elements: '
.implode(', ',$compressed_array);
echo "<br>\nCompressed String has "
.strlen($compressed_string).' characters: '
.$compressed_string;
echo "<br>\nPacked String has "
.strlen($packed_string).' (some probably not printable) characters: '
.$packed_string;
echo "<br>\nBase64 String has "
.strlen($base64_string).' (all printable) characters: '
.$base64_string;
echo "<br>\nGZipped String has "
.strlen($gz_string).' (some probably not printable) characters: '
.$gz_string;
echo "<br>\nBase64 of GZipped String has "
.strlen($gz64_string).' (all printable) characters: '
.$gz64_string;
// NOTICE: The following code reverses the process, starting from the $compressed_array.
// The first value is always the starting value.
$current_value = array_shift($compressed_array);
$uncompressed_array = [$current_value];
foreach ($compressed_array as $val) {
    if ($val < -1) {
        // For ranges that span more than two values, we have to fill in the values.
        $range = range($current_value + 1, $current_value - $val - 1);
        $uncompressed_array = array_merge($uncompressed_array, $range);
    }
    // Add the step value to the $current_value
    $current_value += abs($val);
    // Add the newly-determined $current_value to our list. If $val==0, it is a repeat!
    array_push($uncompressed_array, $current_value);
}
// Display the uncompressed array.
echo "<hr>Reconstituted Array has "
.count($uncompressed_array).' elements: '
.implode(', ',$uncompressed_array).
'<hr>';
OUTPUT:
--------------------------------------------------------------------------------
Original Array has 63 elements: {not showing, since I modified the original array directly}
Compressed Array has 53 elements: -40, 4, -1, 6, -1, 3, 2, 2, 0, 8, -1, 2, -1, 13, 3, 6, 2, 6, 0, 3, 2, -1, 8, -11, 5, 12, -1, 3, -1, 0, -1, 3, -1, 2, 7, 6, 5, 7, -1, 0, -1, 7, 4, 3, 2, 3, 2, 2, 2, 3, 8, 0, 4
Compressed String has 110 characters: -40+4-1+6-1+3+2+2+0+8-1+2-1+13+3+6+2+6+0+3+2-1+8-11+5+12-1+3-1+0-1+3-1+2+7+6+5+7-1+0-1+7+4+3+2+3+2+2+2+3+8+0+4
Packed String has 54 (some probably not printable) characters: cØÿÿÿÿ ÿõ ÿÿÿÿÿÿ
Base64 String has 72 (all printable) characters: Y9gE/wb/AwICAAj/Av8NAwYCBgADAv8I9QUM/wP/AP8D/wIHBgUH/wD/BwQDAgMCAgIDCAAE
GZipped String has 53 (some probably not printable) characters: xœ Ê» ÑÈί€)YšE¨MŠ“^qçºR¬m&Òõ‹%Ê&TFʉùÀ6ÿÁÁ Æ
Base64 of GZipped String has 72 (all printable) characters: eJwNyrsNACAMA9HIzq+AKVmaRahNipNecee6UgSsBW0m0gj1iyXKJlRGjcqJ+cA2/8HBDcY=
--------------------------------------------------------------------------------
Reconstituted Array has 63 elements: -40, -36, -35, -29, -28, -25, -23, -21, -21, -13, -12, -10, -9, 4, 7, 13, 15, 21, 21, 24, 26, 27, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 51, 63, 64, 67, 68, 68, 69, 72, 73, 75, 82, 88, 93, 100, 101, 101, 102, 109, 113, 116, 118, 121, 123, 125, 127, 130, 138, 138, 142
--------------------------------------------------------------------------------
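The code above writes the packed string but only reverses the process from `$compressed_array`. Getting back from the packed string to that array could look roughly like this. It is a sketch that assumes the layout used above (one format byte, then values written with `pack()` in that same format); `unpackCompressed` is a hypothetical helper name:

```php
<?php
// Sketch only: unpackCompressed is a hypothetical helper name.
// Assumes the first byte is the pack format ('c', 's', 'l' or 'q') and the rest
// is a sequence of values written with pack() using that format.
function unpackCompressed(string $packed): array
{
    $format = $packed[0];
    $body   = substr($packed, 1);
    // unpack() returns an array indexed from 1, so re-index it from 0.
    return array_values(unpack($format . '*', $body));
}

// The Base64 string printed in the output above, decoded back to the packed bytes.
$packed_string = base64_decode(
    'Y9gE/wb/AwICAAj/Av8NAwYCBgADAv8I9QUM/wP/AP8D/wIHBgUH/wD/BwQDAgMCAgIDCAAE'
);
$compressed_array = unpackCompressed($packed_string);
// $compressed_array starts with -40, 4, -1, 6, ... and has 53 elements.
```

Note that the `s`, `l` and `q` formats use machine byte order, so a string packed with them should be unpacked on a machine with the same byte order. The gzipped variant would need `gzuncompress()` first.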