Noticing that byte-pair encoding (BPE) is sorely lacking from the large text compression benchmark, I very quickly made a trivial literal implementation of
You can also optimize the dictionary so that:
AA1BB2CC3DD4EE5FF6GG7HH8
is a sequential run of 8 token.
Rewrite that as:
AA1<255>BBCCDDEEFFGGHH<255>
where the <255>
tells the program that each of the following byte pairs (up to the next <255>
) are sequential and incremented by one. Works great for text
files and any where there are at least 4 sequential tokens.
save 175 bytes on recent test.