Implementing run-length encoding

Posted by 假装没事ソ on 2019-12-06 00:53:14

Why don't you encode each $ in the original file as $$ in the compressed file?

And/or use some other character instead of $ as the marker, one that rarely occurs in BMP files.

Also note that the BMP format has RLE compression built in; see the "Image Data and Compression" section of the BMP format description.

I don't know what you're using your program for, or if it's just for learning, but if you used the "official" BMP method, your compressed images wouldn't need decompression before viewing.

AAAAAABBCDEEEEGGHJ$IIIIIIIII ==> $A6$B2CD$E4$G2HJ$$$I9

If the repeat character occurs in the data, insert an extra repeat character in the encoded data. Then, when the decoder sees a doubled repeat character, it outputs a single literal repeat character. (Here the literal $ becomes $$ and is followed directly by the run $I9.)

$A6$B2CD$E4$G2HJ$$$I9 ==> AAAAAABBCDEEEEGGHJ$IIIIIIIII
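The doubling scheme can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the function names `rle_encode` and `rle_decode` are made up, the count is a single ASCII digit (so runs are split at 9), and a literal `$` is never run-encoded but simply doubled each time it occurs.

```python
MARK = "$"  # the repeat character used in the examples


def rle_encode(text: str) -> str:
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        # Measure the run starting at i, capped at 9 so the count fits in one digit.
        run = 1
        while i + run < len(text) and text[i + run] == ch and run < 9:
            run += 1
        if ch == MARK:
            # Literal markers are doubled, one pair per occurrence; never run-encoded.
            out.append(MARK * 2 * run)
        elif run >= 2:
            out.append(f"{MARK}{ch}{run}")
        else:
            out.append(ch)
        i += run
    return "".join(out)


def rle_decode(data: str) -> str:
    out = []
    i = 0
    while i < len(data):
        if data[i] != MARK:
            # Ordinary character: copy it through.
            out.append(data[i])
            i += 1
        elif data[i + 1] == MARK:
            # Doubled marker: one literal marker.
            out.append(MARK)
            i += 2
        else:
            # Marker, character, single-digit count.
            out.append(data[i + 1] * int(data[i + 2]))
            i += 3
    return "".join(out)
```

Note that a literal `$` directly followed by a run produces three markers in a row: `$$` for the literal, then the `$` that starts the run.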

Most programs signal that a character should be treated literally by defining an escape sequence.

For example, in regular expressions, the following are specially defined characters that usually have a meaning:

^[].*+{}()$

Yes, your fun dollar sign character is in there, and it usually means end of line.

To have one of these characters interpreted literally, a programmer using regular expressions writes it as an escape sequence. For example, to match $ as a literal dollar sign, and not end of line, the programmer uses \$, which is the escape sequence.(1)

In your case, you can store literal dollar signs in your compressed file as \$.(2)

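  1. NB: grep's basic regular expressions invert this logic for some characters: ( ) { } are literal by default and become special only when escaped.

  2. The solution above of storing $ as $$ becomes confusing when you have runs of $ in the BMP file.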
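As a rough Python illustration of the backslash idea, the sketch below only handles the literal escaping step, not the run encoding itself. The function names are hypothetical, and it rests on one assumption the answer does not spell out: the backslash must itself be escaped (as `\\`), otherwise a backslash already present in the data would be misread.

```python
def escape_literals(text: str) -> str:
    # Escape the backslash first so backslashes already in the data stay unambiguous,
    # then escape every literal dollar sign.
    return text.replace("\\", "\\\\").replace("$", "\\$")


def unescape_literals(data: str) -> str:
    out = []
    i = 0
    while i < len(data):
        if data[i] == "\\" and i + 1 < len(data):
            # A backslash means: take the next character literally.
            out.append(data[i + 1])
            i += 2
        else:
            out.append(data[i])
            i += 1
    return "".join(out)
```

With this convention a bare `$` in the compressed stream is always a repeat marker, and `\$` is always a literal dollar sign, regardless of how many of them appear in a row.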

If you have the luxury of being able to scan the entire input before starting to compress it, you could choose the least frequent value in the input as your escape value. For example, given this input:

AAAABBCCCCDDEEEEEEEFFG

You could choose "G" as your escape value (or even "H" if it's part of your symbol set) and adopt a convention whereby the first character of the encoded stream is the escape value. So the string above might encode to:

GGA4BBGC4DDGE7FFGG

or even better:

HHA4BBHC4DDHE7FFG

Please note that there's no point in encoding a "run" of two identical values because the "compressed" version (e.g. HD2) is longer than the uncompressed version (DD).
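A small Python sketch of this idea, under a few assumptions: the function names and the `alphabet` parameter (the set of symbols that may be used as an escape value) are hypothetical, counts are a single digit as in the examples above, only runs of three or more are encoded, and a literal occurrence of the escape value is written twice.

```python
from collections import Counter


def encode_with_rare_escape(text: str, alphabet: str) -> str:
    counts = Counter(text)
    # Pick the least frequent symbol of the alphabet (unused symbols count as 0).
    esc = min(alphabet, key=lambda s: counts[s])
    out = [esc]  # header: the first character announces the escape value
    i = 0
    while i < len(text):
        ch = text[i]
        run = 1
        while i + run < len(text) and text[i + run] == ch and run < 9:
            run += 1
        if ch == esc:
            out.append(esc * 2 * run)      # literal escape values are doubled
        elif run >= 3:
            out.append(f"{esc}{ch}{run}")  # escape, symbol, single-digit count
        else:
            out.append(ch * run)           # runs of 1 or 2 are cheaper as-is
        i += run
    return "".join(out)


def decode_with_rare_escape(data: str) -> str:
    esc, body = data[0], data[1:]
    out = []
    i = 0
    while i < len(body):
        if body[i] != esc:
            out.append(body[i])
            i += 1
        elif body[i + 1] == esc:
            out.append(esc)
            i += 2
        else:
            out.append(body[i + 1] * int(body[i + 2]))
            i += 3
    return "".join(out)
```

Passing the alphabet `"ABCDEFG"` makes the sketch pick "G" for the sample input, while `"ABCDEFGH"` makes it pick the unused "H", which is why the second encoding is one character shorter.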

Hope that helps!

If I understand correctly, the problem is that $ is both the symbol that marks a repeat and a value that can occur in the BMP data?

If so, you could use a doubled $ ('$$') to denote that the '$' should be treated not as a repeat marker, but as a single literal '$'. This would of course mean that '$' is expensive to encode (it takes two symbols instead of one), but it would solve your problem.

If you wanted to have a run of the '$' character, you would need to encode it as $$$5: the first '$' marks a run, '$$' stands for the literal '$', and '5' is the count.

I'm honestly not sure what would possess someone to use a text-based RLE if they want to compress binary data with it. A BMP is not text.

Right now, since only a single byte is read for the count, and it is interpreted as an ASCII digit from 0 to 9, the run length is limited to the range 0 to 9, meaning you can only compress up to 9 repetitions before a new run-length flag needs to be written. After all, you can't tell the difference between $I34 for a run length of 34 and $I3 followed by a literal 4.

If this same byte is instead interpreted as a binary value, it can contain values from 0 to 255, which makes a massive difference in efficiency.

As for escaping the $ signs themselves, I'd advise either always treating a $ as a repeat of at least 1 ($$1), or, better yet, encoding the entire thing differently, with the order of the run length and the data swapped, so that a code becomes $<length><data>; then you can use $0 as a special symbol meaning 'just $'. When decompressing, on encountering a 0 after a $, simply don't read a third byte. A run length of 0 should never appear in the compressed data anyway, so it can be given a special meaning. This is useless if the data byte is put first, though, since the code would then still be the same length as a normal repeat.
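Here is a rough Python sketch of the `$<length><data>` layout with a binary count byte and `$0` meaning a single literal `$`. The function names are hypothetical, and the thresholds are my own assumptions: ordinary bytes are run-encoded only for runs of 4 or more (a 3-byte code does not pay off earlier), and a lone `$` uses the 2-byte `$0` form while longer runs of `$` use a normal run code.

```python
MARK = 0x24  # ASCII '$'


def rle_encode_bytes(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        # Measure the run, capped at 255 so the count fits in one byte.
        run = 1
        while i + run < len(data) and data[i + run] == b and run < 255:
            run += 1
        if b == MARK and run == 1:
            out += bytes([MARK, 0])        # "$0": a single literal '$'
        elif run >= 4 or b == MARK:
            out += bytes([MARK, run, b])   # marker, binary length, data byte
        else:
            out += bytes([b]) * run        # short runs are copied unchanged
        i += run
    return bytes(out)


def rle_decode_bytes(packed: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(packed):
        if packed[i] != MARK:
            out.append(packed[i])
            i += 1
        elif packed[i + 1] == 0:
            # Length 0 is the special case: emit '$' and do not read a third byte.
            out.append(MARK)
            i += 2
        else:
            out += bytes([packed[i + 2]]) * packed[i + 1]
            i += 3
    return bytes(out)
```

With a binary count, a run of 34 identical bytes fits in a single 3-byte code, where the digit-based format would need four separate codes.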
