Huffman encoding - header & EOF

偶尔善良 提交于 2019-12-01 10:59:51

I've had to do this once for an assignment and this is the approach we used.

Encoding the header was done by using 0 and 1 to encode the structure of the tree (rather than the frequencies). A "0" denoted moving along the tree, a "1" denoted we were at a leaf node. This resulted in a sort of pre-order traversal of the tree encoding it uniquely.

For example, a tree like (((a b) c) (d e)) would be encoded as "0001a1b1c01d1e", where a,b,c,d,e are their ASCII forms.

Here's the tree in a graphical form:

     / \
   /\   /\
 /\  c d  e
a  b 

For the EOF we used the last 3 bits in the file to specify how many of the last two bytes needed to be read. Once we read the last byte (so we were working on the second last byte) we checked the last 3 bits: They encoded how many more bits to read, minus 6. So 110101xxxxxxx000 meant "read 110101 (6 bits) and discard everything else". 1101011xxxxxx001 meant "read 1101011 (7 bits) and discard the rest", etc.

Doing it this way meant we didn't have to have a special value denoting the EOF and we could still read the file a byte at a time (although we actually needed to read one byte ahead of where we were working).

(For the EOF I haven't read your articles, so I don't know if our idea works with your approach.)

Huffman encoding specifies how to create the Huffman tree from some sequence of characters and then how to encode that into a sequence of bits.

It doesn't specify how should you encode the tree or how to figure out exactly how many bits to read. The exact count of bits is a problem, because you can save only whole bytes into a file. And so you need some way to figure out exactly at which bit to end.

For encoding the tree, there are several options. One of them is recording the count for each character and letting the decoder reconstruct the tree from that. Other option is to directly encode the tree somehow, for example using the 0-1 approach mange (and I assume the articles you mentioned) describes.

Then there is adaptive Huffman coding which doesn't require the tree at all, but is more complicated.

For deciding when to end, you could write the total count of characters into the file and use that to decide when to stop. Or you could get this count for free if you encoded the tree using character counts.

Another option is to use an EOF character. This is a character that is in your Huffman tree, but doesn't encode any normal value. You could imagine it as a 257th value, assuming you are encoding bytes. (You could use some normal value, like the zero byte, for the EOF token, but that would require you to be absolutely sure it won't be present in the input file.)

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!