Question
I am currently writing a custom Deflate implementation in C#.
Right now I am trying to implement the "pattern search" part, where I have (up to) 32 KB of data and am searching for the longest possible match for my input.
RFC 1951, which defines Deflate, says the following about that process:
The compressor uses a chained hash table to find duplicated strings, using a hash function that operates on 3-byte sequences. At any given point during compression, let XYZ be the next 3 input bytes to be examined (not necessarily all different, of course). First, the compressor examines the hash chain for XYZ. If the chain is empty, the compressor simply writes out X as a literal byte and advances one byte in the input. If the hash chain is not empty, indicating that the sequence XYZ (or, if we are unlucky, some other 3 bytes with the same hash function value) has occurred recently, the compressor compares all strings on the XYZ hash chain with the actual input data sequence starting at the current point, and selects the longest match.
I do know what a hash function is, and I know what a hash table is as well. But what is a "chained hash table", and how could such a structure be designed to handle a large amount of data efficiently (in C#)? Unfortunately, I didn't understand how the structure described in the RFC works.
What kind of hash function could I choose (what would make sense)?
Thank you in advance!
Answer 1:
A chained hash table is a hash table that stores every item you put in it, even if the keys of 2 items hash to the same value, or even if 2 items have exactly the same key.
A DEFLATE implementation needs to store a bunch of (key, data) items in no particular order and rapidly look up a list of all the items with a given key. In this case, the key is 3 consecutive bytes of uncompressed plaintext, and the data is some sort of pointer or offset to where that 3-byte substring occurs in the plaintext.
Many hashtable/dictionary implementations store both the key and the data for every item. It's not necessary to store the key in the table for DEFLATE, but it doesn't hurt anything other than using slightly more memory during compression.
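To make that concrete, here is a minimal sketch in C# (the class and member names are made up for this example): a Dictionary<int, List<int>> in which each value is the "chain" -- the list of every position where a 3-byte sequence with that hash starts.

```
using System;
using System.Collections.Generic;

// Minimal sketch of a chained hash table for DEFLATE-style matching.
// The "chain" for a hash value is simply a List<int> of all positions in
// the input buffer where a 3-byte sequence with that hash begins.
class ChainedHashTable
{
    private readonly Dictionary<int, List<int>> _chains =
        new Dictionary<int, List<int>>();

    // Pack the 3 bytes at 'pos' into a 24-bit key. With this "hash" there
    // are no collisions at all; a real implementation typically hashes
    // down to fewer bits to keep the table smaller.
    private static int Hash(byte[] data, int pos)
    {
        return (data[pos] << 16) | (data[pos + 1] << 8) | data[pos + 2];
    }

    // Record that a 3-byte sequence starts at 'pos'.
    public void Insert(byte[] data, int pos)
    {
        int h = Hash(data, pos);
        if (!_chains.TryGetValue(h, out List<int> chain))
        {
            chain = new List<int>();
            _chains[h] = chain;
        }
        chain.Add(pos); // append, never overwrite -- that's the "chaining"
    }

    // All previously recorded positions whose 3 bytes hash like those at 'pos'.
    public IReadOnlyList<int> Lookup(byte[] data, int pos)
    {
        return _chains.TryGetValue(Hash(data, pos), out List<int> chain)
            ? (IReadOnlyList<int>)chain
            : Array.Empty<int>();
    }
}
```

The compressor would call Insert for each position as it advances through the input, and Lookup just before deciding between emitting a literal and emitting a match.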
Some hashtable/dictionary implementations such as the C++ STL unordered_map insist that every (key, data) item they store must have a unique key. When you try to store another (key, data) item with the same key as some older item already in the table, these implementations delete the old item and replace it with the new item.

That does hurt -- if you accidentally use the C++ STL unordered_map or a similar implementation, your compressed file will be larger than if you had used a more appropriate library such as the C++ STL hash_multimap.
Such an error may be difficult to detect, since the resulting (unnecessarily large) compressed files can be correctly decompressed by any standard DEFLATE decompressor to a file bit-for-bit identical to the original file.
A few implementations of DEFLATE and other compression algorithms use such a single-entry table deliberately, sacrificing compressed file size in order to gain compression speed.
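In C# terms the difference looks like this (a hypothetical fragment, just to contrast the two behaviours):

```
using System.Collections.Generic;

static class TableStyles
{
    // Single-slot table: each hash remembers only the most recent position.
    // Fast, but earlier (possibly longer) matches are silently discarded.
    public static void RecordSingleSlot(Dictionary<int, int> table,
                                        int hash, int pos)
    {
        table[hash] = pos;   // overwrites any previous entry for this hash
    }

    // Chained table: each hash remembers every position, so the compressor
    // can compare against all candidates and pick the longest match.
    public static void RecordChained(Dictionary<int, List<int>> table,
                                     int hash, int pos)
    {
        if (!table.TryGetValue(hash, out List<int> chain))
            table[hash] = chain = new List<int>();
        chain.Add(pos);      // appends, keeping the older entries
    }
}
```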
As Nick Johnson said, the default hash function used in your standard "hashtable" or "dictionary" implementation is probably more than adequate.
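If you do want something closer to what real DEFLATE implementations use, a shift-and-xor hash over the 3 bytes keeps the table small. The sketch below mirrors the constants zlib uses by default (15 hash bits, shift of 5), but treat it as an illustration rather than a drop-in port:

```
static class DeflateHash
{
    private const int HashBits = 15;                  // 32768 buckets for a 32 KB window
    private const int HashShift = 5;                  // after 3 updates, all 15 bits are covered
    private const int HashMask = (1 << HashBits) - 1;

    // Shift-and-xor hash over the 3 bytes at 'pos', in the spirit of
    // zlib's UPDATE_HASH macro.
    public static int Hash3(byte[] data, int pos)
    {
        int h = data[pos];
        h = ((h << HashShift) ^ data[pos + 1]) & HashMask;
        h = ((h << HashShift) ^ data[pos + 2]) & HashMask;
        return h;
    }
}
```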
http://en.wikipedia.org/wiki/Hashtable#Separate_chaining
Answer 2:
In this case, they're describing a hashtable where each element contains a list of strings - here, all the strings starting with the specified three-character prefix. You should simply be able to use the standard .NET hashtable or dictionary primitives - there's no need to replicate their exact implementation details.
32 KB is not a lot of data, so you don't have to worry about scaling your hashtable - and even if you did, the built-in primitives are likely to be more efficient than anything you could write yourself.
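As a rough illustration of that point, here is how the built-in Dictionary could drive the match search (a sketch using the 24-bit packed hash from the first answer; a real compressor would also cap the chain length and implement lazy matching):

```
using System;
using System.Collections.Generic;

static class MatchFinder
{
    // Find the longest match for the bytes starting at 'pos' among all the
    // earlier positions recorded in 'chains' under the same 3-byte hash.
    // Returns (start, length); a length of 0 means "emit a literal byte".
    public static (int Start, int Length) FindLongestMatch(
        byte[] data, int pos, Dictionary<int, List<int>> chains)
    {
        int bestStart = -1, bestLength = 0;
        int hash = (data[pos] << 16) | (data[pos + 1] << 8) | data[pos + 2];

        if (chains.TryGetValue(hash, out List<int> chain))
        {
            int limit = Math.Min(data.Length - pos, 258); // DEFLATE's maximum match length
            foreach (int start in chain)                  // walk the whole chain
            {
                int length = 0;
                while (length < limit && data[start + length] == data[pos + length])
                    length++;

                if (length > bestLength)
                {
                    bestStart = start;
                    bestLength = length; // with a 24-bit key this is always >= 3
                }
            }
        }
        return (bestStart, bestLength);
    }
}
```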
Source: https://stackoverflow.com/questions/6831399/chained-hash-table-and-understanding-deflate