Explaining The Count Sketch Algorithm

孤街浪徒 2020-12-07 11:50

Can someone explain how the Count Sketch Algorithm works? I still can't figure out how hashes are used, for example. I have a hard time understanding this paper.

3 answers
  •  攒了一身酷
    2020-12-07 12:26

    Count Sketch is a probabilistic data structure which allows you to answer the following question:

    Reading a stream of elements a1, a2, a3, ..., an, where there can be many repeated elements, you want to be able to answer the following question at any time: how many times have you seen a given element ai so far?


    You can clearly get an exact answer at any time just by maintaining a mapping from each element ai to the number of times you've seen it so far. With a hash map, recording a new observation costs O(1) expected time, as does checking the observed count for a given element. However, storing this mapping costs O(k) space, where k is the number of distinct elements.
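As a point of comparison, the exact approach described above might look like the following minimal sketch. The function name `exact_counts` and the sample stream are made up for illustration.

```python
from collections import Counter

def exact_counts(stream):
    """Return an exact element -> count mapping for the stream."""
    counts = Counter()
    for item in stream:
        counts[item] += 1  # O(1) expected time per observation
    return counts

counts = exact_counts(["a", "b", "a", "c", "a", "b"])
print(counts["a"])  # 3
print(counts["z"])  # 0, this element was never seen
print(len(counts))  # 3 distinct elements, so 3 stored entries
```

The memory used grows with the number of distinct elements, which is the cost a sketch is meant to avoid.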


    How is Count Sketch going to help you? As with all probabilistic data structures, you sacrifice certainty for space. Count Sketch allows you to select two parameters: the accuracy of the results (ε) and the probability of a bad estimate (δ).

    To do this you select d hash functions from a pairwise-independent family. These complicated words mean that the functions do not collide too often: if a function maps values onto the range [0, m), then any two distinct values land on any particular pair of outputs with probability 1/m^2, so the probability that they collide is approximately 1/m. Each of these hash functions maps the values to the range [0, w), so you create a d × w matrix of counters, one row per hash function.
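One common way to get such hash functions is the family h(x) = ((a*x + b) mod p) mod w with a prime p and random a, b. The sketch below assumes integer keys smaller than p; the prime `P` and the helper name `make_hash` are illustrative choices, and after the final `mod w` the family is only approximately pairwise independent.

```python
import random

P = 2**61 - 1  # a Mersenne prime, assumed larger than any key we hash

def make_hash(w, rng):
    """Draw one function from the family h(x) = ((a*x + b) mod P) mod w."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % w

rng = random.Random(42)  # fixed seed so runs are repeatable
w = 16
hashes = [make_hash(w, rng) for _ in range(4)]  # d = 4 rows

# One column index per row, each in [0, w)
print([h(12345) for h in hashes])
```

Non-integer keys (strings, tuples) would first have to be mapped to integers, for example through a stable fingerprint of their bytes.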

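    When you read an element, you calculate each of the d hashes of this element and update the corresponding counter in each row of the sketch. The shape of this step is the same for Count Sketch and Count-min Sketch; the difference is that Count-min Sketch always adds 1 to the counter, while Count Sketch adds +1 or -1 according to a second, sign-valued hash of the element.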
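The update step can be sketched as follows for both variants side by side. All names here (`bucket`, `sign`, `cm_table`, `cs_table`, `update`) are hypothetical, keys are assumed to be integers, and the hash family is the simple ((a*x + b) mod P) mod m construction rather than anything prescribed by the paper.

```python
import random

P = 2**61 - 1  # prime modulus for the illustrative hash family

def make_hash(m, rng):
    a, b = rng.randrange(1, P), rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

d, w = 3, 8
rng = random.Random(0)
bucket = [make_hash(w, rng) for _ in range(d)]  # column index for each row
sign = [make_hash(2, rng) for _ in range(d)]    # 0 or 1, read as -1 or +1

cm_table = [[0] * w for _ in range(d)]  # Count-min Sketch counters
cs_table = [[0] * w for _ in range(d)]  # Count Sketch counters

def update(x):
    """Record one occurrence of x in both tables."""
    for j in range(d):
        col = bucket[j](x)
        cm_table[j][col] += 1                        # Count-min: always +1
        cs_table[j][col] += 1 if sign[j](x) else -1  # Count Sketch: +1 or -1

for x in [7, 7, 7, 42]:
    update(x)

for row in cm_table:
    print(row)
```

Each observation touches exactly one counter per row, so an update costs O(d) regardless of how many distinct elements have been seen.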

    Insomniac nicely explained the idea (calculating the expected value) for Count Sketch, so I will just note that with Count-min Sketch everything is even simpler. You just calculate the d hashes of the value you want to look up, read the d counters they point to, and return the smallest of them. Surprisingly, this provides a strong accuracy and probability guarantee, which you can find here.
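Putting the update and the minimum query together, a Count-min Sketch might be sketched like this. The class name `CountMinSketch`, its method names, and the seeded hash parameters are assumptions made for illustration, and keys are assumed to be integers.

```python
import random

class CountMinSketch:
    P = 2**61 - 1  # prime modulus for the illustrative hash family

    def __init__(self, d, w, seed=0):
        rng = random.Random(seed)
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]
        # One (a, b) pair per row defines h_j(x) = ((a*x + b) mod P) mod w
        self.params = [(rng.randrange(1, self.P), rng.randrange(0, self.P))
                       for _ in range(d)]

    def _col(self, j, x):
        a, b = self.params[j]
        return ((a * x + b) % self.P) % self.w

    def add(self, x, count=1):
        """Record `count` occurrences of x: one counter per row."""
        for j in range(self.d):
            self.table[j][self._col(j, x)] += count

    def estimate(self, x):
        """Return the smallest of the d counters that x maps to."""
        return min(self.table[j][self._col(j, x)] for j in range(self.d))

cms = CountMinSketch(d=4, w=64)
stream = [1, 2, 1, 3, 1, 2]
for x in stream:
    cms.add(x)

print(cms.estimate(1))  # never below the true count of 3
```

Taking the minimum works because every counter that an element maps to contains its true count plus whatever collided with it, so the smallest counter is the one least polluted by collisions.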

    Increasing the range of the hash functions (w) improves the accuracy of the results, while increasing the number of hash functions (d) lowers the probability of a bad estimate: for Count-min Sketch, ε = e/w and δ = 1/e^d. Another interesting property of Count-min Sketch is that, as long as counts are only ever incremented, it never underestimates: the value it returns may be larger than the real count because of collisions, but it is never smaller.
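Inverting those two relations gives the table size for a target accuracy: w = ceil(e/ε) and d = ceil(ln(1/δ)). With these sizes, the Count-min estimate exceeds the true count by more than ε times the total number of observations with probability at most δ. The helper name `cms_dimensions` below is made up for illustration.

```python
import math

def cms_dimensions(epsilon, delta):
    """Return (d, w) for a Count-min Sketch with error epsilon and failure probability delta."""
    w = math.ceil(math.e / epsilon)     # from epsilon = e / w
    d = math.ceil(math.log(1 / delta))  # from delta = 1 / e^d
    return d, w

d, w = cms_dimensions(epsilon=0.001, delta=0.01)
print(d, w)   # 5 2719
print(d * w)  # 13595 counters in total
```

Note that the table size depends only on ε and δ, not on the number of distinct elements in the stream.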
