How many hash functions does my bloom filter need?

我的未来我决定, submitted 2019-11-28 17:37:36
Ian Boyd

Given:

  • n: how many items you expect to have in your filter (e.g. 216,553)
  • p: your acceptable false positive rate, a probability between 0 and 1 (e.g. 0.01 → 1%)

we want to calculate:

  • m: the number of bits needed in the bloom filter
  • k: the number of hash functions we should apply

The formulas:

m = -n*ln(p) / (ln(2))^2    (the number of bits)
k = m/n * ln(2)             (the number of hash functions)

In our case:

  • m = -216553*ln(0.01) / (ln(2))^2 ≈ 997263 / 0.48045 ≈ 2,075,686 bits (about 253 KiB)
  • k = m/n * ln(2) = 2075686/216553 * 0.693147 ≈ 6.64 hash functions (round to 7 hash functions)
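
The two formulas above are easy to wrap in a helper. The following is only a minimal Python sketch, not code from the original answer; the name bloom_parameters and the choice to round m up and k to the nearest integer are my own assumptions.

```python
import math


def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Return (m, k) for n expected items and target false positive rate p.

    Hypothetical helper: m is rounded up to a whole bit, k to the nearest
    whole hash function (but never below 1).
    """
    if n <= 0 or not 0.0 < p < 1.0:
        raise ValueError("need n > 0 and 0 < p < 1")
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # number of bits
    k = max(1, round(m / n * math.log(2)))                # number of hash functions
    return m, k


m, k = bloom_parameters(216_553, 0.01)
print(f"m = {m} bits ({math.ceil(m / 8)} bytes), k = {k}")
```

At full floating-point precision m comes out at 2,075,674 bits, a dozen bits below the hand-rounded figure above, because the worked example divides by the truncated value 0.48045. k is still 7.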

Note: Any code released into public domain. No attribution required.

f3lix

If you read further down in the Wikipedia article about Bloom filters, you will find a section called "Probability of false positives". It explains how the number of hash functions influences the false positive probability, and it gives the formula for determining k from the desired false positive probability.


Quote from the Wikipedia article:

Obviously, the probability of false positives decreases as m (the number of bits in the array) increases, and increases as n (the number of inserted elements) increases. For a given m and n, the value of k (the number of hash functions) that minimizes the probability is

k = (m/n) * ln(2)

And to have it laid out in a neat little table:

http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html
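
If you would rather generate a table of this kind yourself, here is a rough Python sketch. It assumes the standard approximation p = (1 - e^(-k*n/m))^k and reuses the m and n from the worked example in this thread; the function name and the range of k values are just illustrative choices.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false positive rate for m bits, n items, k hash functions."""
    return (1.0 - math.exp(-k * n / m)) ** k


m, n = 2_075_686, 216_553  # values from the worked example in this thread

# Tabulate the rate for a range of k and pick the k with the lowest rate.
rates = {k: false_positive_rate(m, n, k) for k in range(1, 13)}
best_k = min(rates, key=rates.get)

for k, rate in rates.items():
    marker = "  <- minimum" if k == best_k else ""
    print(f"k={k:2d}  p={rate:.5f}{marker}")
```

The minimum lands at k = 7, the integer nearest to (m/n) * ln(2) ≈ 6.64, and the rate climbs again for larger k because extra hash functions fill the bit array faster.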

There is an excellent online Bloom filter calculator.

This interactive Bloom filter calculator lets you estimate the coefficients your Bloom filter needs. It also graphs the results and lists all the formulas it uses. For example, the calculations for n = 216,553 items with a false positive probability p of 0.01:

n = ceil(m / (-k / log(1 - exp(log(p) / k))))
p = pow(1 - exp(-k / (m / n)), k)
m = ceil((n * log(p)) / log(1 / pow(2, log(2))))
k = round((m / n) * log(2))
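
The first of these formulas runs in the opposite direction from the others: given a filter that already exists (m bits, k hash functions), it estimates how many items fit before the false positive rate passes p. Below is a small Python sketch of that formula and of the p formula. The function names are my own, and log is taken to be the natural logarithm.

```python
import math


def capacity(m: int, k: int, p: float) -> int:
    """Items that m bits with k hash functions can hold at false positive rate p."""
    return math.ceil(m / (-k / math.log(1.0 - math.exp(math.log(p) / k))))


def false_positive_rate(m: int, n: int, k: int) -> float:
    """False positive rate for n items in m bits with k hash functions."""
    return (1.0 - math.exp(-k / (m / n))) ** k


m, k, p = 2_075_674, 7, 0.01
n = capacity(m, k, p)
print(f"capacity n = {n}")
print(f"rate at that n = {false_positive_rate(m, n, k):.6f}")
```

With k rounded up to 7, the capacity comes out slightly below 216,553. That is consistent with the rate at n = 216,553 being a hair above 0.01 once k is forced to be an integer.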

Given a number of bits per key you want to "invest", the best k is:

max(1, round(bitsPerKey * log(2))) 

Where max is the higher of the two, round rounds to the nearest integer, log is the natural logarithm (base e).
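
As a quick illustration, the expression above translates almost literally to Python. This is only a sketch: best_k is a made-up name, and the bits-per-key values in the loop are arbitrary examples.

```python
import math


def best_k(bits_per_key: float) -> int:
    """Number of hash functions for a given bits-per-key budget (at least 1)."""
    return max(1, round(bits_per_key * math.log(2)))


for bits in (1, 5, 10, 16):
    print(f"{bits:2d} bits per key -> k = {best_k(bits)}")
```

The max(1, ...) guard matters for very small budgets: below roughly 0.72 bits per key the rounded value would be 0, and a filter with no hash functions reports nothing useful.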
