Question
Wikipedia says:
An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution.
I read the article, but what I don't understand is how k is determined. Is it a function of the table size?
Also, in hash tables I've written, I used a simple but effective algorithm for automatically growing the table. Basically, whenever more than 50% of the buckets were filled, I would double the size of the table. I suspect you might still want to do this with a Bloom filter to reduce false positives. Is that correct?
Answer 1:
Given:
n: how many items you expect to have in your filter (e.g. 216,553)
p: your acceptable false positive rate, between 0 and 1 (e.g. 0.01 → 1%)
we want to calculate:
m: the number of bits needed in the Bloom filter
k: the number of hash functions we should apply
The formulas:
m = -n*ln(p) / (ln(2)^2)    (the number of bits)
k = m/n * ln(2)             (the number of hash functions)
In our case:
m = -216553*ln(0.01) / (ln(2)^2)
  = 997263.4 / 0.480453
  ≈ 2,075,673 bits (round up to 2,075,674; about 253 KiB)
k = m/n * ln(2)
  = 2075674/216553 * 0.693147
  ≈ 6.64 hash functions (rounded to 7 hash functions)
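The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function name `bloom_parameters` and the input checks are assumptions, not part of the original answer. Because the code keeps full floating-point precision, its m can differ by a few bits from a hand calculation done with rounded intermediate values.

```python
import math


def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Return (m, k): filter size in bits and number of hash functions.

    n: expected number of items, p: acceptable false positive rate.
    The name and the validation below are illustrative assumptions.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    # m = -n * ln(p) / ln(2)^2, rounded up to a whole number of bits
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    # k = (m / n) * ln(2), rounded to the nearest integer, never below 1
    k = max(1, round((m / n) * math.log(2)))
    return m, k


if __name__ == "__main__":
    m, k = bloom_parameters(216_553, 0.01)
    print(f"m = {m} bits (about {m / 8 / 1024:.0f} KiB), k = {k}")
```

Rounding m up and k to the nearest integer is a design choice; other implementations may round differently.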
Note: Any code released into public domain. No attribution required.
Answer 2:
If you read further down in the Wikipedia article about Bloom filters, you will find a section titled "Probability of false positives". This section explains how the number of hash functions influences the false positive probability and gives the formula for determining k from the desired false positive probability.
Quote from the Wikipedia article:
Obviously, the probability of false positives decreases as m (the number of bits in the array) increases, and increases as n (the number of inserted elements) increases. For a given m and n, the value of k (the number of hash functions) that minimizes the probability is
k = (m/n) * ln(2)
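To see why a particular k is optimal, one can evaluate the usual approximation for the false positive probability, p = (1 - e^(-k*n/m))^k, for several values of k and pick the smallest. The sketch below does this by brute force; the function names and the search bound `k_max` are assumptions made for illustration.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false positive probability (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def best_k(m: int, n: int, k_max: int = 32) -> int:
    """Brute-force the integer k in [1, k_max] with the lowest rate.

    k_max is an arbitrary illustrative bound, not a recommended limit.
    """
    return min(range(1, k_max + 1),
               key=lambda k: false_positive_rate(m, n, k))


if __name__ == "__main__":
    m, n = 2_075_674, 216_553
    for k in range(4, 11):
        print(k, f"{false_positive_rate(m, n, k):.5f}")
    print("best integer k:", best_k(m, n))
    print("(m/n) * ln(2) =", round(m / n * math.log(2), 2))
```

The rate falls as k grows toward (m/n) * ln(2) and rises again beyond it, which is why the closed-form value is rounded to a nearby integer in practice.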
Answer 3:
And to have it laid out in a neat little table:
http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html
Answer 4:
There is an excellent online Bloom filter calculator.
This interactive Bloom filter calculator lets you estimate the coefficients for your Bloom filter needs. It also shows graphs so you can see the results visually, and it provides all the formulas. For example, these are the formulas behind the calculation for n = 216,553 items with a false positive probability p of 0.01:
n = ceil(m / (-k / log(1 - exp(log(p) / k))))
p = pow(1 - exp(-k / (m / n)), k)
m = ceil((n * log(p)) / log(1 / pow(2, log(2))))
k = round((m / n) * log(2))
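The first of these formulas answers the reverse question: given a filter of m bits with k hash functions, roughly how many items fit before the false positive rate reaches p? A minimal Python sketch follows. The name `max_items` is an assumption, and `log` is taken to mean the natural logarithm, which is what makes the formula consistent with the one for p.

```python
import math


def max_items(m: int, k: int, p: float) -> int:
    """Approximate item count at which a filter of m bits and k hash
    functions reaches false positive rate p.

    Transcribes n = ceil(m / (-k / log(1 - exp(log(p) / k)))),
    i.e. n = -(m / k) * ln(1 - p^(1/k)) rounded up.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.ceil(m / (-k / math.log(1.0 - math.exp(math.log(p) / k))))


if __name__ == "__main__":
    # Filter sized for about 216,553 items at p = 0.01, with k rounded to 7
    print(max_items(2_075_674, 7, 0.01))
```

Because k has been rounded to an integer, the capacity this returns can come out slightly below the n the filter was originally sized for.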
Answer 5:
Given a number of bits per key you want to "invest", the best k is:
max(1, round(bitsPerKey * log(2)))
where max is the higher of the two, round rounds to the nearest integer, and log is the natural logarithm (base e).
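As a quick illustration, the expression translates directly to Python. The function name below is hypothetical, and note that Python's `round` resolves exact .5 ties to the nearest even integer, which may differ from other languages.

```python
import math


def hashes_for_bits_per_key(bits_per_key: float) -> int:
    """Best k for a given bit budget per key: max(1, round(b * ln 2))."""
    return max(1, round(bits_per_key * math.log(2)))


if __name__ == "__main__":
    for bits in (1, 4, 8, 10, 16):
        print(bits, "bits/key ->", hashes_for_bits_per_key(bits), "hash functions")
```

The `max(1, ...)` guard matters only for very small budgets, where rounding would otherwise give zero hash functions.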
Source: https://stackoverflow.com/questions/658439/how-many-hash-functions-does-my-bloom-filter-need