Bloom filter and Cassandra: why is it used, and why hash several times?

Submitted by 谁说我不能喝 on 2019-12-04 04:32:51

1) Yes, see this in the Cassandra wiki:

Cassandra uses bloom filters to save IO when performing a key lookup: each SSTable has a bloom filter associated with it that Cassandra checks before doing any disk seeks, making queries for keys that don't exist almost free

The columns of a key may be spread out over several SSTables. If it weren't for Bloom filters, every read of a key would have to read every SSTable, which is prohibitively expensive. By using Bloom filters, Cassandra almost always only has to look in the SSTables that contain data for that key.
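The read path described above can be sketched roughly as follows. This is only an illustration, not Cassandra's actual code: `TinyBloom`, `SSTable`, `lookup` and the `disk_reads` counter are hypothetical names, the "disk" is a plain dict, and the salted SHA-256 hashing is just a convenient stand-in.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter (hypothetical; not Cassandra's implementation)."""

    def __init__(self, m=64, k=3):
        self.m = m
        self.k = k
        self.bits = [False] * m

    def _positions(self, key):
        # Derive k positions by salting a single hash with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))


class SSTable:
    """Stand-in for an on-disk table: a dict plus its own Bloom filter."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.bloom = TinyBloom()
        for key in self.rows:
            self.bloom.add(key)
        self.disk_reads = 0  # counts simulated disk seeks

    def read(self, key):
        self.disk_reads += 1
        return self.rows.get(key)


def lookup(sstables, key):
    """Collect the values for key, skipping tables whose filter says no."""
    found = []
    for table in sstables:
        if not table.bloom.might_contain(key):
            continue  # definitely absent: no disk seek needed
        value = table.read(key)
        if value is not None:
            found.append(value)
    return found


tables = [
    SSTable({"alice": "col1=a"}),
    SSTable({"bob": "col1=b"}),
    SSTable({"alice": "col2=c"}),
]
result = lookup(tables, "alice")
```

A table is only read when its filter answers "maybe", so a table that does not hold the key is usually skipped without any simulated seek; an occasional false positive costs one wasted read but never a wrong result.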

2) This might give you a better understanding of Bloom filters. You hash each element k times to get k positions in a bit array of size m (indexed as m[i] below). For example, if A and B are the elements in the set, k = 2, your hash functions are md5 and sha1, and m = 16, you can compute:

md5(A) % m = 7
sha1(A) % m = 12

md5(B)  % m = 15
sha1(B)  % m = 12

This sets m[7], m[12] and m[15] to true in the filter.

To see if C is in the set, you compute

md5(C)  % m = 8
sha1(C) % m = 12

Since m[8] is false, you know C is not in the set. However, for D:

md5(D)  % m = 7
sha1(D)  % m = 15

Both m[7] and m[15] are true, but D is not in the set, so D is a false positive.
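The walk-through above can be reproduced with a minimal sketch like the one below, assuming md5 and sha1 from Python's `hashlib` and m = 16. The class and method names are made up for illustration, and the real bit positions for "A", "B", "C" and "D" will generally differ from the illustrative numbers used above.

```python
import hashlib


class BloomFilter:
    """k = 2 Bloom filter using md5 and sha1, as in the worked example."""

    def __init__(self, m=16):
        self.m = m
        self.bits = [False] * m
        self.hashes = (hashlib.md5, hashlib.sha1)

    def positions(self, item):
        # One position per hash function: hash(item) % m.
        data = item.encode("utf-8")
        return [
            int.from_bytes(h(data).digest(), "big") % self.m
            for h in self.hashes
        ]

    def add(self, item):
        for p in self.positions(item):
            self.bits[p] = True

    def might_contain(self, item):
        # False means "definitely not in the set";
        # True means "possibly in the set" (may be a false positive).
        return all(self.bits[p] for p in self.positions(item))


bf = BloomFilter(m=16)
bf.add("A")
bf.add("B")

for item in ("A", "B", "C", "D"):
    print(item, bf.positions(item), bf.might_contain(item))
```

Elements that were added always report `True`. For anything else, a single unset bit is enough to rule the element out, while `True` only means that every probed bit happened to be set by other elements.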

This does cost CPU cycles, but you are trading CPU cycles for reduced I/O, which makes sense for Cassandra.
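As for why several hashes are used rather than one: with n elements stored in m bits, the standard approximation for the false positive rate is (1 - e^(-k*n/m))^k, which is minimised near k = (m/n) * ln 2. The sketch below just evaluates that formula; the function names and the sizing (1000 bits, 100 keys) are assumptions chosen for illustration, not Cassandra settings.

```python
import math


def false_positive_rate(m, n, k):
    """Approximate false positive rate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m, n):
    """Number of hashes that minimises the approximation: (m/n) * ln 2."""
    return max(1, round(m / n * math.log(2)))


# Hypothetical sizing: 1000 bits, 100 stored keys.
for k in (1, 2, 4, 7, 12):
    print(k, round(false_positive_rate(1000, 100, k), 4))
```

With a fixed amount of memory, a few extra hashes lower the false positive rate considerably, but past the optimum the array fills up with set bits and the rate climbs again.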

3) The article doesn't mention MD5. MD5 output is uniformly distributed, and I would guess the difference between MD5 and SHA-1 for Bloom filters is not large.

As an addition to the third point of the answer by sbridges:

MD5 and SHA-1 produce well-distributed output, but they are cryptographic hash functions and are comparatively slow. When implementing any type of Bloom filter, the main CPU bottleneck is usually the time spent hashing, which is why using cryptographic functions decreases the performance of the application.

It is recommended to use a non-cryptographic hash function such as MurmurHash. This paper recommends constructing the hash functions as:

g_i(x) = h1(x) + i * h2(x)

where g_i(x) is the i-th derived hash function, h1 and h2 are two standard hash functions, and i is the iteration index ranging from 0 to k - 1. Each g_i(x) is then reduced modulo m to get a position in the array.

By using this technique, all k positions are derived from just two hash computations while keeping essentially the same filter performance (this only saves work when k > 2).
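A minimal sketch of this double-hashing scheme is shown below. Murmur is not in Python's standard library, so FNV-1a (written out by hand) and `zlib.crc32` are used here as stand-in non-cryptographic hashes; that choice, the function names, and forcing h2 to be odd are assumptions of this sketch rather than anything prescribed by the paper.

```python
import zlib

FNV_OFFSET = 0xCBF29CE484222325  # 64-bit FNV offset basis
FNV_PRIME = 0x100000001B3        # 64-bit FNV prime


def fnv1a_64(data):
    """64-bit FNV-1a, a simple non-cryptographic hash."""
    h = FNV_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


def positions(item, k, m):
    """Return k array positions using g_i(x) = h1(x) + i * h2(x) mod m."""
    data = item.encode("utf-8")
    h1 = fnv1a_64(data)
    # Assumption of this sketch: make h2 odd so that, when m is a power
    # of two, the k positions never collapse onto a single slot.
    h2 = zlib.crc32(data) | 1
    return [(h1 + i * h2) % m for i in range(k)]


# Only two hash computations are made, however large k is.
print(positions("A", k=5, m=1024))
```

The two base hashes are computed once per element, and each further position costs only one multiplication, one addition and one modulo.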
