Bloom Filter Implementation

问题

Using Bloom filter, we will be getting space optimization. The cassandra framework also has an implementation of Bloom Filter. But in detail, how is this space optimization achieved?

回答1:

A bloom filter isn't a "framework". It's really more like simply an algorithm. The implementation ain't very long.

Here's one in Java I've tried (.jar, source code and JavaDoc being all available):

"Stand alone Java implementations of Cuckoo Hashing and Bloom Filters" (you may want to Google for this in case the following link ain't working anymore):

http://lmonson.com/blog/?page_id=99

回答2:

You can understand how it saves space using this example : Lets say I work for Google, in the Chrome team, and I want to add a feature to the browser which notifies the user if the url he has entered is a malicious URL. So I have a dataset of about 1 million malicious URLs, the size of this file being around 25MB. Since the size is quite big, (big in comparison to the size of the browser itself), I store this data on a remote server.

Case 1 : I use a hash function with a hash table. I decide on an efficient hashing function, and run all the 1 million urls through the hashing function to get hash keys. I then make a hash table (an array), where the hash key would give me the index to place that URL. So now once I have hashed and filled the hashing table, I check its size. I have stored all 1 million URLs in the hash table along with they're keys. So the size is at least 25 MB. This hash table, due to its size will be stored on a remote server. When a user comes along and enters a url in the address bar, I need to check if it malicious. Thus I run the url through the hash function (the browser itself can do this) and I get a hash key for that URL. I now have to make a request to my remote server with that hash key, to check the if the particular URL in my hash table with that particular key, is the same as what the user has entered. If yes then it is malicious and if no then it is not malicious. Thus every time the user enters a URL, a request to the remote server has to be made to check if it is a malicious URL. This would take a lot of time and thus make my browser slow.

Case 2 : I use a bloom filter. The entire list of 1 million URLs are run through the bloom filter using multiple hash functions and the respective positions are marked as 1, in a huge array of 0s. Lets say we want a false positive rate of 1%, using a bloom filter calculator (http://hur.st/bloomfilter?n=1000000&p=0.01) , we get the size of the bloom filter required as only 1.13 MB. This small size is expected as, even though the size of the array is huge, we are only storing 1s or 0s and not the URLs as in case of the hash table.This array can be treated as a bit array. That is, since we have only two values 1 and 0, we can set individual bits instead of bytes. This would reduce the space taken by 8 times. This 1.13 MB bloom filter, due to its small size, can be stored in the web browser itself !! Thus when a user comes along and enters a URL, we simply apply the required hash functions (in browser itself), and check all the positions in the bloom filter (which is stored in the browser). A value of 0 in any of the positions tells us that this URL is DEFINITELY NOT in the list of malicious URLs and the user can proceed freely. Thus we did not make a call to the server and hence saved time. A value of 1 tells us that the url MIGHT be in the list of malicious URLS. In these cases we make a call to the remote server and over there we can use some other hash function with some hash table as in the first case to retrieve and check if the url is actually present. Since most of the times, a url is not likely to be a malicious one, the small bloom filter in the browser figures that out and hence saves time by avoiding calls to the remote server. Only in some cases, if the bloom filter tells us that the url MIGHT be malicious , only in those cases we make a call to the server. That 'MIGHT' is 99% right.

So by using a small bloom filter in the browser, we have saved a lot of time as we do not need to make server calls for every url entered.

回答3:

So I have seen this question before, and I used advice above and it turned out to be way to slow for me. So I wrote my own. It is not fully general, but I am sure if somebody is desperate for performance like I am they will make it more general by themselves :)

I used Murmur hash implementation that you can download here: http://d3s.mff.cuni.cz/~holub/sw/javamurmurhash/

The code: package uk.ac.cam.cl.ss958.SpringBoardSimulation;

    import ie.ucd.murmur.MurmurHash;

    import java.util.BitSet;
    import java.util.Random;

    public class FastBloomFilter {

        private final BitSet bs;

        final int [] hashSeeds;

        final int capacity;

        public FastBloomFilter(int slots, int hashFunctions) {
            bs = new BitSet(slots);
            Random r = new Random(System.currentTimeMillis());
            hashSeeds = new int[hashFunctions];
            for (int i=0; i<hashFunctions; ++i) {
                hashSeeds[i] = r.nextInt();
            }
            capacity = slots;
        }

        public void add(int value) {
            byte [] b = new byte[] {
                    (byte)(value >>> 24),
                    (byte)(value >>> 16),
                    (byte)(value >>> 8),
                    (byte)value};
            for (int i=0; i<hashSeeds.length; ++i) {
                int h = MurmurHash.hash32(b, 4, hashSeeds[i]);
                bs.set(Math.abs(h)%capacity, true);
            }
        }

        public void clear() {
            bs.clear();
        }

        public boolean mightContain(int value) {
            byte [] b = new byte[] {
                    (byte)(value >>> 24),
                    (byte)(value >>> 16),
                    (byte)(value >>> 8),
                    (byte)value};
            for (int i=0; i<hashSeeds.length; ++i) {
                int h = MurmurHash.hash32(b, 4, hashSeeds[i]);

                if(!bs.get(Math.abs(h)%capacity)) {
                    return false;


            }

            return true;
        }


        public static void main(String [] args) {
            FastBloomFilter bf = new FastBloomFilter(1000, 10);
            System.out.println("Query for 2000: " + bf.mightContain(2000));
            System.out.println("Adding 2000");
            bf.add(2000);
            System.out.println("Query for 2000: " + bf.mightContain(2000));


        }
    }

回答4:

You can use Bloom filter based on Redis server with Redisson lib. Based on 128-bits HighwayHash. Here is an example:

RBloomFilter<SomeObject> bloomFilter = redisson.getBloomFilter("sample");

// initialize bloom filter once with 
// expectedInsertions = 55000000
// falseProbability = 0.03
bloomFilter.tryInit(55000000L, 0.03);

bloomFilter.add(new SomeObject(someStateHere1));
bloomFilter.add(new SomeObject(someStateHere2));
// does it contain object?
bloomFilter.contains(new SomeObject(someStateHere3));

回答5:

I wrote a short post about implementing a bloom filter using Java 8 features, that I hope is relevant to the issue of space savings. I went a bit further to discuss how to bit slice a collection of bloom filters, when some information retrieval systems would do this, which is relevant to efficiencies when you have lots of bloom filters.

来源：https://stackoverflow.com/questions/4546447/bloom-filter-implementation

标签

java

algorithm

data-structures

space-complexity

bloom-filter