bloom-filter | 易学教程

Why Does a Bloom Filter Need Multiple Hash Functions?

Question: I don't really understand why a bloom filter requires multiple hash functions (say, SHA and MD5). Why not just make a bigger SHA hash, for example, and then break it up into multiple parts and treat them as separate hashes? Isn't that more efficient in terms of speed?

Answer 1: The idea is to use several different but simple hash functions. If you're going to use some cryptographic hash function like SHA or MD5 then you could just vary the input to it. Whether it's more efficient depends how
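As a rough illustration of the two options mentioned here (slicing one large digest into parts, or varying the input to a single hash), a minimal Python sketch might look like the following. The function names, the 4-byte slice width, and the one-byte prefix are assumptions made for this example, not something the original thread specifies.

```python
import hashlib


def indices_from_one_digest(key: bytes, k: int, m: int) -> list:
    """Derive k bit positions in [0, m) by slicing a single SHA-256 digest."""
    digest = hashlib.sha256(key).digest()  # 32 bytes
    chunk = 4  # bytes per slice (assumed width for this sketch)
    if k * chunk > len(digest):
        raise ValueError("digest too short for k slices")
    return [
        int.from_bytes(digest[i * chunk:(i + 1) * chunk], "big") % m
        for i in range(k)
    ]


def indices_from_varied_input(key: bytes, k: int, m: int) -> list:
    """Derive k bit positions by hashing the key with k different one-byte prefixes."""
    if k > 256:
        raise ValueError("a one-byte prefix supports at most 256 variants")
    return [
        int.from_bytes(hashlib.sha256(bytes([i]) + key).digest()[:8], "big") % m
        for i in range(k)
    ]
```

The first variant hashes the key once and so does less work per key, but the number of slices is capped by the digest length. The second hashes the key k times and has no such cap.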

Near Duplicate Detection in Data Streams

Question: I am currently working on a streaming API that generates a lot of textual content. As expected, the API gives out a lot of duplicates, and we also have a business requirement to filter near-duplicate data. I did a bit of research on duplicate detection in data streams and read about Stable Bloom Filters. Stable Bloom Filters are data structures for duplicate detection in data streams with an upper bound on the false positive rate. But I want to identify near duplicates, and I also looked at

Bloomfilter and Cassandra = Why used and why hashed several times?

I read this: http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html My questions: 1.) Is it correct that Cassandra uses the bloom filter only to find out which SST (Sorted String Table) most likely contains the key? There might be several SSTs, and Cassandra does not know which SST a key might be in. So, to speed this up, instead of looking in all SSTs, bloom filters are used. Is this correct? (I am trying to understand how Cassandra works...) 2.) Why are keys hashed several times (as explained in the link above)? Is it correct that the keys need to be hashed with different Hash

Need memory efficient way to store tons of strings (was: HAT-Trie implementation in java)

Question: I am working with a large set (5-20 million) of String keys (average length 10 chars) which I need to store in an in-memory data structure that supports the following operation in constant time or near constant time:

    // Returns true if the input is present in the container, false otherwise
    public boolean contains(String input)

Java's HashMap is proving to be more than satisfactory as far as throughput is concerned but is taking up a lot of memory. I am looking for a solution that is memory

Bloom Filter Implementation

Using a Bloom filter, we get space optimization. The Cassandra framework also has an implementation of Bloom Filter. But in detail, how is this space optimization achieved?

A bloom filter isn't a "framework". It's really more like simply an algorithm. The implementation ain't very long. Here's one in Java I've tried (.jar, source code and JavaDoc all being available): "Stand alone Java implementations of Cuckoo Hashing and Bloom Filters" (you may want to Google for this in case the following link ain't working anymore): http://lmonson.com/blog/?page_id=99 You can understand how it
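One way to make the space saving concrete is the standard sizing formula for a Bloom filter: for n items and a target false-positive rate p, the bit count is about m = -n ln(p) / (ln 2)^2 and the number of hash functions is about k = (m / n) ln 2. The sketch below just evaluates those formulas. The function name `bloom_parameters` is made up for this example and is not taken from the linked Java implementation.

```python
import math


def bloom_parameters(n: int, p: float) -> tuple:
    """Return (m, k): bit count and hash count for n items at false-positive rate p."""
    if n <= 0 or not 0.0 < p < 1.0:
        raise ValueError("need n > 0 and 0 < p < 1")
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # total bits
    k = max(1, round(m / n * math.log(2)))                # hash functions
    return m, k


m, k = bloom_parameters(1_000_000, 0.01)
bits_per_item = m / 1_000_000  # a little under 10 bits per item at p = 1%
```

Under these formulas the cost per item depends only on p, not on how long the keys are. That is where the saving comes from compared with a structure that stores the keys themselves.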

How to map hashfunction output to bloomfilter indices?

Question: Can anyone help me by providing an outline on how the hash function output is mapped to bloom filter indices? Here is an overview on bloomfilters.

Answer 1: "an outline on how the hash function output is mapped to a bloom filter indices" Each of the k hash functions in use maps onto a bit in the bloom filter, just as hashes map onto hash buckets in a hash table. So, very commonly you might have, say, a hash function generating 32-bit integers, then use the modulus % operator to get a bit index 0 <= i < n where n is the number of bits in your bloom filter. To make this very concrete, let's say a
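The mapping described in this answer can be sketched in Python as follows. This is only an illustration under stated assumptions: the class and method names are invented here, and the k "different" 32-bit hashes are produced by salting BLAKE2b with the function number, which is a convenient stand-in rather than the hash family the answer had in mind.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: k 32-bit hashes, each reduced to a bit index with %."""

    def __init__(self, n_bits: int, k: int):
        self.n_bits = n_bits
        self.k = k
        self.bits = bytearray((n_bits + 7) // 8)  # n_bits bits, all zero

    def _hash32(self, key: bytes, i: int) -> int:
        # Assumption: salting one hash with i stands in for "the i-th hash function".
        digest = hashlib.blake2b(key, digest_size=4, salt=bytes([i])).digest()
        return int.from_bytes(digest, "big")  # value in [0, 2**32)

    def _indices(self, key: bytes) -> list:
        # The modulus maps each 32-bit value to a bit index i with 0 <= i < n_bits.
        return [self._hash32(key, i) % self.n_bits for i in range(self.k)]

    def add(self, key: bytes) -> None:
        for idx in self._indices(key):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(
            self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indices(key)
        )
```

A key that was added always reports `True`. A key that was never added usually reports `False`, but can report `True` when all k of its bits happen to have been set by other keys.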

Bloom filter usage

I am struggling to understand the usefulness of the bloom filter. I get its underlying logic, space compaction, fast lookups, false positives etc. I just cannot put that concept into a real-life situation as being beneficial. One frequent application is use of bloom filters in web caching. We use bloom filter to determine whether a given URL is in the cache or not. Why don't we simply access the cache to determine that? If we get a yes, we still need to go to cache to retrieve the webpage (which might not be there), but in case of a no, we could have got the same answer using the cache (which

python bit array (performant)

I'm designing a bloom filter and I'm wondering what the most performant bit array implementation is in Python. The nice thing about Python is that it can handle arbitrary length integers out of the box and that's what I use now, but I don't know enough about Python internals to know if that's the most performant way to do it in Python. I found bitarray but it handles a lot of other things like slicing, which I don't need. I only need the &, | and << operations.

The built-in int is pretty nicely optimized, and it already supports &, |, and <<. There's at least one alternative
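For reference, a bit array backed by a single built-in int, using only the `&`, `|` and `<<` operations mentioned in the question, could be sketched like this. The class name `IntBitArray` and its methods are hypothetical and chosen only for this example.

```python
class IntBitArray:
    """Fixed-size bit array stored in one arbitrary-precision int."""

    def __init__(self, size: int):
        self.size = size
        self.value = 0  # bit i of this int is bit i of the array

    def _check(self, i: int) -> None:
        if not 0 <= i < self.size:
            raise IndexError(i)

    def set(self, i: int) -> None:
        self._check(i)
        self.value |= 1 << i  # builds a new int; Python ints are immutable

    def test(self, i: int) -> bool:
        self._check(i)
        return bool(self.value & (1 << i))
```

One design point to keep in mind: because ints are immutable, every `set` call creates a new integer object the size of the whole array, whereas a `bytearray` is modified in place. Whether that matters depends on the filter size and the ratio of writes to reads, so it is worth measuring for the intended workload.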