bloom-filter

Using hash functions with Bloom filters

Submitted by 岁酱吖の on 2019-11-30 16:13:55
Question: A Bloom filter uses a hash function (or several) to generate a value between 0 and m given an input string X. My question is how you use a hash function to generate a value in this way. For example, an MD5 hash is typically represented as a 32-character hex string; how would I use the MD5 hashing algorithm to generate a value between 0 and m, where I can specify m? I'm using Java at the moment, so an example of how to do this with the MessageDigest functionality it offers would be great, though just a…
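The asker wants Java, but the technique is language-agnostic, so here is a minimal sketch in Python's hashlib (Java's MessageDigest plus BigInteger.mod works the same way): interpret the digest bytes as one large unsigned integer and reduce it modulo m. The salt parameter is an illustrative addition, not from the original question; prefixing different salts turns one digest algorithm into many hash functions.

```python
import hashlib

def hash_to_range(x: str, m: int, salt: str = "") -> int:
    """Map a string to an integer in [0, m) by treating the MD5
    digest as a big-endian unsigned integer and reducing mod m."""
    digest = hashlib.md5((salt + x).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % m

# Different salts act as different hash functions for the filter.
i1 = hash_to_range("hello", 1000, salt="1")
i2 = hash_to_range("hello", 1000, salt="2")
```

In Java the equivalent is `new BigInteger(1, md.digest()).mod(BigInteger.valueOf(m))`.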

python bit array (performant)

Submitted by 耗尽温柔 on 2019-11-30 04:51:16
Question: I'm designing a Bloom filter and I'm wondering what the most performant bit-array implementation is in Python. The nice thing about Python is that it can handle arbitrary-length integers out of the box, and that's what I use now, but I don't know enough about Python internals to know whether that's the most performant way to do it. I found bitarray, but it handles a lot of other things like slicing, which I don't need. I only need the &, | and << operations. Answer 1: The built-in int is…

Need memory efficient way to store tons of strings (was: HAT-Trie implementation in java)

Submitted by ≯℡__Kan透↙ on 2019-11-29 22:18:37
I am working with a large set (5-20 million) of String keys (average length 10 chars) which I need to store in an in-memory data structure that supports the following operation in constant or near-constant time: // Returns true if the input is present in the container, false otherwise public boolean contains(String input) Java's HashMap is proving more than satisfactory as far as throughput is concerned but is taking up a lot of memory. I am looking for a solution that is memory-efficient and still supports decent throughput (comparable with, or nearly as good as, hashing).
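Not a HAT-trie, but a sketch of the space/throughput trade-off the question is weighing: a single sorted array of keys answers contains() in O(log n) with far less per-entry overhead than a hash map (no buckets, no entry objects), at the cost of giving up constant-time lookup. The class name is illustrative, not from the original post.

```python
import bisect

class SortedStringSet:
    """Memory-lean membership set: one sorted list of strings,
    with contains() answered by binary search in O(log n)."""

    def __init__(self, keys):
        # One flat sorted list; no per-key node or bucket overhead.
        self._keys = sorted(set(keys))

    def contains(self, key: str) -> bool:
        i = bisect.bisect_left(self._keys, key)
        return i < len(self._keys) and self._keys[i] == key
```

In Java the same idea is a sorted `String[]` queried with `Arrays.binarySearch`; for 5-20 million ten-character keys that is roughly the keys themselves plus one reference each.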

Opposite of Bloom filter?

Submitted by 有些话、适合烂在心里 on 2019-11-29 19:57:07
I'm trying to optimize a piece of software which is basically running millions of tests. These tests are generated in such a way that there can be some repetitions. Of course, I don't want to spend time running tests which I have already run if I can avoid it efficiently. So I'm thinking about using a Bloom filter to store the tests which have already been run. However, the Bloom filter errs on the unsafe side for me: it gives false positives. That is, it may report that I've run a test which I haven't. Although this could be acceptable in the scenario I'm working on, I was wondering if there's an…
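One way to flip the error direction, sketched under the assumption (mine, not the poster's) that forgetting a test and re-running it is safe, while falsely skipping one is not: a fixed-size table whose slots store the actual keys. Collisions overwrite, producing false negatives, but a membership answer of "seen" is always exact.

```python
import hashlib

class LossySet:
    """Fixed-size set that may forget entries (false negatives)
    but never lies about membership (no false positives): each
    slot stores the actual key, and collisions simply overwrite."""

    def __init__(self, slots: int):
        self._table = [None] * slots

    def _slot(self, key: str) -> int:
        h = hashlib.blake2b(key.encode("utf-8")).digest()
        return int.from_bytes(h[:8], "big") % len(self._table)

    def add(self, key: str) -> None:
        self._table[self._slot(key)] = key

    def contains(self, key: str) -> bool:
        # Exact string comparison: a hash collision can evict an
        # old key, but can never make an absent key look present.
        return self._table[self._slot(key)] == key
```

This is essentially a one-way cache; the more slots, the fewer tests are forgotten and re-run.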

Which hash functions to use in a Bloom filter

Submitted by 爱⌒轻易说出口 on 2019-11-28 21:26:30
I've got the following question about choosing hash functions for Bloom filters: which functions should I use? In nearly every document/paper you can read that the hash functions used in a Bloom filter should be independent and uniformly distributed. I know what is meant by this (independent and uniformly distributed), but I'm having trouble finding an argument or discussion about which hash functions fulfill those requirements and are therefore suitable. In a lot of posts I've read suggestions to use the FNV or Murmur hash functions, but not why (or at least without a proof) they…
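One common answer (mine, not from the post itself) is that you don't need k independent functions at all: the Kirsch-Mitzenmacher double-hashing construction derives all k index functions from two base hashes, g_i(x) = (h1(x) + i*h2(x)) mod m, with no measurable loss in false-positive rate. A sketch, taking the two base hashes from the two halves of a single MD5 digest:

```python
import hashlib

def indexes(key: str, k: int, m: int) -> list[int]:
    """Kirsch-Mitzenmacher double hashing: k filter indexes
    g_i = (h1 + i*h2) mod m from two 64-bit base hashes taken
    from the halves of one MD5 digest."""
    d = hashlib.md5(key.encode("utf-8")).digest()
    h1 = int.from_bytes(d[:8], "big")
    # Force h2 odd so the stride is non-zero and, for power-of-two
    # m, coprime with m (every slot remains reachable).
    h2 = int.from_bytes(d[8:], "big") | 1
    return [(h1 + i * h2) % m for i in range(k)]
```

With this scheme the choice of FNV vs. Murmur vs. MD5 matters mostly for speed and distribution quality, not for obtaining k functions.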

How many hash functions does my bloom filter need?

Submitted by 我的未来我决定 on 2019-11-28 17:37:36
Wikipedia says: An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution. I read the article, but what I don't understand is how k is determined. Is it a function of the table size? Also, in hash tables I've written I used a simple but effective algorithm for automatically growing the hash's size: basically, if ever more than 50% of the buckets in the table were filled, I would double the size of the table. I suspect you…
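For reference, k is not guessed: it follows from the bits-per-element ratio m/n. A sketch of the standard formulas, with symbols as in the Wikipedia quote above (m bits, n inserted elements, k hash functions):

```python
import math

def optimal_k(m: int, n: int) -> int:
    """k that minimizes false positives: k = (m/n) * ln 2."""
    return max(1, round((m / n) * math.log(2)))

def false_positive_rate(m: int, n: int, k: int) -> float:
    """Standard approximation: p = (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k

# Example: 10 bits per element gives k = 7 and roughly a 0.8% rate.
k = optimal_k(10_000, 1_000)        # -> 7
p = false_positive_rate(10_000, 1_000, k)
```

Unlike a hash table, a Bloom filter cannot be grown in place at 50% load: the bits don't carry enough information to rehash, so you either size it for the expected n up front or layer a scalable-Bloom-filter scheme on top.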

Modern, high performance bloom filter in Python?

Submitted by 狂风中的少年 on 2019-11-28 13:30:14
Question: I'm looking for a production-quality Bloom filter implementation in Python to handle fairly large numbers of items (say 100M to 1B items with a 0.01% false positive rate). Pybloom is one option, but it seems to be showing its age, as it regularly throws DeprecationWarning errors on Python 2.5. Joe Gregorio also has an implementation. Requirements are fast lookup performance and stability. I'm also open to creating Python interfaces to particularly good C/C++ implementations, or even to…

What is the advantage to using bloom filters?

Submitted by 我与影子孤独终老i on 2019-11-28 02:37:45
I am reading up on Bloom filters and they just seem silly. Anything you can accomplish with a Bloom filter, you could seemingly accomplish in less space, more efficiently, using a single hash function rather than several. Why would you use a Bloom filter, and how is it useful? From Wikipedia: Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries. Most of these require storing at least the data items themselves, which can require…
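The space claim can be made concrete: a Bloom filter's size depends only on the number of elements n and the target false-positive rate p, never on the size of the items themselves. A quick sketch of the arithmetic:

```python
import math

def bits_per_element(p: float) -> float:
    """Bloom filter space per stored element, independent of key
    size: m/n = -ln p / (ln 2)^2, about 1.44 * log2(1/p)."""
    return -math.log(p) / math.log(2) ** 2

# A 1% false-positive rate costs about 9.6 bits (1.2 bytes) per
# key, whether each key is 10 bytes or 10 kilobytes; a hash set
# must store the keys themselves, so its cost grows with key size.
cost = bits_per_element(0.01)
```

That constant, key-size-independent cost is what a single hash function cannot give you while still bounding the error rate.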
